[R] Why are Split and Tapply so slow with named vectors, why is a for loop faster than mapply

Brian.J.GREGOR@odot.state.or.us Brian.J.GREGOR at odot.state.or.us
Thu Apr 8 23:12:39 CEST 2004

First, here's the problem I'm working on so you understand the context. I
have a data frame of travel activity characteristics with 70,000+ records.
These activities are identified by unique chain numbers. (Activities are
part of trip chains.) There are 17,500 chains. 

I use the chain numbers as factors to split various data fields into lists
of chain characteristics with each element of the list representing one
chain. For example:

> betaHomeDist[1:3]
     1316      2319      2317      1364      1316 
 0.000000 14.930820 24.431210  6.174959  0.000000 

     1316      2319      2319      1364      1316 
 0.000000 14.930820 14.930820  6.174959  0.000000 

     1316      1364      2324      1364      1316 
 0.000000  6.174959 14.392375  6.174959  0.000000 

Where each element of the list is a named vector. Each vector element is
named with the zone that the activity occurred within. I use these names in
subsequent computations.
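
To make the structure concrete, here is a minimal sketch with toy data. The
vectors homeDist, zone, and chain and their values are hypothetical stand-ins,
not the real survey fields, and the sketch assumes a current R version in
which split keeps the names of a short named vector.

```r
# Toy stand-ins (hypothetical values, not the real survey data)
homeDist <- c(0, 14.9, 24.4, 0, 0, 6.2, 0)
zone     <- c("1316", "2319", "2317", "1316", "1316", "1364", "1316")
chain    <- c(1, 1, 1, 1, 2, 2, 2)

# Attach zone names, then split by chain: one named vector per chain
names(homeDist) <- zone
toyList <- split(homeDist, chain)

# The names can then drive later lookups, e.g. all activities in zone 1316
inHomeZone <- lapply(toyList, function(x) x[names(x) == "1316"])
```

Each element of toyList is a numeric vector whose names are the zones of the
activities in that chain.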

What I've found, however, is that it is not easy (or I have not found the
easy way) to split a named vector into a list that retains the vector names.
For example, splitting an unnamed vector (70,000+) based on the chain
numbers takes very little time:
> system.time(actTimeList <- split(actTime, chainId))
[1] 0.16 0.00 0.15   NA   NA

But if the vector is named, R will work for minutes and still not complete
the job:
> names(actTime) <- zoneNames
> system.time(actTimeList <- split(actTime, chainId))
Timing stopped at: 83.22 0.12 84.49 NA NA

The same thing happens when using tapply with a named vector, such as:
tapply(actTime, chainId, function(x) x)

Using the following function with a for loop accomplishes the job in a few
seconds for all 70,000+ records: 
> splitWithNames <- function(dataVector, nameVector, factorVector){
+     dataList <- split(dataVector, factorVector)
+     nameList <- split(nameVector, factorVector)
+     listLength <- length(dataList)
+     namedDataList <- list(NULL)
+     for(i in 1:listLength){
+         x <- dataList[[i]]
+         names(x) <- nameList[[i]]
+         namedDataList[[i]] <- x
+         }
+     namedDataList
+     }
> system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
[1] 8.04 0.00 9.03   NA   NA
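
One possible loop-free variant of the same idea, sketched under the
assumption of a current R version (seq_along is newer than R 1.8.1): split
the positions instead of the values, so that split never has to handle names,
then index and name each piece. The function name splitByIndex is
hypothetical, and this has not been timed against the data above.

```r
splitByIndex <- function(dataVector, nameVector, factorVector) {
    # Split positions rather than values, so split() never sees the names
    idxList <- split(seq_along(dataVector), factorVector)
    # Build each chain's named vector from its positions
    lapply(idxList, function(i) {
        x <- dataVector[i]
        names(x) <- nameVector[i]
        x
    })
}

# Toy usage with hypothetical values
res <- splitByIndex(c(0, 14.9, 0, 6.2, 0),
                    c("1316", "2319", "1316", "1364", "1316"),
                    c(1, 1, 2, 2, 2))
```

Unlike splitWithNames, the result also keeps the chain numbers as the names
of the list elements, because lapply carries over the names of idxList.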

However, if I rewrite the function to use mapply instead of a for loop, it
again takes a long (undetermined) amount of time to complete. Here are the
results for just 5000 and 10000 records. Doubling the number of records
increases the time roughly 3.6-fold (2.99 to 10.64 seconds), so the run time
grows faster than linearly:
> testfun <- function(dataVector, nameVector, factorVector){
+     dataList <- split(dataVector, factorVector)
+     nameList <- split(nameVector, factorVector)
+     nameFun <- function(x, xNames){
+         names(x) <- xNames
+         x
+         }
+     mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
+     }
> system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
[1] 2.99 0.00 2.98   NA   NA
> system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
[1] 10.64  0.00 10.64    NA    NA
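
One way to examine the scaling more systematically is to time the mapply
version at several sizes on synthetic data. The sketch below assumes a
current R version; makeData and nameWithMapply are hypothetical helpers, the
data are random stand-ins with about 4 activities per chain (70,000 records
over 17,500 chains), and it uses SIMPLIFY=FALSE so the result is always a
list. Timings on a modern machine may be too small to show the effect
reported above.

```r
# Synthetic stand-ins for actTime, zoneNames, chainId (hypothetical data)
makeData <- function(n, actsPerChain = 4) {
    chains <- rep(seq_len(ceiling(n / actsPerChain)), each = actsPerChain)
    list(value = runif(n),
         zone  = as.character(sample(1000:3000, n, replace = TRUE)),
         chain = chains[seq_len(n)])
}

nameWithMapply <- function(d) {
    dataList <- split(d$value, d$chain)
    nameList <- split(d$zone, d$chain)
    # SIMPLIFY = FALSE always returns a list, even if all chains are equal length
    mapply(function(x, nm) { names(x) <- nm; x },
           dataList, nameList, SIMPLIFY = FALSE)
}

sizes <- c(5000, 10000, 20000)
elapsed <- sapply(sizes, function(n) {
    d <- makeData(n)
    system.time(nameWithMapply(d))[["elapsed"]]
})

# Ratio of successive timings: near 2 suggests linear growth, near 4 quadratic
ratios <- elapsed[-1] / elapsed[-length(elapsed)]
```

If the elapsed times are near zero, the ratios are not meaningful and larger
sizes are needed.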

My problem is solved for now with the home-brew splitWithNames function, but
I'm curious about why named vectors slow down split and tapply so much, and
why a function using mapply is so much slower than one that uses a for loop.

My computer is an 800+ MHz Pentium III with 512 MB of memory. The operating
system is Windows NT 4.0. My R version is 1.8.1.

Thank you.

Brian Gregor, P.E.
Transportation Planning Analysis Unit
Oregon Department of Transportation
Brian.J.GREGOR at odot.state.or.us
(503) 986-4120
