[R] avoiding too many loops - reshaping data

Bert Gunter gunter.berton at gene.com
Thu Nov 4 05:51:19 CET 2010


Beware of facile comparisons of this sort -- they may be apples and nematodes.

I cannot speak to the others, but (1) tapply does not yield a data
frame and (2) tapply actually **is** an (efficient, disguised) loop (at
the interpreter level, essentially). I suspect what makes it so much
faster is that (a) it avoids the overhead of setting up the careful data
structures that the others provide and (b) the underlying summarizing
function is sum(), which does its work at the C level, not the
interpreted level. If it were a user function -- maybe
mysum <- function(x) sum(x) -- I suspect the discrepancy might not be
so large (try it!).
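
A minimal sketch of how one might check both points, assuming mydf as
defined in the original question further down. mysum is only the
hypothetical wrapper suggested above, the names res, res_df, t_builtin
and t_wrapped are made up for illustration, and the timings will vary
with machine and R version, so none are shown here.

```r
# Data frame from the original question in this thread.
mydf <- data.frame(
  city  = rep(c("a", "b"), each = 8),
  brand = c("x", "x", "y", "y", "z", "z", "z", "z",
            "x", "x", "x", "y", "y", "y", "z", "z"),
  value = c(1, 2, 11, 12, 111, 112, 113, 114,
            3, 4, 5, 13, 14, 15, 115, 116)
)

# Point (1): tapply() returns a matrix with dimnames, not a data frame.
res <- tapply(mydf$value, mydf[c("city", "brand")], sum)
print(is.data.frame(res))
print(is.matrix(res))

# One possible conversion if a data frame is really what is wanted.
res_df <- as.data.frame(res)
res_df <- cbind(city = rownames(res_df), res_df)
rownames(res_df) <- NULL
print(res_df)

# Point (2): same computation, but the summary function is now an
# interpreted wrapper around sum() (hypothetical, for comparison only).
mysum <- function(x) sum(x)
res_wrapped <- tapply(mydf$value, mydf[c("city", "brand")], mysum)

t_builtin <- system.time(
  for (i in 1:1000) tapply(mydf$value, mydf[c("city", "brand")], sum)
)
t_wrapped <- system.time(
  for (i in 1:1000) tapply(mydf$value, mydf[c("city", "brand")], mysum)
)

# Elapsed seconds for each variant; compare on your own machine.
print(c(builtin = t_builtin[["elapsed"]], wrapped = t_wrapped[["elapsed"]]))
```

With only six city/brand cells the wrapper is called just six times per
tapply call, so any difference may well be small on data this size.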

Naturally, I am prepared to be instructed and corrected on this either
by you or someone wiser on these matters.

-- Bert

On Wed, Nov 3, 2010 at 3:16 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> Here is the summary of methods. tapply is the fastest!
>
> library(reshape)
>
> system.time(for(i in 1:1000)cast(melt(mydf, measure.vars = "value"),
> city ~ brand,fun.aggregate = sum))
>    user  system elapsed
>   18.40    0.00   18.44
>
> library(reshape2)
> system.time(for(i in 1:1000)dcast(mydf,city ~ brand, sum))
>  user  system elapsed
>  12.36    0.02   12.37
>
>
> system.time(for(i in 1:1000)xtabs(value ~ city + brand, mydf))
>    user  system elapsed
>    2.45    0.00    2.47
>
>
> system.time(for(i in 1:1000)tapply(mydf$value,mydf[c('city','brand')],sum))
>    user  system elapsed
>    0.78    0.00    0.79
>
> Dimitri
>
>
> On Wed, Nov 3, 2010 at 4:32 PM, Henrique Dallazuanna <wwwhsd at gmail.com> wrote:
>> Try this:
>>
>>  xtabs(value ~ city + brand, mydf)
>>
>> On Wed, Nov 3, 2010 at 6:23 PM, Dimitri Liakhovitski
>> <dimitri.liakhovitski at gmail.com> wrote:
>>>
>>> Hello!
>>>
>>> I have a data frame like this one:
>>>
>>>
>>> mydf<-data.frame(city=c("a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b"),
>>>  brand=c("x","x","y","y","z","z","z","z","x","x","x","y","y","y","z","z"),
>>>  value=c(1,2,11,12,111,112,113,114,3,4,5,13,14,15,115,116))
>>> (mydf)
>>>
>>> What I need to get is a data frame like the one below - cities as
>>> rows, brands as columns, and the sums of the "value" within each
>>> city/brand combination in the body of the data frame:
>>>
>>> city   x    y    z
>>> a      3   23  450
>>> b     12   42  231
>>>
>>>
>>> I have written a code that involves multiple loops and subindexing -
>>> but it's taking too long.
>>> I am sure there must be a more efficient way of doing it.
>>>
>>> Thanks a lot for your hints!
>>>
>>>
>>> --
>>> Dimitri Liakhovitski
>>> Ninah Consulting
>>> www.ninah.com
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>> Henrique Dallazuanna
>> Curitiba-Paraná-Brasil
>> 25° 25' 40" S 49° 16' 22" O
>>
>
>
>
> --
> Dimitri Liakhovitski
> Ninah Consulting
> www.ninah.com
>



-- 
Bert Gunter
Genentech Nonclinical Biostatistics


