[R] Quicker way of combining vectors into a data.frame

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Fri Dec 1 12:13:11 CET 2006


Gavin Simpson wrote:
> [ Resending to the list as I fell foul of the too many recipients rule ]
>
> On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
>
> Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
> for your comments and suggestions.
>
> I noticed that two of the vectors were named and so I removed the names
> (names(vec) <- NULL) and that pushed the execution time for the function
> from c. 40 seconds to c. 115 seconds and all the time was taken within
> the data.frame(...) call. So having names *on* some of the vectors
> seemed to help things along, which was the opposite of what I had
> expected.
>
> If I use the cbind method of Marc, then the execution time for the
> function drops to c. 1 second (most of which is in the calculation of
> one of the vectors). So I guess I can work round this now.
>
> What I find interesting is that:
>
> test.dat <- rnorm(4471)
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
>
> Whereas doing exactly the same thing with different data in the function
> gives the following timings:
>
> system.time(fab <- data.frame(lc.ratio, Q,
> +                      fNupt,
> +                      rho.n, rho.s,
> +                      net.Nimm,
> +                      net.Nden,
> +                      CLminN,
> +                      CLmaxN,
> +                      CLmaxS))
> [1] 173.415   0.260 192.192   0.000   0.000
>
> Most of that time passed without a change in memory use, but for c. 5
> seconds towards the end, memory use by R increased by 200-300 MB.
>
> and...
>
> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> +                      fNupt = fNupt,
> +                      rho.n = rho.n, rho.s = rho.s,
> +                      net.Nimm = net.Nimm,
> +                      net.Nden = net.Nden,
> +                      CLminN = CLminN,
> +                      CLmaxN = CLmaxN,
> +                      CLmaxS = CLmaxS))
> [1]  99.966   0.140 114.091   0.000   0.000
>
> Again with a slight increase in memory usage in the last 5 seconds. So,
> with the names stripped from two of the vectors (so that all are now
> un-named), the un-named version of the data.frame call is almost twice
> as slow as the named data.frame call.
>
> If I leave the names on the two vectors that had them, I get the
> following timings for those same calls
>
> system.time(fab <- data.frame(lc.ratio, Q,
> +                      fNupt,
> +                      rho.n, rho.s,
> +                      net.Nimm,
> +                      net.Nden,
> +                      CLminN,
> +                      CLmaxN,
> +                      CLmaxS))
> [1]  96.234   0.244 101.706   0.000   0.000
>
> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> +                      fNupt = fNupt,
> +                      rho.n = rho.n, rho.s = rho.s,
> +                      net.Nimm = net.Nimm,
> +                      net.Nden = net.Nden,
> +                      CLminN = CLminN,
> +                      CLmaxN = CLmaxN,
> +                      CLmaxS = CLmaxS))
> [1] 13.597  0.088 15.868  0.000  0.000
>
> So having the 2 named vectors and using the named version of the
> data.frame call is the fastest combination.
>
> This is all done within the debugger at the time when I would be
> generating fab, and if I do,
>
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
>
> (as above) at this point in the debugger it is exceedingly quick.
>
> I just don't understand what is going on with data.frame.
>
>   
I think there is something about the data you're not telling us...

Could you e.g. do something like

str(data.frame(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))


and

str(list(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))
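
As a rough sketch of what that comparison can show, here is the same pair
of calls on made-up vectors standing in for yours. The names plain, named
and inspect are hypothetical, and nothing here is known about your actual
data; the point is only that str() on the list shows each object as it
really is, attributes included, while str() on the data frame shows what
data.frame() made of it.

```r
# Hypothetical stand-ins for the real vectors, which are not shown in the thread.
set.seed(1)
n <- 5
plain <- rnorm(n)                                         # ordinary unnamed numeric vector
named <- setNames(rnorm(n), paste0("site", seq_len(n)))   # carries a names attribute

# str() on the list shows each component as it is, attributes included.
str(list(plain, named))

# str() on the data frame shows what data.frame() made of the same objects.
df <- data.frame(plain, named)
str(df)

# Per-object summary of class, length and attribute names (helper is illustrative).
inspect <- function(x) {
  list(class = class(x), length = length(x), attrs = names(attributes(x)))
}
info <- lapply(list(plain = plain, named = named), inspect)
print(info)

# The names of the named vector end up as the row names of the data frame.
print(rownames(df))
```

In this toy case the only difference between the two vectors is the names
attribute, which data.frame() picks up as row names. Whether anything
similar is behind your timings cannot be said without seeing the real
str() output.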





-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907



