[R] Quicker way of combining vectors into a data.frame

Marc Schwartz marc_schwartz at comcast.net
Thu Nov 30 20:41:49 CET 2006


On Thu, 2006-11-30 at 19:26 +0000, Gavin Simpson wrote:
> On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
> 
> Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
> for your comments and suggestions.
> 
> I noticed that two of the vectors were named and so I removed the names
> (names(vec) <- NULL) and that pushed the execution time for the function
> from c. 40 seconds to c. 115 seconds and all the time was taken within
> the data.frame(...) call. So having names *on* some of the vectors
> seemed to help things along, which was the opposite of what I had
> expected.
> 
> If I use the cbind method of Marc, then the execution time for the
> function drops to c. 1 second (most of which is in the calculation of
> one of the vectors). So I guess I can work round this now.
> 
> What I find interesting is that:
> 
> test.dat <- rnorm(4471)
> > system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
> 
> Whereas doing exactly the same thing with different data in the function
> gives the following timings:
> 
> system.time(fab <- data.frame(lc.ratio, Q,
> +                      fNupt,
> +                      rho.n, rho.s,
> +                      net.Nimm,
> +                      net.Nden,
> +                      CLminN,
> +                      CLmaxN,
> +                      CLmaxS))
> [1] 173.415   0.260 192.192   0.000   0.000
> 
> Most of that was without a change in memory, but towards the end for c.
> 5 seconds memory use by R increased by 200-300 MB.
> 
> and...
> 
> > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> +                      fNupt = fNupt,
> +                      rho.n = rho.n, rho.s = rho.s,
> +                      net.Nimm = net.Nimm,
> +                      net.Nden = net.Nden,
> +                      CLminN = CLminN,
> +                      CLmaxN = CLmaxN,
> +                      CLmaxS = CLmaxS))
> [1]  99.966   0.140 114.091   0.000   0.000
> 
> Again with a slight increase in memory usage in last 5 seconds. So now,
> having stripped the names of two of the vectors (so now all are
> un-named), the un-named version of the data.frame call is almost twice
> as slow as the named data.frame call.
> 
> If I leave the names on the two vectors that had them, I get the
> following timings for those same calls
> 
> > system.time(fab <- data.frame(lc.ratio, Q,
> +                      fNupt,
> +                      rho.n, rho.s,
> +                      net.Nimm,
> +                      net.Nden,
> +                      CLminN,
> +                      CLmaxN,
> +                      CLmaxS))
> [1]  96.234   0.244 101.706   0.000   0.000
> 
> > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> +                      fNupt = fNupt,
> +                      rho.n = rho.n, rho.s = rho.s,
> +                      net.Nimm = net.Nimm,
> +                      net.Nden = net.Nden,
> +                      CLminN = CLminN,
> +                      CLmaxN = CLmaxN,
> +                      CLmaxS = CLmaxS))
> [1] 13.597  0.088 15.868  0.000  0.000
> 
> So having the 2 named vectors and using the named version of the
> data.frame call is the fastest combination.
> 
> This is all done within the debugger at the time when I would be
> generating fab, and if I do,
> 
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
> 
> (as above) at this point in the debugger it is exceedingly quick.
> 
> I just don't understand what is going on with data.frame.
> 
> I have yet to try Prof. Ripley's suggestion of being a bit naughty with
> R - I'll see if that is any quicker.
> 
> Once again, thanks to you all for your suggestions.

Gavin,

Can you post the results of:

  str(fab)

and

  str(lc.ratio)
  str(Q)
  str(fNupt)
  str(rho.n)
  str(rho.s)
  str(net.Nimm)
  str(net.Nden)
  str(CLminN)
  str(CLmaxN)
  str(CLmaxS)

This is taking way too long. There is either something about one or more
of these objects that is more complex than just being simple vectors, or
there is something corrupt in your R session/environment.
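
As a rough sketch of the kind of check that could narrow down the first possibility: the objects below are made-up stand-ins rather than the real data, and check_object() is a hypothetical helper, not part of R. It reports, for each input, the class, length, whether it has dimensions, and which attributes it carries, so that anything that is not a plain vector stands out.

```r
# Hypothetical stand-ins for the real inputs (assumption: not the actual data).
# The second carries names; the third is a one-column matrix, not a vector.
objs <- list(
  lc.ratio = rnorm(10),
  Q        = setNames(rnorm(10), paste0("site", 1:10)),
  fNupt    = matrix(rnorm(10), ncol = 1)
)

# Hypothetical helper: summarise the structure of one object in a single row.
check_object <- function(x) {
  data.frame(
    class      = paste(class(x), collapse = "/"),
    length     = length(x),
    has_dim    = !is.null(dim(x)),
    attributes = paste(names(attributes(x)), collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# One row per object, labelled by the object's name.
report <- do.call(rbind, lapply(objs, check_object))
print(report)
```

A row with has_dim equal to TRUE, or with anything in the attributes column, marks an object worth a closer look with str().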

You might want to try running a new and clean R session using:

  R --vanilla

and then re-run your code to see if that changes anything.  If so, it
suggests that my latter idea may be in play.
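
One possible way to make that comparison, sketched under assumptions: the two vectors below are invented stand-ins, and the file name is arbitrary. The idea is to save the inputs from the slow session, load them into the clean one, and time the same data.frame() call there. tempdir() is used only so the sketch runs in a single session; across two sessions a fixed path would be needed.

```r
# Hypothetical stand-ins for the real vectors (assumption: not the actual data).
lc.ratio <- rnorm(4471)
Q        <- rnorm(4471)

# In the slow session: write the inputs to a file.
# The file name is an assumption; use a fixed path when crossing sessions.
tmp <- file.path(tempdir(), "fab-inputs.RData")
save(lc.ratio, Q, file = tmp)

# In the clean session (started with R --vanilla): load the inputs
# and time the same call again.
rm(lc.ratio, Q)
load(tmp)
timing <- system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q))
print(timing)
```

If the call is fast on the reloaded objects in the clean session but slow in the original one, that points at the session rather than at the objects.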

HTH,

Marc



