[Rd] Lightweight data frame class

Vadim Ogranovich vograno at evafunds.com
Fri Nov 26 00:31:07 CET 2004


Hi,
 
As far as I can tell, the data.frame class adds two features to those of
lists:
* matrix structure via the [,] and [,]<- operators (well, I know these are
actually "["(i, j, ...), not "[,]").
* a row names attribute.
 
It seems that the overhead of supporting row names, both in computation
and in RAM, is non-trivial. I frequently subscript data frames with
[,], and my timing shows that the equivalent list operation is about
7 times faster; see the P.S. below.
 
On the other hand, at least in my usage pattern, I rarely benefit from
the row names attribute, so as far as I am concerned row names are just
overhead. (Of course, the speed difference may be due to other factors;
all I can tell is that subscripting is much slower on data frames than
on lists.)
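
The timings in the P.S. only cover speed. The RAM side could be checked
along the following lines. This is only a sketch: it assumes that
object.size() from the utils package is a fair proxy for memory use, the
column names V1..V5 and the smaller n are arbitrary choices, and the
result will depend on the R version.

```r
# Build the same columns once as a plain list and once as a data.frame.
n <- 1e5
x <- lapply(seq_len(5), function(k) rnorm(n))
names(x) <- paste0("V", seq_len(5))   # arbitrary column names
y <- as.data.frame(x)

# object.size() includes attributes, so row names count towards size.df.
size.list <- object.size(x)
size.df <- object.size(y)
print(size.list)
print(size.df)

# Relative memory cost of the data.frame representation.
ratio <- as.numeric(size.df) / as.numeric(size.list)
print(ratio)
```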
 
I thought of writing a new class, say lightweight.data.frame, that would
be polymorphic with the existing data.frame class. The class would
inherit from "list" and implement the [,] and [,]<- operators. It would
also implement the "rownames" function, which would return
seq(nrow(x)), etc. It should also implement as.data.frame, so that the
conversion to a full-blown data.frame in calls like lm(y ~ x,
data=myLightweightDataframe) does not add overhead.
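
To make the idea concrete, here is a minimal S3 sketch of such a class.
It is an illustration under stated assumptions, not a finished design:
the constructor name lwdf is hypothetical, only the x[i, j] and x[j]
forms of "[" are handled, nothing is dropped to a vector the way
data.frame does for a single column, and [,]<- is left out entirely.

```r
# Hypothetical constructor: wrap equal-length columns in a classed list.
lwdf <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  stopifnot(length(cols) > 0L, all(lens == lens[1L]))
  structure(cols, class = c("lightweight.data.frame", "list"))
}

# Dimensions are derived from the columns; nothing is stored per row.
dim.lightweight.data.frame <- function(x) {
  c(if (length(x)) length(x[[1L]]) else 0L, length(x))
}

# Row names are computed on demand instead of being kept as an attribute.
row.names.lightweight.data.frame <- function(x) seq_len(nrow(x))
dimnames.lightweight.data.frame <- function(x) {
  list(as.character(row.names(x)), names(x))
}

# x[i, j]: subset columns by j, then every remaining column by i.
# x[j] (a single index) selects columns, as it does for a list.
"[.lightweight.data.frame" <- function(x, i, j, ...) {
  cols <- unclass(x)
  if (nargs() < 3L) {
    if (missing(i)) return(x)
    return(structure(cols[i], class = class(x)))
  }
  if (!missing(j)) cols <- cols[j]
  if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  structure(cols, class = class(x))
}

# Conversion only adds the attributes a data.frame needs; no column is copied
# element by element.
as.data.frame.lightweight.data.frame <- function(x, row.names = NULL,
                                                 optional = FALSE, ...) {
  n <- nrow(x)
  cols <- unclass(x)
  attr(cols, "row.names") <- seq_len(n)
  class(cols) <- "data.frame"
  cols
}

# Usage
d <- lwdf(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
d.sub <- d[-1, ]                      # drops the first row of every column
fit <- lm(y ~ x, data = as.data.frame(d))
```

The row subscript here is just an lapply over the columns, i.e. the same
list operation that is timed in the P.S.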
 
Has anyone thought of this? Can you see any potential problems?
 
Thanks,
Vadim
 
 
 
P.S. These are the timing results comparing data.frame operations with
the equivalent list operations:

# make a 1e6 * 5 list
> system.time(x <- lapply(seq(5), function(x) rnorm(1e6)))
[1] 4.46 0.10 4.57 0.00 0.00
# convert it to a data.frame
> system.time(y <- as.data.frame(x))
[1] 49.17  1.25 50.61  0.00  0.00
# do an equivalent of y[-1,] on the list
> i <- seq(2, nrow(y)); system.time(x.sub <- lapply(x, function(x) x[i]))
[1] 0.19 0.15 0.35 0.00 0.00
# do an equivalent of y[-1,] on the data.frame
> i <- seq(2, nrow(y)); system.time(y.sub <- y[i,])
[1] 2.08 0.56 2.64 0.00 0.00
# ratio of the elapsed times (third column): data.frame vs. list
> 2.64/0.35
[1] 7.542857


