[Rd] Light-weight data.frame class: was: how to add method to .Primitive function

Gabor Grothendieck ggrothendieck at gmail.com
Tue May 10 06:45:50 CEST 2005


"[.default" is implemented in R as .subset.  See ?.subset and note that
it begins with a dot.  e.g. for the case where i and j are not missing:

"[.lwdf" <- function(x, i, j) lapply(.subset(x,j), "[", i)



On 5/8/05, Vadim Ogranovich <vograno at evafunds.com> wrote:
> Hi,
> 
> Encouraged by a tip from Simon Urbanek I tried to use the S3 machinery
> to write a faster version of the data.frame class.
> This quickly hits a snag: the "[.default"(x, i) for some reason cares
> about the dimensionality of x.
> In the end there is a full transcript of my R session. It includes the
> motivation for writing the class and the problems I have encountered.
> 
> As a result I see three issues here:
> * why "[.default"(x, i) doesn't work if dim(x) is 2? After all a single
> subscript into a vector works regardless of whether it's a matrix or
> not. Is there an alternative way to access "[.default"?
> * why does unclass() make deep copy? This is a facet of the global
> over-conservatism of R with respect to copying.
> * is it possible to add some sort copy profiling to R? Something like
> copyProfiling(TRUE), which should cause R to log sizes of each copied
> object (just raw sizes w/o any attempt to identify the object). This
> feature should at least help assess the magnitude of the problem.
> 
> Thanks,
> Vadim
> 
> Now the transcript itself:
> > # the motivation: subscription of a data.frame is *much* (almost 20
> times) slower than that of a list
> > # compare
> > n = 1e6
> > i = seq(n)
> >
> > x = data.frame(a=seq(n), b=seq(n))
> > system.time(x[i,], gcFirst=TRUE)
> [1] 1.01 0.14 1.14 0.00 0.00
> >
> > x = list(a=seq(n), b=seq(n))
> > system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
> [1] 0.06 0.00 0.06 0.00 0.00
> >
> >
> > # the solution: define methods for the light-weight data.frame class
> > lwdf = function(...) structure(list(...), class = "lwdf")
> >
> > # dim
> > dim.lwdf = function(x) c(length(x[[1]]), length(x))
> >
> > # for pretty printing we define print.lwdf via a conversion to
> data.frame
> > # as.data.frame.lwdf
> > as.data.frame.lwdf = function(x) structure(unclass(x),
> class="data.frame", row.names=as.character(seq(nrow(x))))
> >
> > # print
> > print.lwdf = function(x) print.data.frame(as.data.frame.lwdf(x))
> >
> > # now the real stuff
> >
> > # "["
> > # the naive "[.lwdf" = function (x, i, j) lapply(x[j], function(col)
> col[i])
> > # won't work because evaluation of x[j] calls "[.lwdf" again and not
> "[.default"
> > # so we switch by the number of arguments
> > "[.lwdf" = function (x, i, j) {
> +   if (nargs() == 2)
> +     NextMethod("[", x, i)
> +   else
> +     structure(lapply(x[j], function(col) col[i]),  class = "lwdf")
> + }
> >
> > x = lwdf(a=seq(3), b=letters[seq(3)], c=as.factor(letters[seq(3)]))
> > i = c(1,3); j = c(1,3)
> >
> > # unfortunately, for some reasons "[.default" cares about
> dimensionality of its argument
> > x[i,j]
> Error in "[.default"(x, j) : incorrect number of dimensions
> >
> >
> > # we could use unclass to get it right
> > "[.lwdf" = function (x, i, j) {
> +   structure(lapply(unclass(x)[j], function(col) col[i]),  class =
> "lwdf")
> + }
> >
> > x[i,j]
>  a c
> 1 1 a
> 2 3 c
> >
> > # *but* unclass creates a deep copy of its argument as indirectly
> evidenced by the following timing
> > x = lwdf(a=seq(1e6)); system.time(unclass(x))
> [1] 0.01 0.00 0.01 0.00 0.00
> > x = lwdf(a=seq(1e8)); system.time(unclass(x))
> [1] 0.44 0.39 0.82 0.00 0.00
> 
> > version
>         _
> platform x86_64-unknown-linux-gnu
> arch     x86_64
> os       linux-gnu
> system   x86_64, linux-gnu
> status
> major    2
> minor    0.1
> year     2004
> month    11
> day      15
> language R
> 
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



More information about the R-devel mailing list