[R] Using large datasets: can I overload the subscript operator?

Gabor Grothendieck ggrothendieck at gmail.com
Sat Mar 10 14:21:39 CET 2007


On 3/9/07, Maciej Radziejewski <maciej.rhelp at gmail.com> wrote:
> Hello,
>
> I do some computations on datasets that come from climate models. These data
> are huge arrays, significantly larger than typically available RAM, so they
> have to be accessed row-by-row, or rather slice-by-slice, depending on the
> task. I would like to make an R package to easily access such datasets
> within R. The C++ backend is ready and being used under Windows/.Net/Visual
> Basic, but I have yet to learn the specifics of R programming to make a good
> R interface.
>
> I think it should be possible to make a package (call it "slice") that could
> be used like this:
>
> library (slice)
> dataset <- load.virtualarray ("dataset_definition.xml")
> # Load a portion of the data from disk and extract it
> ordinaryvector <- dataset [ , 2, 3]
>
> In the above "dataset" is an object that holds a definition of a
> 3-dimensional large dataset, and "ordinaryvector" is an ordinary R vector.
> The subscripting operator fetches necessary data from disk and extracts a
> required slice, taking care of caching and other technical details. So, my
> questions are:
>
> Has anyone ever made a similar extension, with virtual (lazy) arrays?

Not quite the same, but you might look at the g.data delayed data package
in case it's good enough for your needs.  Note the dot.  gdata without a dot
is a different package.

>
> Can the subscript operator be overloaded like that in R? (I know it can be in
> S, at least for vectors.)

Yes.  Give your objects a class, say myclass, and then define a method
"[.myclass" <- function...
for that class in the S3 class system; S4 works similarly.  S3 is easier
to develop for and has higher performance so you probably want that
rather than S4.
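
As a rough illustration of the S3 route, here is a minimal sketch.  The
constructor name virtualarray and the in-memory array standing in for the
disk backend are assumptions made for the example, not part of any
existing package.

```r
# Hypothetical constructor: wraps an in-memory array as a stand-in for the
# on-disk backend (a real package would keep a file handle or pointer here).
virtualarray <- function(data) {
  structure(list(data = data), class = "myclass")
}

# S3 method for "[": R calls it for x[i, j, k] whenever x has class "myclass".
"[.myclass" <- function(x, i, j, k, drop = TRUE) {
  d <- dim(x$data)
  # An empty index (as in x[, 2, 3]) selects the whole dimension.
  if (missing(i)) i <- seq_len(d[1])
  if (missing(j)) j <- seq_len(d[2])
  if (missing(k)) k <- seq_len(d[3])
  # A real backend would fetch only the requested slice from disk here.
  x$data[i, j, k, drop = drop]
}

dataset <- virtualarray(array(1:24, dim = c(2, 3, 4)))
ordinaryvector <- dataset[, 2, 3]            # ordinary R vector
slab <- dataset[, 2, 3, drop = FALSE]        # keeps all three dimensions
```

The method only assumes three dimensions to keep the sketch short; a
general version would probably take "..." and inspect the call instead.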

A few example packages are XML (see "[.XMLNode"), fame and zoo for
S3 and 'its' for S4.

Be sure to check out
?.subset
See this post for context:
http://tolstoy.newcastle.edu.au/R/devel/05/05/0853.html
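
The reason .subset matters: inside a "[" method, writing x[i] on the
classed object calls the method again.  .subset() does the same
subsetting without method dispatch.  A small sketch, where the class
name "records" is made up for the example:

```r
# Hypothetical class: a named list that should stay a "records" object
# after subsetting.
records <- structure(list(a = 1, b = 2, c = 3), class = "records")

"[.records" <- function(x, i) {
  # x[i] here would dispatch to "[.records" again and recurse without end;
  # .subset() performs the default subsetting with no method dispatch.
  structure(.subset(x, i), class = "records")
}

sub <- records[c("a", "c")]
```

unclass(x)[i] would also avoid the recursion; .subset() just skips the
intermediate unclassed object.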

>
> And a tough one: is it possible to make an expression like "[1]" (without
> quotes) meaningful in R? At the moment it results in a syntax error. I
> would like to make it return an object of a special class that gets
> interpreted when subscripting my virtual array as "drop this dimension",
> like this:
>
> dataset [, 2, 3, drop = F]  # Return a 3-dimensional array
> dataset [, [2], 3, drop = F]  # Return a 2-dimensional array
> dataset [, [2], [3], drop = F]  # Return a 1-dimensional array, like dataset [, 2, 3]

No, but one idea is to define the single character . (i.e. a dot) to be an
object of a special class, dot say, and define "[.dot" to produce objects
of a special class (maybe also "dot").  Then you could write
dataset[, .[2], .[3], drop = FALSE] if you define "[.myclass" to look for
such objects.
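
A sketch of the dot idea, under some assumptions: the marker class is
called "dotindex" here (any name would do), the data sit in an ordinary
in-memory array, and each marked index is a single number.

```r
# "." is an object of class "dot"; .[2] wraps its index in the (made-up)
# class "dotindex", meaning "select this index and drop the dimension".
. <- structure(list(), class = "dot")
"[.dot" <- function(x, i) structure(list(index = i), class = "dotindex")

"[.myclass" <- function(x, i, j, k, drop = TRUE) {
  d <- dim(x$data)
  if (missing(i)) i <- seq_len(d[1])
  if (missing(j)) j <- seq_len(d[2])
  if (missing(k)) k <- seq_len(d[3])
  idx <- list(i, j, k)
  # Remember which dimensions were marked, then unwrap the plain indices.
  marked <- vapply(idx, inherits, logical(1), what = "dotindex")
  idx <- lapply(idx, function(v) {
    if (!inherits(v, "dotindex")) return(v)
    stopifnot(length(v$index) == 1)  # assumption: one index per marked dim
    v$index
  })
  out <- x$data[idx[[1]], idx[[2]], idx[[3]], drop = FALSE]
  if (isTRUE(drop)) return(base::drop(out))  # ordinary drop = TRUE behaviour
  if (all(marked)) return(as.vector(out))
  # drop = FALSE: remove only the dimensions that were marked with .[ ]
  array(out, dim = dim(out)[!marked])
}

# Stand-in for the virtual array: an in-memory 2 x 3 x 4 array.
dataset <- structure(list(data = array(1:24, dim = c(2, 3, 4))),
                     class = "myclass")
a3 <- dataset[, 2, 3, drop = FALSE]        # 3-dimensional
a2 <- dataset[, .[2], 3, drop = FALSE]     # 2-dimensional
a1 <- dataset[, .[2], .[3], drop = FALSE]  # 1-dimensional
```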

Another possibility is to use formula notation:

dataset[, ~2, ~3, drop = FALSE]

and have "[.myclass" handle formula arguments specially, or perhaps forget
about that notation and just extend drop:

dataset[drop = 2:3]
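
The formula variant could look roughly like this.  It is only a sketch:
the data are an in-memory array standing in for the real backend, and
each formula is assumed to hold a single index.

```r
"[.myclass" <- function(x, i, j, k, drop = TRUE) {
  d <- dim(x$data)
  if (missing(i)) i <- seq_len(d[1])
  if (missing(j)) j <- seq_len(d[2])
  if (missing(k)) k <- seq_len(d[3])
  idx <- list(i, j, k)
  # A one-sided formula such as ~2 marks "drop this dimension".
  marked <- vapply(idx, inherits, logical(1), what = "formula")
  idx <- lapply(idx, function(v) {
    if (!inherits(v, "formula")) return(v)
    # ~2 is stored as the call `~`(2); element 2 is the right-hand side,
    # evaluated in the environment where the formula was written.
    eval(v[[2]], environment(v))
  })
  out <- x$data[idx[[1]], idx[[2]], idx[[3]], drop = FALSE]
  if (isTRUE(drop)) return(base::drop(out))
  if (all(marked)) return(as.vector(out))
  # drop = FALSE: remove only the dimensions given as formulas
  array(out, dim = dim(out)[!marked])
}

# Stand-in for the virtual array: an in-memory 2 x 3 x 4 array.
dataset <- structure(list(data = array(1:24, dim = c(2, 3, 4))),
                     class = "myclass")
f2 <- dataset[, ~2, 3, drop = FALSE]    # 2-dimensional
f1 <- dataset[, ~2, ~3, drop = FALSE]   # 1-dimensional
```

Extending drop to accept a vector of dimension numbers would replace the
"marked" test with something like seq_along(idx) %in% drop.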

BTW, it's better to use FALSE rather than F since F can be a variable name.
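
A quick way to see the difference (the variable names are only for
illustration):

```r
F <- 0                              # legal: F is just a variable in base
shadowed <- identical(F, FALSE)     # our F is now the number 0
rm(F)                               # remove it; base's F is visible again
restored <- identical(F, FALSE)

# FALSE is a reserved word, so assigning to it is an error.
assign_fails <- inherits(try(eval(parse(text = "FALSE <- 0")), silent = TRUE),
                         "try-error")
```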


