R devel question

Peter Dalgaard BSA p.dalgaard@biostat.ku.dk
12 Oct 1998 13:37:04 +0200


David Mosberger <david_mosberger@hp.com> writes:

> Hi Peter,
> 
> Hope you don't mind my asking you this directly (I wasn't sure if it
> would be appropriate to post this question to R-devel;

It would.

> if you think it
> is, please feel free to forward it).

Done. 

> I'm wondering whether there are any plans to extend R so it could
> handle arbitrarily large objects.  For the particular application I
> have in mind, the objects would be several hundred MB in size (in
> uncompressed form).

Some ideas involving "virtual objects" and database interfaces have
been aired on various occasions, but there are no immediate plans. It
*should* happen at some point, I think, but just now we're busy
getting the documentation in sync with the implementation and getting
rid of bugs.

> One way I think this could be handled is to leave the basic data types
> presented by R to the user unchanged, but to offer different
> implementation choices for those user data types.  For example, an
> "array of structures" is presently loaded into memory in its entirety
> when accessed.  For large data objects, this isn't ideal.  An
> alternative implementation would be to store such a big array on disk
> and load into memory only the parts that are really needed.  In other
> words, the incore representation would simply be a cache of the entire
> data object.  Of course, once this is done you could also vary the
> external representation of objects.  For example, instead of storing
> the array elements next to each other, it could often be advantageous
> to store the values of each field next to each other (so that
> operations like "compute the average of the .age field" could be
> performed efficiently).  Yet another variation might be to add
> on-the-fly compression/decompression to minimize the size of the
> external data file.
> 
> If this approach were taken, I'd imagine that R would continue to use
> the "store entirely in memory" approach by default to maintain
> backwards compatibility.  At the same time, a few new functions could
> be introduced that would allow precise control over how the object is
> implemented.  So when the user wants to deal with a large object, they
> would create the object, set its implementation to something suitable
> (e.g., cache-only, field-sequential layout, on-the-fly compression)
> and then continue to use the object as usual.
> 
> Since I'm not familiar with the internals of R, I have no idea how
> easy/hard this would be and I'd therefore appreciate hearing your
> opinion on whether you think this would be a valuable and doable
> extension.
> 
> In any case, thanks for working on R!  I was excited to find that I
> now have the option to use the S language on my Linux systems!
> 
> Cheers,
> 
> 	--david
> 
> -- 
> David Mosberger, Ph.D; HP Labs; 1501 Page Mill Rd MS 1U17; Palo Alto, CA 94304
> davidm@hpl.hp.com               voice (650) 236-2575              fax 857-5100
> 
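
To make the field-sequential idea above concrete, here is a minimal
sketch. It is an illustration under assumptions, not a description of
anything R does or plans to do: it is written in Python rather than R
or C, the file format (an 8-byte record count followed by one column
per field) is made up for the example, and the names
write_field_sequential, mean_age and people.bin are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical records: (age, height) pairs, for illustration only.
records = [(34, 1.80), (51, 1.65), (27, 1.72), (42, 1.91)]

def write_field_sequential(path, records):
    """Write all ages first, then all heights (one column per field)."""
    n = len(records)
    with open(path, "wb") as f:
        f.write(struct.pack("<q", n))                              # record count
        f.write(struct.pack(f"<{n}q", *(r[0] for r in records)))   # age column
        f.write(struct.pack(f"<{n}d", *(r[1] for r in records)))   # height column

def mean_age(path):
    """Average the age column, reading only the header and that column."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            (n,) = struct.unpack_from("<q", m, 0)       # header at offset 0
            ages = struct.unpack_from(f"<{n}q", m, 8)   # ages start at offset 8
            return sum(ages) / n

path = os.path.join(tempfile.mkdtemp(), "people.bin")
write_field_sequential(path, records)
print(mean_age(path))  # 38.5
```

Because the file is memory-mapped, the operating system pages in only
the bytes that are actually touched, so the in-core copy behaves like a
cache of the on-disk object. With the ages stored contiguously, the
average never has to read the height column at all.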

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907