[Rd] Doing the right amount of copy for large data frames.

Mon Apr 14 16:11:08 CEST 2008

Gopi Goswami wrote:
> Hi there,
>
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).
>
>
> Question ::
> This implementation will violate copy on modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea? Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
>
>   
A short --- although crass --- reply is that you should not meddle with
this until you know _exactly_ what you are doing....

Two main points are that (a) copying of dataframes in principle only
copies pointers to each variable, until the actual contents are modified
and (b) breaking copy-on-modify (and consequently effectively also break
pass-by-value) semantics is a source of unhappiness.

R does duplicate rather more than it needs to, but the main reason
probably lies in its rudimentary reference tracking (the NAMED entry in
the object header structure). Some of us do wish we could try and fix
this at some point, but it would be a major undertaking. (There are a
zillion places where we'd need to do extra housekeeping rather than let
the garbage collector tidy up after us. Also, reference-counting
solutions from other computer languages do not apply because R can have
circular references.)

> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>               representation(data = 'data.frame', nrow = 'numeric', ncol =
> 'numeric', store = 'environment'),
>               prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
> setMethod('initialize', 'DataFrame', function(.Object) {
>     .Object <- callNextMethod( )
>     .Object at store <- new.env(hash = TRUE)
>     assign('data', as.list(.Object at data), .Object at store)
>     .Object at nrow <- nrow(.Object at data)
>     .Object at ncol <- ncol(.Object at data)
>     .Object at data <- data.frame( )
>     .Object
> })
>
>
> ### Usage:
> nn  <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>   

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907