[Rd] Re: [R] Do environments make copies?

Gabor Grothendieck ggrothendieck at myway.com
Sat Feb 26 15:28:46 CET 2005


See ?gctorture
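
For readers who have not used it, here is a rough sketch of the difference between an explicit gc() and gctorture(). The object name and size are made up for illustration. An explicit gc() after dropping a reference releases a large temporary right away, whereas gctorture(TRUE) forces a collection at every allocation and is meant for debugging rather than everyday use.

```r
# Illustrative only: `x` and its size are invented for this sketch.
x <- matrix(runif(1e6), nrow = 1000)   # about 8 MB of doubles
before <- gc()["Vcells", "used"]

x <- as.vector(x)   # the old matrix is unreachable once x is rebound
rm(x)               # drop the remaining reference as well
after <- gc()["Vcells", "used"]        # an explicit collection reclaims both

print(after < before)   # fewer vector cells in use than before

# gctorture(TRUE) triggers a collection at every allocation (very slow).
# It returns the previous setting, so that setting can be restored.
old <- gctorture(TRUE)
y <- as.vector(1:10)
gctorture(old)
```

The explicit gc() call is the cheaper of the two when the goal is only to free a known temporary at a known point.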

Nawaaz Ahmed <nawaaz <at> inktomi.com> writes:

: 
: Hi Folks,
: Thanks for all your replies and input. In particular, thanks Luke, for
: explaining what is happening under the covers. In retrospect, my example
: using save and load to demonstrate the problem I was having was a
: mistake - I was trying to reproduce the problem in a simple enough way
: and I thought save and load were showing the same problem (i.e. an extra
: copy was being made). After carefully examining my gc() traces, I've
: come to realize that while there are copies being made, there is
: nothing unexpected about it - the failure to allocate memory is really
: because R is hitting the 3GB address limit imposed by my linux box
: during processing. So as Luke suggests, maybe 32 bits is not the right
: platform for handling large data in R.
: 
: On the other hand, I think the problem could be somewhat alleviated
: (though not eliminated) if we did garbage collection of temporary
: variables immediately, so that we reduce the memory footprint and the
: fragmentation problem that malloc() is going to be faced with
: (gctorture() is probably too extreme). Most of the problems that I am
: having are in the coercion routines, which do create temporary copies.
: So in code of the form x = as.vector(x), it would be nice if the old
: value of x were garbage collected immediately (i.e. if there were no
: other references to it).
: 
: nawaaz
: 
: Luke Tierney wrote:
: > On Thu, 24 Feb 2005, Berton Gunter wrote:
: > 
: >> I was hoping that one of the R gurus would reply to this, but as they
: >> haven't (thus far) I'll try. Caveat emptor!
: >>
: >> First of all, R passes function arguments by value, so as soon as you
: >> call foo(val) you are already making (at least) one other copy of val
: >> for the call.
: > 
: > 
: > Conceptually you have a copy, but internally R tries to use a
: > copy-on-modify strategy to avoid copying unless necessary.  There are
: > conservative approximations involved, so there is more copying than
: > one might like but definitely not as much as this.
: > 
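
A small sketch of the copy-on-modify behaviour described above. The function and variable names are invented for illustration. The caller's vector is never changed, and a duplication is only expected when the callee actually modifies its argument. The tracemem() part runs only on builds with memory profiling enabled.

```r
v <- c(1, 2, 3)

read_only <- function(a) sum(a)                    # never modifies its argument
modifies  <- function(a) { a[1] <- 100; sum(a) }   # modifies a private copy

r1 <- read_only(v)
r2 <- modifies(v)
print(v)   # the caller's vector is unchanged: by-value semantics

# tracemem() reports when an object is actually duplicated. It is only
# available when R was built with memory profiling, hence the guard.
if (capabilities("profmem")) {
  big <- runif(1e6)
  tracemem(big)
  read_only(big)   # no duplication should be reported here
  modifies(big)    # a duplication should be reported here
  untracemem(big)
}
```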
: > 
: >> Second, you seem to implicitly assume that assign(..., env=) uses a
: >> pointer to point to the values in the environment. I do not know how
: >> R handles environments and assignments like this internally, but your
: >> data seem to indicate that it copies the value and does not merely
: >> point to it (this is where R Core folks can shed more authoritative
: >> light).
: > 
: > 
: > This assignment does just store the pointer.
: > 
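
A brief sketch of what "just store the pointer" means in practice, using made-up names. An environment is a reference object, so assigning it to a second name does not copy it, and object.size() on the environment does not count what it holds.

```r
e <- new.env()
big <- runif(1e6)                 # about 8 MB of doubles
assign("value", big, envir = e)   # stores a reference to the same vector

print(object.size(big))   # large
print(object.size(e))     # small: the contents are not included

e2 <- e                   # environments are not copied on assignment
assign("flag", TRUE, envir = e2)

# The binding made through e2 is visible through e: both names refer to
# one and the same environment.
print(exists("flag", envir = e, inherits = FALSE))
print(identical(get("value", envir = e), big))
```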
: >> Finally, it makes perfect sense to me that, as a data structure, the
: >> environment itself may be small even if it effectively points to (one of
: >> several copies of) large objects, so that object.size(an.environment)
: >> could be small although the environment may "contain" huge arguments.
: >> Again, the details depend on the precise implementation and need
: >> clarification by someone who actually knows what's going on here,
: >> which ain't me.
: >>
: >> I think the important message is that you shouldn't treat R as C, and you
: >> shouldn't try to circumvent R's internal data structures and
: >> conventions. R is a language designed to implement Chambers's S model
: >> of "Programming with Data." Instead of trying to fool R into handling
: >> large data sets, maybe you should consider whether you really **need**
: >> all the data in R at one time and if sensible partitioning or sampling
: >> to analyze only a portion or portions of the data might not be a more
: >> effective strategy.
: > 
: > 
: > R can do quite a reasonable job with large data sets on a reasonable
: > platform.  A 32-bit platform is not a reasonable one on which to use R
: > with 800 MB chunks of data. Automatic memory management combined with
: > the immutable vector semantics requires more elbow room than that.  If
: > you really must use data of this size on a 32-bit platform you will
: > probably be much happier using a limited amount of C code and external
: > pointers.
: > 
: > As to what is happening in this example: look at the default parent
: > used by new.env and combine that with the fact that the serialization
: > code does not preserve sharing of atomic objects.  The two references
: > to the large object are shared in the original session but lead to two
: > large objects in the saved image and the load.  Using
: > 
: >     ref <- list(env = new.env(parent = .GlobalEnv))
: > 
: > in new.ref avoids the second copy both in the saved image and after
: > loading.
: > 
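
To spell out the point about the default parent: new.env() uses the calling frame as the parent, so an environment created inside new.ref keeps that call frame alive, and the frame still holds `value`. The sketch below contrasts the two variants. `bad.ref` is a hypothetical name for the original definition, and the small vector stands in for the large object.

```r
# Original form: the new environment's parent is the call frame of the
# function, which also holds a binding for `value`.
bad.ref <- function(value = NULL) {
  ref <- list(env = new.env())
  class(ref) <- "refObject"
  assign("value", value, envir = ref$env)
  ref
}

# Suggested form: the parent is the global environment, so the call
# frame is not retained by the returned object.
new.ref <- function(value = NULL) {
  ref <- list(env = new.env(parent = .GlobalEnv))
  class(ref) <- "refObject"
  assign("value", value, envir = ref$env)
  ref
}

y <- c(1.5, 2.5, 3.5)   # stand-in for the large object
r.bad  <- bad.ref(y)
r.good <- new.ref(y)

# The parent of the first environment is the old call frame, and that
# frame has its own `value` binding: a second reference to the data.
print(exists("value", envir = parent.env(r.bad$env), inherits = FALSE))

# The parent of the second environment is the global environment.
print(identical(parent.env(r.good$env), globalenv()))
```

With the first form, saving the object also saves the retained frame, which is where the second reference to the data comes from.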
: > luke
: > 
: >>
: >>> -----Original Message-----
: >>> From: r-help-bounces <at> stat.math.ethz.ch
: >>> [mailto:r-help-bounces <at> stat.math.ethz.ch] On Behalf Of Nawaaz Ahmed
: >>> Sent: Thursday, February 24, 2005 10:36 AM
: >>> To: r-help <at> stat.math.ethz.ch
: >>> Subject: [R] Do environments make copies?
: >>>
: >>> I am using environments to avoid making copies (by keeping
: >>> references). But it seems like there is a hidden copy going on
: >>> somewhere - for example in the code fragment below, I am creating a
: >>> reference to "y" (of size 500MB) and storing the reference in object
: >>> "data". But when I save "data" and then restore it in another R
: >>> session, gc() claims it is using twice the amount of memory.
: >>> Where/How is this happening?
: >>>
: >>> Thanks for any help in working around this - my datasets are just not
: >>> fitting into my 4GB, 32 bit linux machine (even though my actual data
: >>> size is around 800MB)
: >>>
: >>> Nawaaz
: >>>
: >>> > new.ref <- function(value = NULL) {
: >>> +     ref <- list(env = new.env())
: >>> +     class(ref) <- "refObject"
: >>> +     assign("value", value, env = ref$env)
: >>> +     ref
: >>> + }
: >>> > object.size(y)
: >>> [1] 587941404
: >>> > y.ref = new.ref(y)
: >>> > object.size(y.ref)
: >>> [1] 328
: >>> > data = list()
: >>> > data$y.ref = y.ref
: >>> > object.size(data)
: >>> [1] 492
: >>> > save(data, file = "data.RData")
: >>>
: >>> ...
: >>>
: >>> run R again
: >>> ===========
: >>>
: >>> > load("data.RData")
: >>> > gc()
: >>>              used   (Mb) gc trigger   (Mb)
: >>> Ncells    141051    3.8     350000    9.4
: >>> Vcells 147037925 1121.9  147390241 1124.5
: >>>
: >>> ______________________________________________
: >>> R-help <at> stat.math.ethz.ch mailing list
: >>> https://stat.ethz.ch/mailman/listinfo/r-help
: >>> PLEASE do read the posting guide!
: >>> http://www.R-project.org/posting-guide.html
: >>>
: >>
: >>
: >
: 
: ______________________________________________
: R-devel <at> stat.math.ethz.ch mailing list
: https://stat.ethz.ch/mailman/listinfo/r-devel
: 
:


