[R] object size vs. file size

Gabor Grothendieck ggrothendieck at gmail.com
Tue Mar 28 21:59:56 CEST 2006


Note that formulas have environments too.  Do you have any of
those?
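
Here is a minimal sketch of how a formula can drag data into a saved
file. The names make_formula and big are made up for the illustration,
and serialize() is used only as a rough stand-in for what save() writes
before compression.

```r
# Hypothetical helper: the formula is created inside the function, so its
# environment is the function's frame, which also holds `big`.
make_formula <- function() {
  big <- numeric(1e5)   # large local object that the formula never uses
  y ~ x
}

fo <- make_formula()

# Names visible in the formula's environment; `big` is among them.
env_names <- ls(environment(fo))

# Number of bytes needed to serialize the formula, environment included.
size_with_env <- length(serialize(fo, NULL))

# Pointing the formula at the global environment leaves the frame behind,
# because the global environment is stored only as a reference.
environment(fo) <- globalenv()
size_without_env <- length(serialize(fo, NULL))

print(env_names)
print(c(with_env = size_with_env, without_env = size_without_env))
```

If your objects hold formulas created inside the fitting function, the
same comparison on one of them should show whether they matter.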

On 3/28/06, Steven Lacey <slacey at umich.edu> wrote:
> Duncan,
>
> I wrote an R package to process my data. The package was written in such a
> way that I no longer stored functions themselves in my "sa" objects, just
> their names (as strings) instead. I re-ran my analysis and found that the
> saved objects were indeed smaller when I was not saving attached
> environments. However, I still find the object size discrepancy. That is, I
> have two objects, tmp and tmp1, that are the same size in R (object.size
> reports 870116 bytes for both), but vastly different sizes as saved
> objects (tmp = 1091 KB, tmp1 = 8436 KB).
>
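
One way to compare the two numbers without writing files is sketched
below. size_report and make_fit are hypothetical names, and the list
returned by make_fit is only a stand-in for an "sa" object, not the real
class.

```r
# Hypothetical helper: in-memory estimate next to the serialized byte count.
size_report <- function(obj) {
  c(object.size = as.numeric(object.size(obj)),
    serialized  = length(serialize(obj, NULL)))
}

# Stand-in for a fitted object that carries a closure.
make_fit <- function() {
  scratch <- numeric(2e5)   # working data left behind in the frame
  list(coef = c(a = 1, b = 2), predict = function(x) x)
}

fit <- make_fit()
report <- size_report(fit)
print(report)
```

object.size() does not count the environment of the closure, while
serialize() has to write it out, so a large gap between the two numbers
points at a captured environment.
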
> While saving the environment is an issue in overall size, I am not sure it
> accounts for the difference in size. I am beginning to think it has to do
> with the code used to generate the objects.
>
> To do the fitting (which creates tmp and tmp1 objects):
>
> 1) d.rt <- split a dataframe
> 2) define a list called arg, which defines all the parameters for the
> fitting
>
> My problem is that I need to call the function that does the fitting (df2sa)
> once for each dataframe in the list d.rt with the parameters specified in
> arg. To do this I add two additional components to the arg list:
> arg$X <- d.rt
> # df2sa.models manages the fitting for each dataframe in d.rt.
> arg$FUN <- "df2sa.models"
>
> Now I call:
> do.call("lapply", arg)
> I expect it to call df2sa for each dataframe in d.rt passing in the
> remaining parameters in the arg list. The code "works" in the sense that I
> get the returned objects, but when I save them the sizes are strange, as
> described above.
>
> I obtain the "small" version of the same object when I call:
> tmp <- do.call(df2sa, arg)
>
> In this case there is no lapply wrapper. Somehow lapply is adding something
> more to what is returned, but I am not sure what or how. What is also
> strange is that the object in question is not the last element in d.rt, so
> it's not as if lapply is returning everything in that one object.
>
> I attached the object files again and the class definitions required to view
> them. However, note that the object names differ from the ones used above.
>
> tmp = incompat
> tmp1 = x0302.incompatible.RT.fits
>
> Please help!
>
> Thanks,
> Steve
>
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> Sent: Sunday, March 26, 2006 10:34 AM
> To: Gabor Grothendieck
> Cc: Steven Lacey; r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
>
>
> On 3/25/2006 10:16 PM, Gabor Grothendieck wrote:
> > You can place functions in lists or environments and pass the
> > environment to the function and have it look there first. That way you
> > can have different versions of a function with the same name.
> >
> > 1. Here is an example using lists:
> >
> > A <- list(f = sin)
> > B <- list(f = cos)
> > f <- function(x) x+2
> >
> > myfun <- function(x, L = NULL) with(L, f)(x)
> >
> > myfun(0) # 2
> > myfun(0, A) # 0
> > myfun(0, B) # 1
> >
> > All three of the above make a call to f but the first uses the f in
> > the global environment, the second uses the f in A and the third uses
> > the f in B.
> >
> > 2. Above we illustrated this using lists but it can also be done using
> > environments. In the following we use the proto package to facilitate
> > this.  proto objects are built on top of environments. For example,
> > you could replace the first two lines in the prior example with:
> >
> > library(proto)
> > A <- proto(f = sin)
> > B <- proto(f = cos)
> >
> > Note that in #1 and #2 myfun did have to be programmed to handle
> > this.   Another way to do this which does not require myfun to be
> > preprogrammed is the following:
> >
> >
> > library(proto)
> > A <- proto(f = sin)
> > B <- proto(f = cos)
> > myfun <- function(x) f(x)
> >
> > myfun(0) # 2
> > with(A$proto(myfun = myfun), myfun)(0) # 0
> > with(B$proto(myfun = myfun), myfun)(0) # 1
> >
> > The first with statement defines a child object of A which contains
> > a single method myfun, A$proto(myfun = myfun).   Then it calls the
> > myfun in that new object.  Since the new object is a child of A, myfun
> > will look for f in the new object and not finding it will search
> > the parent A and find it there.   Similarly for B in the second with
> > statement.
> >
> >
> >
> > Regarding removing environments, if f is a function you can do this:
> >
> > environment(f) <- NULL
> >
> > but you will likely need to restore the environment prior to using f.
>
> That will get you a warning in 2.3.0 (and replace the NULL with
> baseenv()), and an error in 2.4.0.  In current and past versions, a NULL
> wasn't interpreted as "no environment", it was interpreted as the base
> environment.
>
> If you want something that is like "no environment", you can use
> emptyenv() in 2.3.0, but this would rarely make sense for an R function:
>  even the most basic things involved in evaluation need to come from
> somewhere.  emptyenv() is mainly designed for situations where you want
> an entirely separate namespace, not related to R functions at all, but
> using the same syntax and rules for lookups.
>
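
A small sketch of the alternatives to assigning NULL, under the
assumption that the closure does not need anything from its defining
frame. make_counter and scratch are made-up names.

```r
make_counter <- function() {
  scratch <- numeric(2e5)   # large local that the returned closure never uses
  function(x) x + 1
}

f <- make_counter()
before <- length(serialize(f, NULL))

# Re-home the closure instead of assigning NULL. The global environment is
# serialized as a reference only, so the captured frame is left behind.
environment(f) <- globalenv()
after <- length(serialize(f, NULL))

# With emptyenv() even `+` cannot be found, so the call fails.
g <- f
environment(g) <- emptyenv()
empty_result <- tryCatch(g(1), error = function(e) "error")

print(c(before = before, after = after))
print(f(1))
print(empty_result)
```

This only helps when the function really is independent of the frame it
was created in; a closure that reads variables from that frame would
break after the reassignment.
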
> Duncan Murdoch
>
> >
> > On 3/25/06, Steven Lacey <slacey at umich.edu> wrote:
> >> Duncan,
> >>
> >> Thanks! This is progress! One solution might be to remove all
> >> environments from the objects that I want to save in the "sa" object,
> >> thereby avoiding the problem of saving environments altogether. But,
> >> can I remove the environment from a function? Does that even make
> >> sense given how R operates under the hood? Even if I could, would the
> >> functions still work?
> >>
> >> Here is my more general problem. As I learn more about R and the
> >> demands made on my code change, I sometimes change a function
> >> referenced by a given name rather than explicitly defining a new
> >> version of that function. This creates a problem when I want to
> >> review how the model stored in the "sa" object was originally
> >> created. If only the function name is stored in the "sa" object, I
> >> won't necessarily know what version was actually called at the time
> >> the model was constructed because I did not rename it. To deal with
> >> this I decided to store the function itself.
> >>
> >> Sounds like this may not be a great idea, or at least comes with
> >> serious trade-offs, particularly as some functions are generic like
> >> the mean. Is there a better way to save a function than to save the
> >> function itself or just its name? For instance, do args() and body()
> >> return an associated environment? I assume I could recreate the
> >> original function from these objects, correct? If so, is there some
> >> easy way to do it?
> >>
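
On the args()/body() question, a sketch using formals() and body(),
which return a pairlist and a language object and so carry no
environment of their own. The list name stored is made up, and the
choice of globalenv() when rebuilding is an assumption about where the
function should look up its free variables.

```r
f <- function(x, n = 2) x^n

# Keep only the argument list and the body; neither has an environment.
stored <- list(args = formals(f), body = body(f))

# Rebuild later, choosing the environment explicitly.
rebuilt <- as.function(c(stored$args, list(stored$body)),
                       envir = globalenv())

print(rebuilt(3))
print(deparse(stored$body))
```

The deparse() output could also be kept as plain text next to the fit,
which records what the function looked like without storing a closure.
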
> >> Alternatively, are there any version control tools built into R? That
> >> is, is there a way R can keep track of the version for me (as opposed
> >> to explicitly declaring different versions foo<-..., foo.v1<-...,
> >> foo.v2<-...)? I am not sure exactly what I am asking for here. The
> >> more I write the more this seems unreasonable. A new function
> >> requires a new name, right? I just find myself writing lots of new
> >> versions and keeping track of their names, which one does what, and
> >> changing the names in other functions that call them a little
> >> overwhelming. Maybe the way to deal with this is to write different
> >> versions of the same package. That way the versions will affect the
> >> naming of and the call to load the package, but not the calls to
> >> individual functions. This way functions can have the same name, but
> >> do different things depending on the package version, not the
> >> function name. However, I have never created a package and would
> >> prefer not to do so in the short-term (my dissertation is due in
> >> August), unless it is fairly straightforward.
> >>
> >> The more I think about it a package is more accurately what I want. I
> >> want to be able to recreate the analysis of my data long after it has
> >> been completed. If I had packages, then I would just need to know
> >> what version of the package was used, load it, and re-run the
> >> analysis. I wouldn't need to store the critical functions in the
> >> object. Where might I find a good introduction to writing packages?
> >>
> >> In the short-term would the solution above (using body and args)
> >> work?
> >>
> >> Thanks again,
> >> Steve
> >>
> >>
> >> -----Original Message-----
> >> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> >> Sent: Saturday, March 25, 2006 5:31 PM
> >> To: Steven Lacey
> >> Cc: r-help at stat.math.ethz.ch
> >> Subject: Re: [R] object size vs. file size
> >>
> >>
> >> On 3/25/2006 7:32 AM, Steven Lacey wrote:
> >>> Hi,
> >>>
> >>> There is a rather large discrepancy between the size of the object as
> >>> it lives in R and the size of the object when it is written to the
> >>> disk. The object in question is an S4 object of a homemade class "sa".
> >>> I first call a function that writes a list of these objects to a file
> >>> called "data.RData". The size of this file is 14,411 KB. I would
> >>> assume on average then, that each list component--there are 32 sa
> >>> objects in data.RData--would be approximately 450 KB (14,411/32).
> >>> However, when I load the data into R and call object.size on just
> >>> one S4 object (call it tmp) it returns 77496 bytes (77 KB)! What is
> >>> even stranger is that if I save this S4 object alone by calling
> >>> save(tmp, file="tmp.RData"), tmp.RData is 13.3 MB! I understand from
> >>> the help on object.size that the object size is only approximate and
> >>> excludes the space required to store its name in the symbol table.
> >>> But, this difference in object size and file size is huge! This
> >>> phenomenon occurs no matter which S4 object I save from data.RData.
> >>>
> >>> Why is the object so big when it is in a file? What else is getting
> >>> stored with it? I have examined the object in R to find additional
> >>> information stored with it, but have not found anything that would
> >>> account for the size of the object in the file system. For example,
> >>>> environment(tmp)
> >>> NULL
> >> I'm not 100% sure where the problem is, but I think it probably does
> >> involve environments.  Your tmp object contains a number of
> >> functions. I think when some function is saved, its environment is
> >> being saved too, and the environment contains much more than you
> >> thought.
> >>
> >> R doesn't normally save a new copy of a package or namespace
> >> environment when it saves a function, nor does it save a complete
> >> copy of .GlobalEnv with every function defined there, but it does
> >> save the environment in some other circumstances.  For example, look
> >> at this code:
> >>
> >>  > f <- function() {
> >> +       notused <- 1:1000000
> >> +       value <- function() 1
> >> +       return(value)
> >> +  }
> >>  >
> >>  >  g <- f()
> >>  >  g
> >> function() 1
> >> <environment: 01B10D1C>
> >>  >  save(g, file='g.RData')
> >>  > object.size(g)
> >> [1] 200
> >>
> >> The g object is 200 bytes or so, but when it is saved, the defining
> >> environment containing that huge "notused" variable is saved with it,
> >> so  g.RData ends up being about 4 Megabytes in size.
> >>
> >> I don't know any function that will help to diagnose where this
> >> happens.  Here's one that doesn't quite work:
> >>
> >> findenvironments <- function(x) {
> >>     e <- environment(x)
> >>     if (is.null(e)) result <- NULL
> >>     else result <- list(e)
> >>     x <- unclass(x)
> >>     if (is.list(x)) {
> >>        for (i in seq(along=x)) {
> >>          contained <- findenvironments(x[[i]])
> >>          if (length(contained)) result <- c(result, contained)
> >>        }
> >>     }
> >>     if (length(result)) browser()
> >>     result
> >> }
> >>
> >> This won't recurse into the slots of an S4 object, so it doesn't
> >> really help you, and I'm not sure how to do that.  But maybe someone
> >> else can fix it.
> >>
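
A possible variant that also walks S4 slots is sketched below. It is an
untested-against-"sa" guess at what is needed: find_environments, the
DemoFit class and make_demo are all hypothetical, and the function does
not look inside the environments it finds.

```r
library(methods)

# Collect environments reachable from x through closures, formulas
# (the ".Environment" attribute), list elements and S4 slots.
find_environments <- function(x) {
  if (is.environment(x)) return(list(x))
  result <- list()
  if (is.function(x) && !is.primitive(x)) result <- list(environment(x))
  e <- attr(x, ".Environment", exact = TRUE)
  if (is.environment(e)) result <- c(result, list(e))
  if (isS4(x)) {
    # Recurse into every slot of an S4 object.
    for (nm in slotNames(x)) {
      result <- c(result, find_environments(slot(x, nm)))
    }
  } else if (is.list(x)) {
    for (el in unclass(x)) result <- c(result, find_environments(el))
  }
  result
}

# Made-up S4 class standing in for "sa".
setClass("DemoFit", slots = c(model = "function", extras = "list"))

make_demo <- function() {
  scratch <- numeric(1e5)   # large local captured by closure and formula
  new("DemoFit", model = function(x) x, extras = list(terms = y ~ x))
}

obj <- make_demo()
envs <- find_environments(obj)

# Serialized size of each environment found, largest offenders stand out.
sizes <- vapply(envs, function(e) length(serialize(e, NULL)), integer(1))
print(length(envs))
print(sizes)
```

The same environment can show up more than once (here both the closure
and the formula point at the frame of make_demo), so the sizes should
not be summed.
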
> >> Duncan Murdoch
> >>
> >> ______________________________________________
> >> R-help at stat.math.ethz.ch mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide!
> >> http://www.R-project.org/posting-guide.html
> >>
> >