[R] object size vs. file size

Tue Mar 28 22:02:08 CEST 2006

On 3/28/2006 2:54 PM, Steven Lacey wrote:
> Duncan, 
> 
> I wrote an R package to process my data. The package was written in such a
> way that I no longer stored functions themselves in my "sa" objects, just
> their names (as strings) instead. I re-ran my analysis and found that,
> indeed the saved object sizes were smaller when I was not saving attached
> environments. However, I still find the object size discrepancy. That is, I
> have two objects tmp and tmp1 that are the same size in R (when calling
> object.size both are 870116 bytes), but vastly different sizes as saved
> objects (tmp = 1091 KB, tmp1 = 8436 KB).
> 
> While saving the environment is an issue in overall size, I am not sure it
> accounts for the difference in size. I am beginning to think it has to do
> with the code used to generate the objects. 
> 
> To do the fitting (which creates tmp and tmp1 objects):
> 
> 1) d.rt <- split a dataframe
> 2) define a list called arg, which defines all the parameters for the
> fitting
> 
> My problem is that I need to call the function that does the fitting (df2sa)
> once for each dataframe in the list d.rt with the parameters specified in
> arg. To do this I add two additional components to the arg list:
> arg$X <- d.rt
> arg$FUN <- "df2sa.models" # This function manages the fitting for each
>                           # dataframe in d.rt.
> 
> Now I call:
> do.call("lapply",arg)
> I expect it to call df2sa for each dataframe in d.rt passing in the
> remaining parameters in the arg list. The code "works" in the sense that I
> get the returned objects, but when I save them the sizes are strange, as
> described above. 
> 
> I obtain the "small" version of the same object when I call:
> tmp <- do.call(df2sa,arg).
> 
> In this case there is no lapply wrapper. Somehow lapply is adding something
> more to what is returned, but I am not sure what or how. What is also
> strange is that the object in question is not the last element in d.rt, so
> it's not as if lapply is returning everything in that one object.  
> 
> I attached the object files again and the class definitions required to view
> them. However, note that the object names differ from the ones used above.
> 
> tmp = incompat
> tmp1 = x0302.incompatible.RT.fits
> 
> Please help!

Sorry, I can't really help.  I suspect it's still an issue of 
environments, but you'll need to find someone who knows the S4 internals 
better than me to figure out where the environments are hiding.

Duncan Murdoch
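
A rough way to check the environment suspicion is to compare the in-memory size with the serialized size, and to walk the slots looking for closures whose environment is not one of the standard ones. This is only a sketch: find_envs(), make_fit() and the "sa_demo" class are made-up names for illustration, not part of the poster's package, and the walker only looks at functions stored in lists and S4 slots (it ignores environments hidden in attributes, formulas and the like).

```r
library(methods)

# Hypothetical helper: return the paths of all closures reachable from x
# (through list elements and S4 slots) whose environment is not the global
# environment, the base environment, the empty environment or a namespace.
find_envs <- function(x, path = "x") {
  found <- character(0)
  if (is.function(x)) {
    e <- environment(x)  # NULL for primitive functions
    standard <- is.null(e) || identical(e, globalenv()) ||
      identical(e, baseenv()) || identical(e, emptyenv()) || isNamespace(e)
    if (!standard) found <- path
  } else if (isS4(x)) {
    for (s in slotNames(x))
      found <- c(found, find_envs(slot(x, s), paste0(path, "@", s)))
  } else if (is.list(x)) {
    for (i in seq_along(x))
      found <- c(found, find_envs(x[[i]], paste0(path, "[[", i, "]]")))
  }
  found
}

# Hypothetical stand-in for an "sa" object: one function slot, one data slot.
setClass("sa_demo", representation(fit = "function", data = "numeric"))

# The returned closure keeps make_fit()'s frame alive, including 'notused'.
make_fit <- function() {
  notused <- numeric(1e5)
  function(x) x + 1
}

obj <- new("sa_demo", fit = make_fit(), data = as.numeric(1:10))

mem_size  <- as.numeric(object.size(obj))  # does not follow environments
disk_size <- length(serialize(obj, NULL))  # does follow them, like save()
culprits  <- find_envs(obj)

# Re-pointing the closure at the global environment drops the captured frame.
f <- obj@fit
environment(f) <- globalenv()
obj2 <- obj
obj2@fit <- f
disk_size_stripped <- length(serialize(obj2, NULL))
```

If a path such as x@fit shows up in culprits, that closure's defining frame travels with the object when it is saved. Whether re-pointing the environment is safe depends on whether the function needs anything from the frame it was created in.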

> 
> Thanks,
> Steve
> 
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
> Sent: Sunday, March 26, 2006 10:34 AM
> To: Gabor Grothendieck
> Cc: Steven Lacey; r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
> 
> 
> On 3/25/2006 10:16 PM, Gabor Grothendieck wrote:
>> You can place functions in lists or environments and pass the 
>> environment to the function and have it look there first. That way you 
>> can have different versions of a function with the same name.
>> 
>> 1. Here is an example using lists:
>> 
>> A <- list(f = sin)
>> B <- list(f = cos)
>> f <- function(x) x+2
>> 
>> myfun <- function(x, L = NULL) with(L, f)(x)
>> 
>> myfun(0) # 2
>> myfun(0, A) # 0
>> myfun(0, B) # 1
>> 
>> All three of the above make a call to f but the first uses the f in 
>> the global environment, the second uses the f in A and the third uses 
>> the f in B.
>> 
>> 2. Above we illustrated this using lists but it can also be done using 
>> environments. In the following we use the proto package to facilitate 
>> this.  proto objects are built on top of environments. For example, 
>> you could replace the first two lines in the prior example with:
>> 
>> library(proto)
>> A <- proto(f = sin)
>> B <- proto(f = cos)
>> 
>> Note that in #1 and #2 myfun did have to be programmed to handle
>> this.   Another way to do this which does not require myfun to be
>> preprogrammed is the following:
>> 
>> 
>> library(proto)
>> A <- proto(f = sin)
>> B <- proto(f = cos)
>> myfun <- function(x) f(x)
>> 
>> myfun(0) # 2
>> with(A$proto(myfun = myfun), myfun)(0) # 0
>> with(B$proto(myfun = myfun), myfun)(0) # 1
>> 
>> The first with statement defines a child object of A which contains
>> a single method myfun, A$proto(myfun = myfun).   Then it calls the
>> myfun in that new object.  Since the new object is a child of A, myfun 
>> will look for f in the new object and not finding it will search
>> the parent A and find it there.   Similarly for B in the second with
>> statement.
>> 
>> 
>> 
>> Regarding removing environments, if f is a function you can do this:
>> 
>> environment(f) <- NULL
>> 
>> but you will likely need to restore the environment prior to using f.
> 
> That will get you a warning in 2.3.0 (and replace the NULL with 
> baseenv()), and an error in 2.4.0.  In current and past versions, a NULL 
> wasn't interpreted as "no environment", it was interpreted as the base 
> environment.
> 
> If you want something that is like "no environment", you can use 
> emptyenv() in 2.3.0, but this would rarely make sense for an R function: 
>   even the most basic things involved in evaluation need to come from 
> somewhere.  emptyenv() is mainly designed for situations where you want 
> an entirely separate namespace, not related to R functions at all, but 
> using the same syntax and rules for lookups.
> 
> Duncan Murdoch
> 
>> 
>> On 3/25/06, Steven Lacey <slacey at umich.edu> wrote:
>>> Duncan,
>>>
>>> Thanks! This is progress! One solution might be to remove all 
>>> environments from the objects that I want to save in the "sa" object, 
>>> thereby avoiding the problem of saving environments altogether. But, 
>>> can I remove the environment from a function? Does that even make 
>>> sense given how R operates under the hood? Even if I could, would the 
>>> functions still work?
>>>
>>> Here is my more general problem. As I learn more about R and the 
>>> demands made on my code change, I sometimes change a function 
>>> referenced by a given name rather than explicitly defining a new 
>>> version of that function. This creates a problem when I want to 
>>> review how the model stored in the "sa" object was originally 
>>> created. If only the function name is stored in the "sa" object, I 
>>> won't necessarily know what version was actually called at the time 
>>> the model was constructed because I did not rename it. To deal with 
>>> this I decided to store the function itself.
>>>
>>> Sounds like this may not be a great idea, or at least comes with 
>>> serious trade-offs, particularly as some functions are generic like 
>>> the mean. Is there a better way to save a function than to save the 
>>> function itself or just its name? For instance, do args() and body() 
>>> return an associated environment? I assume I could recreate the 
>>> original function from these objects, correct? If so, is there some 
>>> easy way to do it?
>>>
>>> Alternatively, are there any version control tools built into R? That 
>>> is, is there a way R can keep track of the version for me (as opposed 
>>> to explicitly declaring different versions foo<-..., foo.v1<-..., 
>>> foo.v2<-...)? I am not sure exactly what I am asking for here. The 
>>> more I write the more this seems unreasonable. A new function 
>>> requires a new name, right? I just find myself writing lots of new 
>>> versions and keeping track of their names, which one does what, and 
>>> changing the names in other functions that call them a little 
>>> overwhelming. Maybe the way to deal with this is to write different 
>>> versions of the same package. That way the versions will affect the 
>>> naming of and the call to load the package, but not the calls to 
>>> individual functions. This way functions can have the same name, but 
>>> do different things depending on the package version, not the 
>>> function name. However, I have never created a package and would 
>>> prefer not to do so in the short-term (my dissertation is due in 
>>> August), unless it is fairly straightforward.
>>>
>>> The more I think about it a package is more accurately what I want. I 
>>> want to be able to recreate the analysis of my data long after it has 
>>> been completed. If I had packages, then I would just need to know 
>>> what version of the package was used, load it, and re-run the 
>>> analysis. I wouldn't need to store the critical functions in the 
>>> object. Where might I find a good introduction to writing packages?
>>>
>>> In the short-term would the solution above (using body and args) 
>>> work?
>>>
>>> Thanks again,
>>> Steve
>>>
>>>
>>> -----Original Message-----
>>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>>> Sent: Saturday, March 25, 2006 5:31 PM
>>> To: Steven Lacey
>>> Cc: r-help at stat.math.ethz.ch
>>> Subject: Re: [R] object size vs. file size
>>>
>>>
>>> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>>>> Hi,
>>>>
>>>> There is a rather large discrepancy in the size of the object as it 
>>>> lives in R and the size of the object when it is written to the 
>>>> disk. The object in question is an S4 of a homemade class "sa". I 
>>>> first call a function that writes a list of these objects to a file 
>>>> called "data.RData". The size of this file is 14,411 KB. I would 
>>>> assume on average then, that each list component--there are 32 sa 
>>>> objects in data.RData--would be approximately 450 KB (14,411/32). 
>>>> However, when I load the data into R and call object.size on just 
>>>> one s4 object (call it tmp) it returns 77496 bytes (77 KB)! What is 
>>>> even stranger is that if I save this S4 object alone by calling 
>>>> save(tmp, file="tmp.RData"), tmp.RData is 13.3 MB! I understand from 
>>>> the help on object.size that the object size is only approximate and 
>>>> excludes the space required to store its name in the symbol table. 
>>>> But, this difference in object size and file size is huge! This 
>>>> phenomenon occurs no matter which S4 object I save from data.RData.
>>>>
>>>> Why is the object so big when it is in a file? What else is getting 
>>>> stored with it? I have examined the object in R to find additional 
>>>> information stored with it, but have not found anything that would 
>>>> account for the size of the object in the file system. For example,
>>>>> environment(tmp)
>>>> NULL
>>> I'm not 100% sure where the problem is, but I think it probably does 
>>> involve environments.  Your tmp object contains a number of 
>>> functions. I think when some function is saved, its environment is 
>>> being saved too, and the environment contains much more than you 
>>> thought.
>>>
>>> R doesn't normally save a new copy of a package or namespace 
>>> environment when it saves a function, nor does it save a complete 
>>> copy of .GlobalEnv with every function defined there, but it does 
>>> save the environment in some other circumstances.  For example, look 
>>> at this code:
>>>
>>>  > f <- function() {
>>> +       notused <- 1:1000000
>>> +       value <- function() 1
>>> +       return(value)
>>> +  }
>>>  >
>>>  >  g <- f()
>>>  >  g
>>> function() 1
>>> <environment: 01B10D1C>
>>>  >  save(g, file='g.RData')
>>>  > object.size(g)
>>> [1] 200
>>>
>>> The g object is 200 bytes or so, but when it is saved, the defining 
>>> environment containing that huge "notused" variable is saved with it, 
>>> so  g.RData ends up being about 4 Megabytes in size.
>>>
>>> I don't know any function that will help to diagnose where this 
>>> happens.  Here's one that doesn't quite work:
>>>
>>> findenvironments <- function(x) {
>>>     e <- environment(x)
>>>     if (is.null(e)) result <- NULL
>>>     else result <- list(e)
>>>     x <- unclass(x)
>>>     if (is.list(x)) {
>>>        for (i in seq(along=x)) {
>>>          contained <- findenvironments(x[[i]])
>>>          if (length(contained)) result <- c(result, contained)
>>>        }
>>>     }
>>>     if (length(result)) browser()
>>>     result
>>> }
>>>
>>> This won't recurse into the slots of an S4 object, so it doesn't 
>>> really help you, and I'm not sure how to do that.  But maybe someone 
>>> else can fix it.
>>>
>>> Duncan Murdoch
>>>
>>> ______________________________________________
>>> R-help at stat.math.ethz.ch mailing list 
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide! 
>>> http://www.R-project.org/posting-guide.html
>>>
> 
> 
>