[R] object size vs. file size

Sun Mar 26 05:18:15 CEST 2006

On 3/25/2006 10:06 PM, Steven Lacey wrote:
> Duncan, 
> 
> Thanks so much. This exchange has been really informative!
> 
> When you say...
> 
> "No, args() and body() don't return the environment, and that means that 
> you *can't* recreate the original function, because the environment is 
> an integral part of an R function.  It's where the definition for 
> everything external comes from."
> 
> What do you mean that the environment is where everything external to the
> function comes from? Aren't arguments passed into a function? Aren't they
> independent of the function's environment?

Take this example:

f <- function(x) {
   return(mean(x))
}

It doesn't define the mean function.  It looks it up in the enclosing 
environment.  If someone has redefined mean() there, that will change 
the value of f().
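
A minimal sketch of that lookup, best tried in a fresh session. The replacement mean() below is deliberately silly and purely illustrative:

```r
f <- function(x) {
  return(mean(x))
}

before <- f(c(1, 2, 3))    # mean() is found in base

# Mask mean() in the environment where f was defined.
mean <- function(x) "not the usual mean"
after <- f(c(1, 2, 3))     # f now picks up the masking definition

# Remove the mask; f finds the base version again.
rm(mean)
restored <- f(c(1, 2, 3))
```

Nothing about f itself changed between the three calls; only the contents of its enclosing environment did.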
> 
> I suspect I need deeper understanding of environments and functions. Would
> you recommend a good reference?

The R Language Definition manual has some discussion of them.

Duncan Murdoch

> 
> Thanks, 
> Steve
> 
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
> Sent: Saturday, March 25, 2006 9:12 PM
> To: Steven Lacey
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
> 
> 
> On 3/25/2006 8:51 PM, Steven Lacey wrote:
>> Duncan,
>>
>> Thanks! This is progress! One solution might be to remove all 
>> environments from the objects that I want to save in the "sa" object, 
>> thereby avoiding the problem of saving environments altogether. But, 
>> can I remove the environment from a function? Does that even make 
>> sense given how R operates under the hood? Even if I could, would the 
>> functions still work?
> 
> Functions in R consist of 3 parts:  the formals, the body, and the 
> environment.  You can't remove any part, but you can change it.
> 
> You need at least the base environment, or R won't know what things like 
> "+" mean.  You'd normally want .GlobalEnv (the top level workspace where 
> you work), because that implies access to all the attached packages, and 
> your functions may well need access to some of the functions, e.g. from 
> the stats package.  The risk is that you may have redefined a standard 
> function, and so get non-standard results:  that's why many (most?) 
> packages define namespaces, and use those as the enclosures of their 
> functions.
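
To make the three parts concrete, here is a small sketch. The function f is a made-up example, and stats::sd is used only because it is a convenient package function to inspect:

```r
f <- function(x, na.rm = FALSE) sum(x, na.rm = na.rm)

arg_names <- names(formals(f))   # the formals: "x" and "na.rm"
f_body <- body(f)                # the body: sum(x, na.rm = na.rm)
f_env <- environment(f)          # the environment f was defined in

# Package functions are enclosed by their namespace, not by .GlobalEnv.
sd_env <- environmentName(environment(stats::sd))

# The environment cannot be removed, but it can be replaced.
# baseenv() is enough here because sum() lives in base.
environment(f) <- baseenv()
total <- f(c(1, NA, 2), na.rm = TRUE)
```

With baseenv() as the enclosure, f would no longer see anything from stats or from the workspace, which is fine for this f but would break a function that calls, say, sd().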
> 
>> Here is my more general problem. As I learn more about R and the 
>> demands made on my code change, I sometimes change a function 
>> referenced by a given name rather than explicitly defining a new 
>> version of that function. This creates a problem when I want to review 
>> how the model stored in the "sa" object was originally created. If 
>> only the function name is stored in the "sa" object, I won't 
>> necessarily know what version was actually called at the time the 
>> model was constructed because I did not rename it. To deal with this I 
>> decided to store the function itself.
>>
>> Sounds like this may not be a great idea, or at least comes with 
>> serious trade-offs, particularly as some functions are generic like 
>> the mean. Is there a better way to save a function than to save the 
>> function itself or just its name? For instance, do args() and body() 
>> return an associated environment? I assume I could recreate the 
>> original function from these objects, correct? If so, is there some 
>> easy way to do it?
> 
> No, args() and body() don't return the environment, and that means that 
> you *can't* recreate the original function, because the environment is 
> an integral part of an R function.  It's where the definition for 
> everything external comes from.
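
A sketch of what goes wrong, using a hypothetical closure factory make_scaler(). The rebuilt copy has the same formals and body as the original, yet gives a different answer because it resolves k somewhere else:

```r
make_scaler <- function(k) {
  function(x) k * x
}
times3 <- make_scaler(3)
original <- times3(2)            # k is 3 in the closure's environment

# Rebuild from the formals and body only. The new function keeps the
# environment this code runs in (the workspace, at top level).
rebuilt <- function() NULL
formals(rebuilt) <- formals(times3)
body(rebuilt) <- body(times3)

k <- 10                          # an unrelated k in the workspace
copy <- rebuilt(2)               # silently uses k = 10
```

Without the stray k the copy would fail with an error instead, which is arguably the better of the two outcomes.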
> 
>> Alternatively, are there any version control tools built into R? That 
>> is, is there a way R can keep track of the version for me (as opposed 
>> to explicitly declaring different versions foo<-..., foo.v1<-..., 
>> foo.v2<-...)?
> 
> Not really.  There are some version control mechanisms for packages, but 
> not any finer than that.
> 
>> I am not
>> sure exactly what I am asking for here. The more I write the more this 
>> seems unreasonable. A new function requires a new name, right? I just 
>> find myself writing lots of new versions and keeping track of their 
>> names, which one does what, and changing the names in other functions 
>> that call them a little overwhelming. Maybe the way to deal with this 
>> is to write different versions of same package. That way the versions 
>> will affect the naming of and the call to load the package, but not 
>> the calls to individual functions. This way functions can have the 
>> same name, but do different things depending on the package version, 
>> not the function name. However, I have never created a package and 
>> would prefer not to do so in the short-term (my dissertation is due in 
>> August), unless it is fairly straightforward.
> 
> It's not too hard, and once you get used to it, it's definitely 
> worthwhile.  With a 5 month deadline I think it will help you more than 
> delay you.
> 
> The "Writing R Extensions" manual is the place to look.  You can use the 
> package.skeleton() function to set up the initial structure for you, 
> then edit it at your leisure.
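
A sketch of that first step, assuming a single hypothetical function foo and a made-up package name "mypkg". It writes into tempdir() so that nothing in the working directory is touched:

```r
# Hypothetical function to package.
foo <- function(x) x + 1

pkg_parent <- tempdir()          # scratch location for the skeleton
utils::package.skeleton(name = "mypkg", list = "foo",
                        environment = environment(),
                        path = pkg_parent, force = TRUE)

# The skeleton holds DESCRIPTION, R/ and man/, ready for editing.
created <- list.files(file.path(pkg_parent, "mypkg"))
```

For real work you would point path at a directory under version control rather than at tempdir().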
> 
> The advantage of this is that things are much more structured, and it's 
> harder to lose track of what's going on.  If you use Subversion or CVS 
> it's very easy to maintain a revision history of a package.
>> The more I think about it a package is more accurately what I want. I 
>> want to be able to recreate the analysis of my data long after it has 
>> been completed. If I had packages, then I would just need to know what 
>> version of the package was used, load it, and re-run the analysis. I 
>> wouldn't need to store the critical functions in the object. Where 
>> might I find good introduction to writing packages?
>>
>> In the short-term would the solution above (using body and args) work?
> 
> It's not really safe.  What if your function needed a nonstandard 
> environment, e.g. it was the result of approxfun()?
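
A sketch of the approxfun() case. The exact contents of the returned function's environment are an implementation detail of the stats package, so treat the names inspected here as an assumption about current versions of R:

```r
f <- stats::approxfun(x = 1:3, y = c(10, 20, 30))
midpoint <- f(1.5)               # linear interpolation between 10 and 20

# The data live in f's environment, not in its formals or body.
captured <- ls(environment(f))

# Same formals and body, but enclosed by the workspace instead.
rebuilt <- function() NULL
formals(rebuilt) <- formals(f)
body(rebuilt) <- body(f)
environment(rebuilt) <- globalenv()

# The copy cannot find what the body refers to, so the call fails.
outcome <- tryCatch(rebuilt(1.5), error = function(e) e)
```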
> 
> Duncan Murdoch
>> Thanks again,
>> Steve
>>
>>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>> Sent: Saturday, March 25, 2006 5:31 PM
>> To: Steven Lacey
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] object size vs. file size
>>
>>
>> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>>> Hi,
>>>  
>>> There is a rather large discrepancy in the size of the object as it
>>> lives in R and the size of the object when it is written to the disk. 
>>> The object in question is an S4 object of a homemade class "sa". I first call 
>>> a function that writes a list of these objects to a file called 
>>> "data.RData". The size of this file is 14,411 KB. I would assume on 
>>> average then, that each list component--there are 32 sa objects in 
>>> data.RData--would be approximately 450 KB (14,411/32). However, when I 
>>> load the data into R and call object.size on just one S4 object (call 
>>> it tmp) it returns 77496 bytes (77 KB)! What is even stranger is that 
>>> if I save this S4 object alone by calling save(tmp, file="tmp.RData"), 
>>> tmp.RData is 13.3 MB! I understand from the help on object.size that 
>>> the object size is only approximate and excludes the space required 
>>> to store its name in the symbol table. But, this difference in object 
>>> size and file size is huge! This phenomenon occurs no matter which S4 
>>> object I save from data.RData.
>>>  
>>> Why is the object so big when it is in a file? What else is getting
>>> stored with it? I have examined the object in R to find additional 
>>> information stored with it, but have not found anything that would 
>>> account for the size of the object in the file system. For example,
>>>> environment(tmp)
>>> NULL
>> I'm not 100% sure where the problem is, but I think it probably does
>> involve environments.  Your tmp object contains a number of functions. 
>> I think when some function is saved, its environment is being saved too, 
>> and the environment contains much more than you thought.
>>
>> R doesn't normally save a new copy of a package or namespace environment
>> when it saves a function, nor does it save a complete copy of .GlobalEnv 
>> with every function defined there, but it does save the environment in 
>> some other circumstances.  For example, look at this code:
>>
>>  > f <- function() {
>> +       notused <- 1:1000000
>> +       value <- function() 1
>> +       return(value)
>> +  }
>>  >
>>  >  g <- f()
>>  >  g
>> function() 1
>> <environment: 01B10D1C>
>>  >  save(g, file='g.RData')
>>  > object.size(g)
>> [1] 200
>>
>> The g object is 200 bytes or so, but when it is saved, the defining
>> environment containing that huge "notused" variable is saved with it, so 
>> g.RData ends up being about 4 megabytes in size.
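
One way to see the effect without writing a file is to measure the serialized form directly. This is only a sketch: serialize() gives roughly what save() would write before compression, and numeric(1e6) stands in for 1:1000000 because recent versions of R can store an integer sequence compactly:

```r
f <- function() {
  notused <- numeric(1e6)        # about 8 MB of doubles
  value <- function() 1
  return(value)
}
g <- f()

# The serialized function drags its defining frame along.
with_frame <- length(serialize(g, NULL))

# If the function does not need that frame, re-home it before saving.
environment(g) <- globalenv()
without_frame <- length(serialize(g, NULL))
```

Re-homing is only safe when the function uses nothing from the frame it was created in; g qualifies here because it just returns 1.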
>>
>> I don't know any function that will help to diagnose where this happens. 
>> Here's one that doesn't quite work:
>>
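>> findenvironments <- function(x) {
>>      # environment(NULL) returns the current frame, so skip NULLs
>>      if (is.null(x)) return(NULL)
>>      # environment attached to x itself (function or formula), if any
>>      e <- environment(x)
>>      if (is.null(e)) result <- NULL
>>      else result <- list(e)
>>      # look inside lists, including list-based S3 objects
>>      x <- unclass(x)
>>      if (is.list(x)) {
>>         for (i in seq(along=x)) {
>>           contained <- findenvironments(x[[i]])
>>           if (length(contained)) result <- c(result, contained)
>>         }
>>      }
>>      # pause for inspection whenever an environment was found
>>      if (length(result)) browser()
>>      result
>> }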
>>
>> This won't recurse into the slots of an S4 object, so it doesn't really
>> help you, and I'm not sure how to do that.  But maybe someone else can 
>> fix it.
>>
>> Duncan Murdoch
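
One possible way to extend the findenvironments() idea quoted above to S4 objects is to walk the slots with slotNames() and slot() from the methods package. This is an untested-in-anger sketch, not a complete tool: find_envs, the "Holder" class and make_counter() are all hypothetical names, and the function ignores attributes and does not look inside the environments it finds.

```r
library(methods)

find_envs <- function(x, path = "x") {
  # environment(NULL) would return the current frame, so stop early.
  if (is.null(x)) return(list())
  found <- list()
  e <- environment(x)            # set for functions and formulas
  if (!is.null(e)) found[[path]] <- e
  if (isS4(x)) {
    # Recurse into every slot of an S4 object.
    for (s in slotNames(x)) {
      found <- c(found, find_envs(slot(x, s), paste0(path, "@", s)))
    }
  } else if (is.list(x)) {
    # Recurse into list elements, ignoring any S3 class.
    items <- unclass(x)
    for (i in seq_along(items)) {
      found <- c(found, find_envs(items[[i]], paste0(path, "[[", i, "]]")))
    }
  }
  found
}

# Hypothetical class with a function slot and a list slot.
setClass("Holder", slots = c(fun = "function", items = "list"))
make_counter <- function() {
  big <- numeric(1e5)
  function() length(big)
}
h <- new("Holder", fun = make_counter(), items = list(1, NULL, y ~ x))

envs <- find_envs(h, "h")
where <- names(envs)             # paths to the parts that carry an environment
```

The names of the result say where each environment is attached, so calling ls() on envs[["h@fun"]] shows what a saved copy of h would carry along.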