[R] object size vs. file size

Steven Lacey slacey at umich.edu
Sun Mar 26 05:06:35 CEST 2006


Duncan, 

Thanks so much. This exchange has been really informative!

When you say...

"No, args() and body() don't return the environment, and that means that 
you *can't* recreate the original function, because the environment is 
an integral part of an R function.  It's where the definition for 
everything external comes from."

What do you mean that the environment is where everything external to the
function comes from? Aren't arguments passed into a function? Aren't they
independent of the function's environment?

I suspect I need a deeper understanding of environments and functions. Would
you recommend a good reference?

Thanks, 
Steve

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
Sent: Saturday, March 25, 2006 9:12 PM
To: Steven Lacey
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] object size vs. file size


On 3/25/2006 8:51 PM, Steven Lacey wrote:
> Duncan,
> 
> Thanks! This is progress! One solution might be to remove all 
> environments from the objects that I want to save in the "sa" object, 
> thereby avoiding the problem of saving environments altogether. But, 
> can I remove the environment from a function? Does that even make 
> sense given how R operates under the hood? Even if I could, would the 
> functions still work?

Functions in R consist of 3 parts:  the formals, the body, and the 
environment.  You can't remove any part, but you can change it.
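
As a rough sketch of what "you can change it" could look like in practice (the names f, g, size_with_env and size_without are only illustrative, and serialize() is used here as a stand-in for save() so that no file has to be written):

```r
# A function that returns a closure; the closure's environment is the frame
# of f(), which also holds a large object the closure never uses.
f <- function() {
  # numeric() rather than 1:n, on the assumption that recent R versions
  # may store plain integer sequences compactly.
  notused <- numeric(1e5)
  value <- function() 1
  value
}
g <- f()

# The three parts of the function.
formals(g)       # NULL, since g takes no arguments
body(g)          # the constant 1
environment(g)   # the frame of the call to f(), still holding 'notused'

# Bytes needed to serialize g together with its environment.
size_with_env <- length(serialize(g, NULL))

# Swap the environment for the global one. This is only safe because the
# body of g refers to nothing that lived in the old environment.
environment(g) <- globalenv()
size_without <- length(serialize(g, NULL))

c(with_env = size_with_env, without = size_without)
```

The environment is replaced, not removed, and that is only harmless when the function body does not depend on anything defined there.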

You need at least the base environment, or R won't know what things like 
"+" mean.  You'd normally want .GlobalEnv (the top level workspace where 
you work), because that implies access to all the attached packages, and 
your functions may well need access to some of the functions, e.g. from 
the stats package.  The risk is that you may have redefined a standard 
function, and so get non-standard results:  that's why many (most?) 
packages define namespaces, and use those as the enclosures of their 
functions.
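
A small sketch of that risk, assuming a hypothetical helper avg() and an accidental redefinition of mean() in the workspace:

```r
# avg() is defined in the workspace, so it looks up mean() there first.
avg <- function(v) mean(v)

# An accidental redefinition of a standard function in the same workspace.
mean <- function(x, ...) -1

masked <- avg(c(1, 2, 3))   # -1: avg() picked up the redefined mean()

# Using the base namespace as the enclosure, the way a package function
# would use its own namespace, makes the lookup find the standard mean().
environment(avg) <- asNamespace("base")
safe <- avg(c(1, 2, 3))     # 2

rm(mean)                    # remove the redefinition again
```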

> 
> Here is my more general problem. As I learn more about R and the 
> demands made on my code change, I sometimes change a function 
> referenced by a given name rather than explicitly defining a new 
> version of that function. This creates a problem when I want to review 
> how the model stored in the "sa" object was originally created. If 
> only the function name is stored in the "sa" object, I won't 
> necessarily know what version was actually called at the time the 
> model was constructed because I did not rename it. To deal with this I 
> decided to store the function itself.
> 
> Sounds like this may not be a great idea, or at least comes with 
> serious trade-offs, particularly as some functions are generic like 
> the mean. Is there a better way to save a function than to save the 
> function itself or just its name? For instance, do args() and body() 
> return an associated environment? I assume I could recreate the 
> original function from these objects, correct? If so, is there some 
> easy way to do it?

No, args() and body() don't return the environment, and that means that 
you *can't* recreate the original function, because the environment is 
an integral part of an R function.  It's where the definition for 
everything external comes from.
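
To illustrate the difference between arguments and "everything external", here is a minimal sketch (make_adder, add5 and rebuilt are made-up names). The argument x is passed in at call time, but the free variable increment is found only through the function's environment:

```r
# A function factory: the returned function uses 'increment', which is not
# one of its arguments. It is looked up in the enclosing environment.
make_adder <- function(increment) {
  function(x) x + increment
}
add5 <- make_adder(5)
original <- add5(1)          # 6

# Rebuild a function from the formals and body alone, giving it a fresh
# environment that knows nothing about 'increment'.
rebuilt <- as.function(c(formals(add5), list(body(add5))),
                       envir = new.env(parent = baseenv()))

identical(body(rebuilt), body(add5))   # TRUE: the code is the same

# Calling it fails, because 'increment' can no longer be found.
outcome <- tryCatch(rebuilt(1), error = function(e) e)
inherits(outcome, "error")
```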

> Alternatively, are there any version control tools built into R? That 
> is, is there a way R can keep track of the version for me (as opposed 
> to explicitly declaring different versions foo<-..., foo.v1<-..., 
> foo.v2<-...)?

Not really.  There are some version control mechanisms for packages, but 
not any finer than that.

> I am not
> sure exactly what I am asking for here. The more I write the more this 
> seems unreasonable. A new function requires a new name, right? I just 
> find myself writing lots of new versions and keeping track of their 
> names, which one does what, and changing the names in other functions 
> that call them a little overwhelming. Maybe the way to deal with this 
> is to write different versions of the same package. That way the versions 
> will affect the naming of and the call to load the package, but not 
> the calls to individual functions. This way functions can have the 
> same name, but do different things depending on the package version, 
> not the function name. However, I have never created a package and 
> would prefer not to do so in the short-term (my dissertation is due in 
> August), unless it is fairly straightforward.

It's not too hard, and once you get used to it, it's definitely 
worthwhile.  With a 5 month deadline I think it will help you more than 
delay you.

The "Writing R Extensions" manual is the place to look.  You can use the 
package.skeleton() function to set up the initial structure for you, 
then edit it at your leisure.

The advantage of this is that things are much more structured, and it's 
harder to lose track of what's going on.  If you use Subversion or CVS 
it's very easy to maintain a revision history of a package.
> 
> The more I think about it a package is more accurately what I want. I 
> want to be able to recreate the analysis of my data long after it has 
> been completed. If I had packages, then I would just need to know what 
> version of the package was used, load it, and re-run the analysis. I 
> wouldn't need to store the critical functions in the object. Where 
> might I find a good introduction to writing packages?
> 
> In the short-term would the solution above (using body and args) work?

It's not really safe.  What if your function needed a nonstandard 
environment, e.g. it was the result of approxfun()?
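
A sketch of the approxfun() case, with made-up data and a hypothetical rebuilt copy. The interpolation points are stored in the environment of the returned function, so a copy made from the formals and body alone has nothing to interpolate from:

```r
# approxfun() returns a function whose environment holds the data.
af <- stats::approxfun(x = c(1, 2, 3), y = c(10, 20, 30))
interpolated <- af(2.5)          # 25, halfway between (2, 20) and (3, 30)

# The knots live in the closure's environment, not in its arguments.
stored <- ls(environment(af))
c("x", "y") %in% stored

# Rebuild from formals and body only, in a fresh environment.
rebuilt <- as.function(c(formals(af), list(body(af))),
                       envir = new.env(parent = baseenv()))

# The fresh environment does not contain the data ...
has_data <- exists("y", envir = environment(rebuilt), inherits = FALSE)

# ... so the call is expected to fail rather than interpolate.
outcome <- tryCatch(rebuilt(2.5), error = function(e) e)
```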

Duncan Murdoch
> 
> Thanks again,
> Steve
> 
> 
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> Sent: Saturday, March 25, 2006 5:31 PM
> To: Steven Lacey
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
> 
> 
> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>> Hi,
>>  
>> There is a rather large discrepancy between the size of the object as it
>> lives in R and the size of the object when it is written to the disk. 
>> The object in question is an S4 object of a homemade class "sa". I first call 
>> a function that writes a list of these objects to a file called 
>> "data.RData". The size of this file is 14,411 KB. I would assume on 
>> average then, that each list component--there are 32 sa objects in 
>> data.RData--would be approximately 450 KB (14,411/32). However, when I 
>> load the data into R and call object.size on just one S4 object (call 
>> it tmp) it returns 77496 bytes (77 KB)! What is even stranger is that 
>> if I save this S4 object alone by calling save(tmp, file="tmp.RData"), 
>> tmp.RData is 13.3 MB! I understand from the help on object.size that 
>> the object size is only approximate and excludes the space required 
>> to store its name in the symbol table. But, this difference in object 
>> size and file size is huge! This phenomenon occurs no matter which S4 
>> object I save from data.RData.
>>  
>> Why is the object so big when it is in a file? What else is getting
>> stored with it? I have examined the object in R to find additional 
>> information stored with it, but have not found anything that would 
>> account for the size of the object in the file system. For example,
>>> environment(tmp)
>> NULL
> 
> I'm not 100% sure where the problem is, but I think it probably does
> involve environments.  Your tmp object contains a number of functions. 
> I think when some function is saved, its environment is being saved too, 
> and the environment contains much more than you thought.
> 
> R doesn't normally save a new copy of a package or namespace environment
> when it saves a function, nor does it save a complete copy of .GlobalEnv 
> with every function defined there, but it does save the environment in 
> some other circumstances.  For example, look at this code:
> 
>  > f <- function() {
> +       notused <- 1:1000000
> +       value <- function() 1
> +       return(value)
> +  }
>  >
>  >  g <- f()
>  >  g
> function() 1
> <environment: 01B10D1C>
>  >  save(g, file='g.RData')
>  > object.size(g)
> [1] 200
> 
> The g object is 200 bytes or so, but when it is saved, the defining
> environment containing that huge "notused" variable is saved with it, so 
>   g.RData ends up being about 4 Megabytes in size.
> 
> I don't know any function that will help to diagnose where this happens. 
>   Here's one that doesn't quite work:
> 
> findenvironments <- function(x) {
>      e <- environment(x)
>      if (is.null(e)) result <- NULL
>      else result <- list(e)
>      x <- unclass(x)
>      if (is.list(x)) {
>         for (i in seq(along=x)) {
>           contained <- findenvironments(x[[i]])
>           if (length(contained)) result <- c(result, contained)
>         }
>      }
>      if (length(result)) browser()
>      result
> }
> 
> This won't recurse into the slots of an S4 object, so it doesn't really
> help you, and I'm not sure how to do that.  But maybe someone else can 
> fix it.
> 
> Duncan Murdoch
> 
> 
> 
>
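
One possible variant of the findenvironments() idea quoted above that also descends into S4 slots, offered only as a sketch. The name find_environments, the class "Holder" and its slots are hypothetical, and attributes other than slots are not searched:

```r
library(methods)

# Collect the environments attached to an object, descending into list
# elements and S4 slots.
find_environments <- function(x) {
  # environment(NULL) would return the caller's frame, so stop early.
  if (is.null(x)) return(list())
  # An environment stored directly counts as a hit.
  if (is.environment(x)) return(list(x))

  e <- environment(x)   # NULL unless x is a closure or has an .Environment
  result <- if (is.environment(e)) list(e) else list()

  if (isS4(x)) {
    for (nm in slotNames(x)) {
      result <- c(result, find_environments(slot(x, nm)))
    }
  } else if (is.list(x)) {
    for (el in unclass(x)) {
      result <- c(result, find_environments(el))
    }
  }
  result
}

# Hypothetical S4 class holding a closure in one of its slots.
setClass("Holder", representation(fit = "function", label = "character"))
make_fn <- function() {
  big <- numeric(10)
  function() 1
}
obj <- new("Holder", fit = make_fn(), label = "demo")

envs <- find_environments(obj)
length(envs)            # one environment, the frame of make_fn()
ls(envs[[1]])           # shows what would be saved along with the object
```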



