[Rd] saving objects with embedded environments

McGehee, Robert Robert.McGehee at geodecapital.com
Mon Jul 2 15:27:22 CEST 2007


Thanks for this. So at the risk of treading out too deep into unfamiliar
water, one concern is that if I run 'lm' within a function and then the
function exits, am I still (perhaps unnecessarily) keeping a copy of the
function environment and the associated data? It does not seem that
'update' even works after I exit the function, so it's not clear to me
what help saving the environment and a copy of all of its data is (see
example below).

> B <- data.frame(y=1:100, x=rnorm(100))
> FUN <- function(B) lm(y ~ x, data=B)
> m <- FUN(B)
> rm(B)
## update doesn't find object 'B' in function environment (so why store
the environment?)
> update(m, y ~ 1)  
Error in inherits(x, "data.frame") : object "B" not found

## However, there is a copy of object 'B' saved anyway, even
## after removing it from the global environment and exiting the
function
> dim(get("B", envir=attr(m$terms, ".Environment")))
[1] 100   2

For my purposes all works well now. I brought this up only as one can
quickly run out of memory if data is unnecessarily kept around after
with large models. Anecdotally, before isolating this issue I would
crash my snow/MPI session (followed by R) when trying to transfer these
'lm' and 'lm'-like objects with embedded environments. If I did not
distribute the processing, then I found that I would rather quickly use
up all 24GB of my computer's memory and swap space after repeated calls.

Thanks,
Robert

-----Original Message-----
From: Roger Peng [mailto:rdpeng at gmail.com] 
Sent: Friday, June 29, 2007 7:44 PM
To: McGehee, Robert
Cc: R-devel
Subject: Re: [Rd] saving objects with embedded environments

I believe this is intentional.  See ?serialize.  When lm() is called
in a function, the environment is saved in case the resulting fitted
model object needs to be updated, for example, with update().

if you don't want the linear model object, you might try just saving
the relevant objects to a separate list rather than try to delete
everything that is irrelevant from the 'lm' object.

-roger

On 6/28/07, McGehee, Robert <Robert.McGehee at geodecapital.com> wrote:
> Hello,
> I have been running linear regressions on large data sets. As 'lm'
saves
> a great deal of extraneous (for me) data including the residuals,
> fitted.values, model frame, etc., I generally set these to NULL within
> the object before saving off the model to a file.
>
> In the below example, however, I have found that depending on whether
or
> not I run 'lm' within another function or not, the entire function
> environment is saved off with the file. So, even while object.size and
> all.equal report that both 'lm's are equal and of small size, one
saves
> as a 24MB file and the other as 646 bytes. These seems to be because
in
> the first example the function environment is saved in attr(x1$terms,
> ".Environment") and takes up all 24MB of space.
>
> Anyway, I think this is a bug, or if nothing else very undesirable
(that
> an object reported to be 0.5kb takes up 24MB). There also seems to be
> some inconsistency on how environments are saved depending on if it is
> the global environment or not, though I'm not familiar enough with
> environments to know if this was intentional. Comments are
appreciated.
>
> Thanks,
> Robert
>
> ##################################################################
> testEq <- function(B) {
>     x <- lm(y ~ x1+x2+x3, data=B, model=FALSE)
>     x$residuals <- x$effects <- x$fitted.values <- x$qr$qr <- NULL
>     x
> }
>
> N <- 900000
> B <- data.frame(y=rnorm(N)+1:N, x1=rnorm(N)+1:N, x2=rnorm(N)+1:N,
> x3=rnorm(N)+1:N)
> x1 <- testEq(B)
> x2 <- lm(y ~ x1+x2+x3, data=B, model=FALSE)
> x2$residuals <- x2$effects <- x2$fitted.values <- x2$qr$qr <- NULL
>
> all.equal(x1, x2) ## TRUE
> object.size(x1)  ## 5112
> object.size(x2)  ## 5112
> save(x1, file="x1.RData")
> save(x2, file="x2.RData")
> file.info("x1.RData")$size ## 24063852 bytes
> file.info("x2.RData")$size ## 646 bytes
>
> > R.version
>                _
> platform       i686-pc-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          5.0
> year           2007
> month          04
> day            23
> svn rev        41293
> language       R
> version.string R version 2.5.0 (2007-04-23)
>
>
> Robert McGehee, CFA
> Quantitative Analyst
> Geode Capital Management, LLC
> One Post Office Square, 28th Floor | Boston, MA | 02109
> Tel: 617/392-8396    Fax:617/476-6389
> mailto:robert.mcgehee at geodecapital.com
>
>
>
> This e-mail, and any attachments hereto, are intended for
us...{{dropped}}
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


-- 
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/



More information about the R-devel mailing list