[R] Large file size while persisting rpart model to disk

Duncan Murdoch murdoch at stats.uwo.ca
Wed Feb 4 16:12:57 CET 2009


One correction below, and a suggested alternative approach.

On 2/4/2009 9:31 AM, Terry Therneau wrote:
>   In R, functions remember their entire calling chain.  The good thing about 
> this is that they can find variables further up in the nested context, i.e.,
>     mfun <- function(x) { x+y}
> will look for 'y' in the function that called mfun, then in the function that
> called the function, .... on up and then through the search() list.  This makes
> life easier for certain things such as minimizers.

This description is not right: R looks for 'y' in the environment where 
mfun was created, not in the function that called it.  So it applies to 
nested functions (as you said), but the caller is irrelevant.
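
A small sketch of the difference, using made-up names (make_mfun, caller, 
top) that are not from any package.  The free variable y is resolved in 
the environment where each function was defined, and the caller's own y 
is never consulted:

```r
y <- "global"

make_mfun <- function() {
  y <- "defining environment"
  function() y          # free variable: looked up where this function was created
}

caller <- function(f) {
  y <- "caller"         # local to caller(); f() never searches this frame
  f()
}

mfun <- make_mfun()
res_nested <- caller(mfun)   # "defining environment"

top <- function() y          # created at top level, so its y is the top-level one
res_top <- caller(top)       # "global"
```

In neither call does the result come out as "caller".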

> 
>   The bad thing is that to make this work R has to remember all of the variables 
> that were available up the entire chain, and 99-100% of them aren't necessary.  
> (Because of constructs like get(varname) a parser can't read the code to decide 
> what might be needed).  

I'm not sure what you mean by "chain" here, but the real issue is that 
all the variables in the function that creates mfun will be kept as long 
as mfun exists.

> 
>   This is an issue with embedded functions.  I recently noticed an extreme case 
> of it in the pspline routine and made changes to fix it.  The short version
>   	pspline(x, ...other args) {
>   		some computations to define an X matrix, which can be large
>   		define a print function
>   		...
>   		return(X, printfun, other stuff)
>   		}

So here printfun captures all the local variables in pspline, even if it 
doesn't need them.
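
Here is a rough, runnable stand-in for that pattern.  It is not the 
actual pspline code: make_term and its contents are invented for 
illustration.  Serializing the result (which is what save() does to 
write it to disk) drags X along, because X lives in the environment of 
printfun:

```r
# Hypothetical stand-in for the pspline pattern, not code from the survival package.
make_term <- function(n) {
  X <- matrix(0, n, 10)                   # large intermediate object
  printfun <- function(coef) print(coef)  # never touches X
  list(printfun = printfun)               # X is deliberately not returned
}

term <- make_term(1e5)

# X is still reachable through the closure's environment ...
has_X <- "X" %in% ls(environment(term$printfun))

# ... so it is written out with the function: 1e5 * 10 doubles is about 8 MB.
bytes <- length(serialize(term, NULL))
```

Even though the returned list holds nothing but a one-line function, 
bytes comes out above 8e6.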

> It's even worse in the frailty functions, where X can be VERY large.
> The print function's environment wanted to 'remember' all of the temporary work 
> that went into defining X, plus X itself and so would be huge.  My solution was 
> to add the line
> 	environment(printfun) <- new.env(parent=baseenv())
> which marks the function as not needing anything from the local environment, 
> only the base R definitions.  This would probably be a good addition to rpart, 
> but I need to look closer.
>    My first cut was to use emptyenv(), but that wasn't so smart.  It leaves 
> everything undefined, like "+" for instance. :-)
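
A sketch of the effect of that line, again with invented names 
(make_term_env and its reset argument are assumptions for illustration, 
not part of survival or rpart).  The serialized size is compared with 
and without the reset, and the last lines show the emptyenv() problem:

```r
# Hypothetical builder; 'reset' toggles the environment replacement.
make_term_env <- function(n, reset) {
  X <- matrix(0, n, 10)
  printfun <- function(coef) print(coef)
  if (reset) environment(printfun) <- new.env(parent = baseenv())
  list(printfun = printfun)
}

big   <- length(serialize(make_term_env(1e5, reset = FALSE), NULL))
small <- length(serialize(make_term_env(1e5, reset = TRUE),  NULL))

# With emptyenv() as the enclosure, even "+" cannot be found.
g <- function(a) a + 1
environment(g) <- emptyenv()
res <- try(g(1), silent = TRUE)   # fails: could not find function "+"
```

Here big includes the 8 MB matrix and small does not.  Note that the 
enclosure of baseenv() is the global environment, so a function reset 
this way can still find things on the search path; it just loses access 
to the locals of the function that built it.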

Another approach is simply to rm() the variables that aren't needed 
before returning a function.  For example, this function has locals x 
and y, but only needs y for the returned function to work:

 > fnbuilder <- function(n) {
+    x <- numeric(n)
+    y <- numeric(n)
+    noneedforx <- function() sum(y)
+    rm(x)
+    return(noneedforx)
+ }
 > f <- fnbuilder(10000)
 > f()
[1] 0

To see what actually got carried along with f, use ls():

 > ls(environment(f))
[1] "n"          "noneedforx" "y"

So we've picked up the arg n, and our local copy of noneedforx, but we 
did manage to get rid of x.  (The local copy costs almost nothing:  R 
will just have another reference to the same object as f refers to.  The 
arg could have been rm'd too, if it was big enough to matter.)
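
For completeness, a variant that removes the arg as well (fnbuilder2 is 
just a renamed copy of the function above):

```r
fnbuilder2 <- function(n) {
  x <- numeric(n)
  y <- numeric(n)
  noneedforx <- function() sum(y)
  rm(x, n)                 # drop everything the returned function does not use
  return(noneedforx)
}

f2 <- fnbuilder2(10000)
f2()                       # 0
kept <- ls(environment(f2))   # "noneedforx" "y"
```

The catch with rm() is that you have to keep the list of removed names 
up to date when the builder function changes, whereas replacing the 
environment drops everything at once.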

Duncan Murdoch

>    
>    	Terry Therneau
> 



