[R] Large file size while persisting rpart model to disk

Wed Feb 4 16:57:03 CET 2009

On Wed, 4 Feb 2009, Duncan Murdoch wrote:

> One correction below, and a suggested alternative approach.
>
> On 2/4/2009 9:31 AM, Terry Therneau wrote:
>>   In R, functions remember their entire calling chain.  The good thing 
>> about this is that they can find variables further up in the nested 
>> context, i.e.,
>>     mfun <- function(x) { x+y}
>> will look for 'y' in the function that called mfun, then in the function 
>> that called that function, and so on up, and then through the search() 
>> list.  This makes life easier for certain things such as minimizers.
>
> This description is not right: it's not the caller, it's the environment 
> where mfun was created.  So it applies to nested functions (as you said), but 
> the caller is irrelevant.
>
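To make the distinction concrete, here is a minimal sketch of the lookup
rule described in the correction above.  The names caller and maker are
made up for illustration, and mfun is redefined without its argument to
keep the example short.

```r
# Lexical scoping: a free variable is looked up in the environment where
# the function was defined, not in the function that calls it.
y <- "top level"

mfun <- function() y          # defined here, so it sees the 'y' above

caller <- function() {
    y <- "caller"             # local to caller(); invisible to mfun()
    mfun()
}

maker <- function() {
    y <- "maker"              # local to maker(); captured by the closure
    function() y
}

caller()     # "top level"
maker()()    # "maker"
```

The second call also shows the cost under discussion: the function
returned by maker() keeps all of maker()'s local variables alive.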
>>
>>   The bad thing is that to make this work R has to remember all of the 
>> variables that were available up the entire chain, and 99-100% of them 
>> aren't necessary.  (Because of constructs like get(varname) a parser can't 
>> read the code to decide what might be needed). 
>
> I'm not sure what you mean by "chain" here, but the real issue is that all 
> the variables in the function that creates mfun will be kept as long as mfun 
> exists.
>
>>
>>   This is an issue with embedded functions.  I recently noticed an extreme 
>> case of it in the pspline routine and made changes to fix it.  The short 
>> version:
>>   	pspline(x, ...other args) {
>>   		some computations to define an X matrix, which can be large
>>   		define a print function
>>   		...
>>   		return(X, printfun, other stuff)
>>   		}
>
> So here printfun captures all the local variables in pspline, even if it 
> doesn't need them.
>
>> It's even worse in the frailty functions, where X can be VERY large.
>> The print function's environment wanted to 'remember' all of the temporary 
>> work that went into defining X, plus X itself, and so would be huge.  My 
>> solution was to add the line
>> 	environment(printfun) <- new.env(parent=baseenv())
>> which marks the function as not needing anything from the local 
>> environment, only the base R definitions.  This would probably be a good 
>> addition to rpart, but I need to look closer.
>>    My first cut was to use emptyenv(), but that wasn't so smart.  It leaves 
>> everything undefined, like "+" for instance. :-)
>
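A minimal sketch of the environment reset described above.  The
constructor make_printer is a hypothetical stand-in for pspline, not its
actual code; the point is only to show what the print function's
environment holds after the reset, and why emptyenv() fails.

```r
# Hypothetical constructor: builds a large temporary and returns a small
# print function alongside some summary information.
make_printer <- function(n) {
    X <- matrix(0, n, n)                       # large temporary
    printfun <- function(z) paste("value:", z)
    # Replace the captured environment with a fresh, empty one whose
    # parent is baseenv(), so only base R functions remain visible.
    environment(printfun) <- new.env(parent = baseenv())
    list(dim = dim(X), printfun = printfun)
}

res <- make_printer(100)
ls(environment(res$printfun))    # character(0): X was not carried along
res$printfun(1)                  # "value: 1"

# With emptyenv() even "+" cannot be found, so the call fails.
bad <- function(a) a + 1
environment(bad) <- emptyenv()
msg <- tryCatch(bad(1), error = function(e) conditionMessage(e))
```

One consequence of this design: the function can no longer see anything
outside base, so a function from another package would have to be
written with an explicit pkg::name prefix.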
> Another approach is simply to rm() the variables that aren't needed before 
> returning a function.  For example, this function has locals x and y, but 
> only needs y for the returned function to work:
>
>> fnbuilder <- function(n) {
> +    x <- numeric(n)
> +    y <- numeric(n)
> +    noneedforx <- function() sum(y)
> +    rm(x)
> +    return(noneedforx)
> + }
>> f <- fnbuilder(10000)
>> f()
> [1] 0

I would discourage the use of rm() here, as it changes at runtime the
variables that are defined for subsequent expressions.  It isn't a
problem here, since nothing much happens after the rm(), but in general
it can complicate reading the code for humans or analyzing the code
programmatically.  It is possible that using rm() inside a function may
not be fully supported under all circumstances in the future.  (E.g. it
might signal an error in compiled code or might inhibit useful
compilation or something along those lines.)

My preference in situations where I need to control the captured
environment is to lift the code constructing the closure to the top
level of the package, so continuing with this example that would mean
defining an auxiliary function that creates the closure, something
like

     fnbuilder_y_only <- function(y)
         function() sum(y)

     fnbuilder <- function(n) {
         x <- numeric(n)
         y <- numeric(n)
         noneedforx <- fnbuilder_y_only(y)
         return(noneedforx)
     }

This approach also has the advantage that the environment only
captures what you explicitly provide, whereas with rm you risk
forgetting to take out something large in more complicated code.
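As a rough check of what the lifted closure carries along, the sketch
below repeats the example and inspects the captured environment with
ls().  The force(y) call is an addition that is not in the code above:
R evaluates arguments lazily, and an unevaluated argument still refers
to the frame of the calling function, so forcing it is assumed here to
be the safer choice.

```r
# Helper defined at top level: its environment holds only its argument.
fnbuilder_y_only <- function(y) {
    force(y)                  # addition: evaluate the argument right away
    function() sum(y)
}

fnbuilder <- function(n) {
    x <- numeric(n)           # not needed by the returned function
    y <- numeric(n)
    fnbuilder_y_only(y)
}

f <- fnbuilder(10000)
ls(environment(f))            # "y"
f()                           # 0
```

Neither x nor n appears in the listing, without any rm() call.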

In principle it is possible to analyze the code of the closure
function and capture only the bindings that might be needed.  But since
R's semantics allow functions to look into their callers and the like,
pretty much anything 'might be needed' unless we provide some sort of
declaration mechanism for saying, for example, that only explicitly
referenced variables are to be considered needed.
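A small illustration of that limitation, assuming the codetools package
is installed (it normally ships with R as a recommended package).  The
functions f and g are made up for the example; findGlobals() reports the
names a function refers to explicitly, and a name that is only built at
run time does not show up.

```r
f <- function() sum(y)        # refers to 'y' explicitly
g <- function() get("x")      # refers to 'x' only through a string

vars_f <- codetools::findGlobals(f, merge = FALSE)$variables
vars_g <- codetools::findGlobals(g, merge = FALSE)$variables

vars_f    # "y"
vars_g    # 'x' is not listed
```

So a rule of "keep only what is explicitly referenced" would work for f
but would break g.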

Best,

luke

>
> To see what actually got carried along with f, use ls():
>
>> ls(environment(f))
> [1] "n"          "noneedforx" "y"
>
> So we've picked up the arg n, and our local copy of noneedforx, but we did 
> manage to get rid of x.  (The local copy costs almost nothing:  R will just 
> have another reference to the same object as f refers to.  The arg could have 
> been rm'd too, if it was big enough to matter.)
>
> Duncan Murdoch
>
>>       	Terry Therneau
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu