[R] Cforest and Random Forest memory use

Max Kuhn mxkuhn at gmail.com
Fri Jun 18 19:35:23 CEST 2010


Rich's calculations are correct, but from a practical standpoint I
think that using all the data for the model is overkill for a few
reasons:

- the calculations that you show implicitly assume that the predictor
values can be reliably differentiated from each other. Unless they are
deterministic calculations (e.g. number of hydrogen bonds, % GC in a
sequence), the values carry measurement error. We don't know anything
about the context here, but in the lab sciences the measurement
variation can make the *effective* number of predictor values much
less than n. So you can have millions of predictor values but you
might only be able to reliably differentiate k << n of them.

- the important dimensionality to consider is how many of those 40
predictors are actually relevant to the outcome. Again, we don't know
the context of the data, but there is a strong prior towards the
number of important variables being less than 40.

- We've had to consider these types of problems a lot. We might have
200K samples (compounds in this case) and 1000 predictors that appear
to matter. Ensembles of trees tended to do very well, as did kernel
methods, but in either of those two classes of models the prediction
time for a single new observation is very long. So we looked at how
performance was affected if we reduced the training set size. In
essence, we found that <50% of the data could be used with no
appreciable effect on performance. We could make the percentage even
smaller by using the predictor values themselves to choose the
training samples: if we had m samples in the training set, the next
sample added would be the one with maximum dissimilarity to the
existing m samples (a rough code sketch of this, and of the hold-out
idea in the next point, follows these points).

- If you are going to do any feature selection, you would be better
off segregating a percentage of those million samples as a hold-out
set to validate the selection process (a few people from Merck have
written excellent papers on the selection bias problem). Similarly, if
this is a classification problem, any ROC curve analysis is most
effective when the cutoffs are derived from a separate hold-out data
set (see the sketch below). Just dumping all of those samples into a
training set seems like a lost opportunity.
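
To make the last two points concrete, here is a rough sketch using
the caret package (dat and badflag stand in for your data frame and
two-class factor outcome, the split proportions and sample sizes are
arbitrary, and the 40-odd predictors are assumed to be numeric):

  library(caret)
  set.seed(1)

  ## 1. carve off a hold-out set for validating feature selection
  ##    and for deriving ROC cutoffs later on
  in_train <- createDataPartition(dat$badflag, p = 0.5, list = FALSE)
  training <- dat[ in_train, ]
  holdout  <- dat[-in_train, ]

  ## 2. build a much smaller training set by maximum-dissimilarity
  ##    sampling: start from a small random core, then repeatedly add
  ##    the candidate row most dissimilar to what has been selected
  preds <- subset(training, select = -badflag)
  start <- sample(nrow(preds), 100)
  pool  <- setdiff(seq_len(nrow(preds)), start)

  ## with this many rows you would want randomFrac < 1 (or chunking)
  ## to keep the dissimilarity calculations tractable
  added <- maxDissim(preds[start, ], preds[pool, ], n = 5000,
                     randomFrac = 0.1)

  small_train <- training[c(start, pool[added]), ]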

Again, these are not refutations of your calculations. I just think
that there are plenty of non-theoretical arguments for not using all
of those values for the training set.

Thanks,

Max
On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter <gunter.berton at gene.com> wrote:
> Rich is right, of course. One way to think about it is this (paraphrased
> from the section on the "Curse of Dimensionality" in Hastie et al.'s
> "Statistical Learning" book): suppose 10 uniformly distributed points on a
> line give what you consider to be adequate coverage of the line. Then in 40
> dimensions, you'd need 10^40 uniformly distributed points to give equivalent
> coverage.
>
> Various other aspects of the curse of dimensionality are discussed in the
> book, one of which is that in high dimensions most points are closer to the
> boundaries than to each other. As Rich indicates, this has profound
> implications for what one can sensibly do with such data. One example is
> that nearest neighbor procedures don't make much sense (as nobody is likely
> to have anybody else nearby), which Rich's little simulation nicely
> demonstrated.
>
> Cheers to all,
>
> Bert Gunter
> Genentech Nonclinical Statistics
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Raubertas, Richard
> Sent: Thursday, June 17, 2010 4:15 PM
> To: Max Kuhn; Matthew OKane
> Cc: r-help at r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Max Kuhn
>> Sent: Monday, June 14, 2010 10:19 AM
>> To: Matthew OKane
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Cforest and Random Forest memory use
>>
>> The first thing that I would recommend is to avoid the "formula
>> interface" to models. The internals that R uses to create matrices
>> from a formula + data set are not efficient. If you had a large number
>> of variables, I would have automatically pointed to that as a source
>> of issues. cforest and ctree only have formula interfaces, though, so
>> you are stuck on that one. The randomForest package has both
>> interfaces, so that might be better.
>>
>> Probably the issue is the depth of the trees. With that many
>> observations, you are likely to get extremely deep trees. You might
>> try limiting the depth of the tree and see if that has an effect on
>> performance.
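>>
>> For example, something along these lines (a rough sketch; dat and the
>> settings below are placeholders, and badflag is assumed to be a factor):
>>
>>   library(randomForest)
>>
>>   ## matrix/data frame interface -- skips the formula machinery
>>   x <- dat[, setdiff(names(dat), "badflag")]
>>   y <- dat$badflag
>>
>>   ## larger terminal nodes and a cap on the number of nodes keep the
>>   ## trees shallow, which also keeps the fitted object smaller
>>   rf_fit <- randomForest(x, y, ntree = 500, mtry = 10,
>>                          nodesize = 50, maxnodes = 64)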
>>
>> We run into these issues with large compound libraries; in those cases
>> we do whatever we can to avoid ensembles of trees or kernel methods.
>> If you want those, you might need to write your own code that is
>> hyper-efficient and tuned to your particular data structure (as we
>> did).
>>
>> On another note... are this many observations really needed? You have
>> 40ish variables; I suspect that >1M points are pretty densely packed
>> into 40-dimensional space.
>
> This did not seem right to me:  40-dimensional space is very, very big
> and even a million observations will be thinly spread.  There is probably
> some analytic result from the theory of coverage processes about this,
> but I just did a quick simulation.  If a million samples are independently
> and randomly distributed in a 40-d unit hypercube, then >90% of the points
> in the hypercube will be more than one-quarter of the maximum possible
> distance (sqrt(40)) from the nearest sample.  And about 40% of the hypercube
> will be more than one-third of the maximum possible distance to the nearest
> sample.  So the samples do not densely cover the space at all.
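>
> The simulation was along these lines (scaled down here to 10^5 samples
> so that it runs quickly; the figures above came from the full 10^6
> version):
>
>   set.seed(1)
>   d <- 40                      # number of predictors
>   n <- 1e5                     # "training" samples
>   m <- 200                     # random points probing the hypercube
>
>   train <- matrix(runif(n * d), ncol = d)
>   probe <- matrix(runif(m * d), ncol = d)
>
>   tr_t <- t(train)
>   nn_dist <- apply(probe, 1,
>                    function(x) sqrt(min(colSums((tr_t - x)^2))))
>
>   max_dist <- sqrt(d)          # longest possible distance
>   mean(nn_dist > max_dist / 4) # fraction > 1/4 of the max away
>   mean(nn_dist > max_dist / 3) # fraction > 1/3 of the max away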
>
> One implication is that modeling the relation of a response to 40 predictors
> will inevitably require a lot of smoothing, even with a million data points.
>
> Richard Raubertas
> Merck & Co.
>
>> Do you lose much by sampling the data set
>> or allocating a large portion to a test set? If you have thousands of
>> predictors, I could see the need for so many observations, but I'm
>> wondering if many of the samples are redundant.
>>
>> Max
>>
>> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
>> <mlokane at gmail.com> wrote:
>> > Answers added below.
>> > Thanks again,
>> > Matt
>> >
>> > On 11 June 2010 14:28, Max Kuhn <mxkuhn at gmail.com> wrote:
>> >>
>> >> Also, you have not said:
>> >>
>> >>  - your OS: Windows Server 2003 64-bit
>> >>  - your version of R: 2.11.1 64-bit
>> >>  - your version of party: 0.9-9995
>> >
>> >
>> >>
>> >>  - your code:
>> >>
>> >>    test.cf <- cforest(formula = badflag ~ ., data = example,
>> >>                       controls = cforest_control(teststat = 'max',
>> >>                                                  testtype = 'Teststatistic',
>> >>                                                  replace = FALSE,
>> >>                                                  ntree = 500,
>> >>                                                  savesplitstats = FALSE,
>> >>                                                  mtry = 10))
>> >
>> >>  - what "Large data set" means: > 1 million observations,
>> 40+ variables,
>> >> around 200MB
>> >>  - what "very large model objects" means - anything which breaks
>> >>
>> >> So... how is anyone supposed to help you?
>> >>
>> >> Max


