[R] Cforest and Random Forest memory use

Max Kuhn mxkuhn at gmail.com
Mon Jun 14 16:19:11 CEST 2010


The first thing I would recommend is to avoid the "formula
interface" to models. The internals that R uses to create model
matrices from a formula + data set are not efficient. If you had a
large number of variables, I would have immediately pointed to that
as the source of your issues. cforest and ctree only have formula
interfaces, though, so you are stuck there. The randomForest package
has both interfaces, so that might be better.
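
For example (untested, and assuming a data frame called 'example'
with a factor response 'badflag', the names from your code below):

    library(randomForest)

    ## formula interface -- builds a model matrix internally:
    ## rf1 <- randomForest(badflag ~ ., data = example, ntree = 500)

    ## x/y interface -- skips the formula machinery and is lighter
    ## on memory:
    x <- example[, setdiff(names(example), "badflag")]
    y <- example$badflag
    rf2 <- randomForest(x = x, y = y, ntree = 500)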

The more likely issue is the depth of the trees. With that many
observations, you are likely to get extremely deep trees. You might
try limiting the depth of the trees and see if that has an effect on
memory use.
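
randomForest has no direct depth argument, but nodesize and maxnodes
get you much the same effect (a sketch, reusing the x and y from
above; the values are placeholders, not recommendations):

    ## larger terminal nodes and a cap on the number of terminal
    ## nodes keep each tree shallow and the forest object smaller:
    rf_small <- randomForest(x = x, y = y, ntree = 500,
                             nodesize = 50,   # default is 1 for classification
                             maxnodes = 128)  # cap on terminal nodes per tree

For cforest, I believe cforest_control() forwards minsplit, minbucket
and maxdepth to the tree-growing controls, if your version of party
accepts them:

    ## ctrl <- cforest_control(ntree = 500, mtry = 10,
    ##                         minsplit = 200, minbucket = 100,
    ##                         maxdepth = 10)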

We run into these issues with large compound libraries; in those cases
we do whatever we can to avoid ensembles of trees or kernel methods.
If you want those, you might need to write your own code that is
hyper-efficient and tuned to your particular data structure (as we
did).

On another note... are this many observations really needed? You
have 40-odd variables; I suspect that >1M points are pretty densely
packed into 40-dimensional space. Do you lose much by sampling the
data set or by allocating a large portion to a test set? If you had
thousands of predictors, I could see the need for so many
observations, but I'm wondering if many of the samples are redundant.
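
Something along these lines would tell you quickly (again untested,
with the same assumed names; the 100k subsample size is arbitrary,
it just has to be well under your >1M rows):

    set.seed(42)
    n         <- nrow(example)
    train_idx <- sample(n, size = 100000)   # fit on a 100k subsample
    train     <- example[train_idx, ]
    test      <- example[-train_idx, ]      # the rest becomes a test set

    rf   <- randomForest(badflag ~ ., data = train, ntree = 500)
    pred <- predict(rf, newdata = test)

If accuracy on the held-out rows barely moves as you grow the
subsample, the extra observations are not buying you much.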

Max

On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane <mlokane at gmail.com> wrote:
> Answers added below.
> Thanks again,
> Matt
>
> On 11 June 2010 14:28, Max Kuhn <mxkuhn at gmail.com> wrote:
>>
>> Also, you have not said:
>>
>>  - your OS: Windows Server 2003 64-bit
>>  - your version of R: 2.11.1 64-bit
>>  - your version of party: 0.9-9995
>>  - your code:
>>
>>      test.cf <- cforest(formula = badflag ~ ., data = example,
>>                         control = cforest_control(teststat = 'max',
>>                             testtype = 'Teststatistic', replace = FALSE,
>>                             ntree = 500, savesplitstats = FALSE,
>>                             mtry = 10))
>>
>>  - what "Large data set" means: > 1 million observations, 40+ variables,
>> around 200MB
>>  - what "very large model objects" means - anything which breaks
>>
>> So... how is anyone supposed to help you?
>>
>> Max



-- 

Max


