[R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns

Jason Roberts jason.roberts at duke.edu
Fri Oct 14 18:06:40 CEST 2011


I would like to build a forest of regression trees to see how well some
covariates predict a response variable and to examine the importance of the
covariates. I have a small number of covariates (8) and a large number of
records (27368). The response and all of the covariates are continuous
variables.

A cursory examination of the covariates does not suggest they are correlated
in a simple fashion (e.g. the variance inflation factors are all fairly low)
but common sense suggests there should be some relationship: one of them is
the day of the year and some of the others are environmental parameters such
as water temperature. For this reason I would like to follow the advice of
Strobl et al. (2008) and try the authors' conditional variable importance
measure. This is implemented in the party package by calling varimp(...,
conditional=TRUE). Unfortunately, when I call that on my forest I receive
the error:

> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) : 
  term 1 would require 9e+12 columns

Does anyone know what is wrong?

I noticed a post in June 2011 where a user reported this message and the
ultimate problem was that the importance measure was being conditioned on
too many variables (47). I have only a small number of variables here so I
guessed that was not the problem.
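
For what it is worth, the error is raised by model.matrix, and the way its
column count can explode is easy to reproduce outside of party. The following
is only a minimal sketch on made-up data (toy, f1, f2 and f3 are hypothetical
names, and I am not claiming this is what varimp does internally): for an
interaction-only term, the number of columns is the product of the factors'
level counts, so it grows quickly with either the number of factors or the
number of levels per factor.

```r
# Hypothetical toy data: three factors with 2, 3 and 4 levels.
toy <- expand.grid(f1 = factor(1:2), f2 = factor(1:3), f3 = factor(1:4))

# Interaction-only design matrix without an intercept:
# one indicator column per combination of levels.
mm <- model.matrix(~ f1:f2:f3 - 1, data = toy)

ncol(mm)                     # number of columns actually created
prod(sapply(toy, nlevels))   # product of the level counts, for comparison
```

With many levels per factor, or many factors in the term, that product soon
becomes larger than anything that can be allocated.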

Another suggestion was that there could be a factor with too many levels. In
my case, all of the variables are continuous. Term 1 (x1 below) is the day
of the year, which does happen to be integers 1 ... 366. But the variable is
class numeric, not integer, so I don't believe cforest would treat it as a
factor, although I do not know how to tell whether cforest is treating
something as continuous or as a factor.
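
The check I can do from my side is on how the columns are stored in the data
frame. This is a minimal sketch on a hypothetical stand-in data frame (toy,
day and temp are made-up names); it only shows the storage class and the
number of distinct values per column, and says nothing about how cforest
handles the variables internally.

```r
# Hypothetical stand-in for the real data: a numeric day-of-year column
# holding whole numbers, plus one continuous covariate.
toy <- data.frame(day  = as.numeric(c(1, 2, 2, 366)),
                  temp = c(16.7, 23.9, 26.3, 31.7))

classes  <- sapply(toy, class)       # storage class of each column
is_fac   <- sapply(toy, is.factor)   # TRUE only for factor columns
n_unique <- sapply(toy, function(v) length(unique(v)))  # distinct values

classes
is_fac
n_unique
```

On my real data, the same calls on df would show whether x1 is stored as
numeric and how many distinct values it takes.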

Thank you for any help you can provide. I am running R 2.13.1 with party
0.9-99994. You can download the data from
http://www.duke.edu/~jjr8/data.rdata (512 KB). Here is the complete code:

> load("\\Temp\\data.rdata")
> nrow(df)
[1] 27368
> summary(df)
       y                 x1              x2               x3               x4
 Min.   :  0.000   Min.   :  1.0   Min.   :0.0000   Min.   :  1.00   Min.   :  52
 1st Qu.:  0.000   1st Qu.:105.0   1st Qu.:0.0000   1st Qu.: 30.00   1st Qu.:1290
 Median :  1.282   Median :169.0   Median :0.2353   Median : 38.00   Median :1857
 Mean   :  5.651   Mean   :178.7   Mean   :0.2555   Mean   : 55.03   Mean   :1907
 3rd Qu.:  5.353   3rd Qu.:262.0   3rd Qu.:0.4315   3rd Qu.: 47.00   3rd Qu.:2594
 Max.   :195.238   Max.   :366.0   Max.   :1.0000   Max.   :400.00   Max.   :3832
       x5                  x6              x7                  x8
 Min.   : 0.008184   Min.   :16.71   Min.   :0.0000000   Min.   : 0.02727
 1st Qu.: 6.747035   1st Qu.:23.92   1st Qu.:0.0000000   1st Qu.: 0.11850
 Median :11.310277   Median :26.35   Median :0.0001569   Median : 0.14625
 Mean   :12.889021   Mean   :26.31   Mean   :0.0162043   Mean   : 0.20684
 3rd Qu.:18.427410   3rd Qu.:28.95   3rd Qu.:0.0144660   3rd Qu.: 0.20095
 Max.   :29.492380   Max.   :31.73   Max.   :0.3157486   Max.   :11.76877
> library(HH)
<output deleted>
> vif(y ~ ., data=df)
      x1       x2       x3       x4       x5       x6       x7       x8 
1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
> library(party)
<output deleted>
> mycontrols <- cforest_unbiased(ntree=50, mtry=3)  # Small forest but requires a few minutes
> myforest <- cforest(y ~ ., data=df, controls=mycontrols)
> varimp(myforest)
        x1         x2         x3         x4         x5         x6         x7         x8
 11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377
> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) : 
  term 1 would require 9e+12 columns


