[R] R memory and CPU requirements

Deepayan Sarkar deepayan at stat.wisc.edu
Fri Oct 17 16:53:38 CEST 2003


On Friday 17 October 2003 03:33, Alexander Sirotkin \[at Yahoo\] wrote:

> > > > > One more (hopefully last one) : I've been very
> > > > > surprised when I tried to fit a model (using
> > > > > aov())
> > > > > for a sample of size 200 and 10 variables and
> > > > > their interactions.
> > > >
> > > > That doesn't really say much. How many of these
> > > > variables are factors ? How
> > > > many levels do they have ? And what is the order
> > > > of the interaction ? (Note
> > > > that for 10 numeric variables, if you allow all
> > > > interactions, then there will
> > > > be 100 terms in your model. This increases for
> > > > factors.)
> > > >
> > > > In other words, how big is your model matrix ?
> > >
> > > I see...
> > >
> > > Unfortunately, model.matrix() ran out of memory :)
> > > I have 10 variables, 6 of which are factors, 2 of
> > > which have quite a lot of levels (about 40). And I
> > > would like to allow all interactions.
> > >
> > > I understand your point about categorical variables,
> > > but still - this does not seem like too much data to me.
> >
> > That's one way to look at it. You don't have enough
> > data for the model you are
> > trying to fit. The usual approach under these
> > circumstances is to try
> > 'simpler' models.
> >
> > Please try to understand what you are trying to do
> > (in this case by reading an
> > introductory linear model text) before blindly
> > applying a methodology.
> >
> > Deepayan
>
> I did study ANOVA and I do have enough observations.
> 200 was only a random sample of more than 5000, which I
> think should be enough. However, I'm afraid to even
> think about amount of RAM I will need with R to fit a
> model for this data.

Let's see. You have 10 variables, 6 of which are factors, 2 of which have at 
least 40 levels, and you want all interactions. Let's conservatively estimate 
that the other four factors have only 2 levels each. 

> x1 = gl(40, 1, 1)
> x2 = gl(40, 1, 1)
> x3 = gl(2, 1, 1)
> x4 = gl(2, 1, 1)
> x5 = gl(2, 1, 1)
> x6 = gl(2, 1, 1)

> dim(model.matrix(~ x1 * x2 * x3 * x4 * x5 * x6))
[1]     1 25600

This was for one data point; increasing that would only increase the number of 
rows, while the number of columns would stay the same. And of course, this 
covers just the interactions among the six factors, using the smallest level 
counts consistent with the information you have given us about your model. In 
actual fact, your model matrix will have many, many more columns.
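Incidentally, you can verify the column count without building the matrix at 
all: for factors with a full set of interactions, the number of columns is 
just the product of the numbers of levels. A quick check in R:

```r
## Columns of model.matrix(~ x1 * x2 * x3 * x4 * x5 * x6)
## for factors with 40, 40, 2, 2, 2 and 2 levels:
prod(c(40, 40, 2, 2, 2, 2))
## [1] 25600
```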

I hope you realize that the number of columns in the model matrix is the 
number of parameters you are trying to estimate. If your sample size is less 
than this number (and 5000 is way less), then there will be infinitely many 
solutions to this problem, each of which will fit your data perfectly. Do you 
really want such an answer ? Assuming that you find one, what are you going 
to do with it ?
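To make that concrete, here is a small sketch with made-up toy data (nothing 
from your actual problem): two factors with a full interaction and fewer 
observations than parameters. lm() reports the non-estimable coefficients as 
NA, and the fitted values reproduce the observed data exactly:

```r
set.seed(1)
n <- 20
d <- data.frame(y  = rnorm(n),
                x1 = gl(5, 1, n),   # 5-level factor
                x2 = gl(6, 1, n))   # 6-level factor
## full interaction: 5 * 6 = 30 parameters, but only 20 observations
fit <- lm(y ~ x1 * x2, data = d)
sum(is.na(coef(fit)))     # 10 coefficients cannot be estimated
max(abs(residuals(fit)))  # essentially 0: the model 'fits' perfectly
```

A perfect fit here is not a success; it just means the data cannot 
distinguish between the many models that interpolate it.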

I have no idea what made you choose such a high-order model, but as Andy has 
said, you really should try to figure out what exactly your goals are before 
proceeding. If you believe that your data really cannot be modeled 
reasonably by anything simpler, you probably should not use a linear model at 
all. 

Hope that helps,

Deepayan
