[R] model.matrix consistency across repeated calls

Thu Nov 1 00:56:37 CET 2007

I am using R to construct model matrices that I then pass into C for
subsequent fitting.

Suppose I have a data.frame so big that, if I called 'model.matrix'
directly on the whole thing, the results would be too big to handle
(because factors expand to multiple columns, etc.). Instead, I really
want to sequentially call 'model.matrix' on subsets of rows, and then
'rbind' the [compressed] results. However, this is not guaranteed to
give the same result as just calling 'model.matrix' on the whole thing.
Certain terms used in formulae, such as 'poly', are sensitive to the
range of their argument; and I'm also worried about things like columns
sometimes disappearing when particular levels of a factor don't appear
in one of the subsets (I don't think that one actually happens, but I'm
not *sure*).
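
To make the worry concrete, here is a minimal sketch on a toy data frame (the names 'd', 'form', 'half1', 'half2' and 'sub' are made up for illustration, not taken from my real data). It compares 'model.matrix' on all rows with the 'rbind' of two halves, and then checks what happens to the factor columns when a level is absent from a subset. As far as I can tell, a factor that keeps its original 'levels' attribute after subsetting still yields all of its columns; it is only when the factor is rebuilt on the subset that a column goes missing.

```r
# Toy data: one numeric covariate and one three-level factor.
d <- data.frame(x = 1:10,
                f = factor(rep(c("a", "b", "c"), length.out = 10)))
form <- ~ poly(x, 2) + f

# Whole data frame in one call.
mm_full <- model.matrix(form, d)

# Same rows, but built in two pieces and stacked.
half1 <- d[1:5, ]
half2 <- d[6:10, ]
mm_pieces <- rbind(model.matrix(form, half1),
                   model.matrix(form, half2))

# Dimensions agree, but the poly() columns do not: the basis is
# recomputed from whichever x values each call happens to see.
same_dim    <- identical(dim(mm_full), dim(mm_pieces))
same_values <- isTRUE(all.equal(mm_full, mm_pieces,
                                check.attributes = FALSE))

# Factor columns: subsetting rows keeps the full set of levels...
sub <- d[d$f != "c", ]
ncol_kept <- ncol(model.matrix(~ f, sub))

# ...but rebuilding the factor on the subset drops the unused level,
# and with it the corresponding column.
sub_rebuilt <- sub
sub_rebuilt$f <- factor(sub_rebuilt$f)
ncol_dropped <- ncol(model.matrix(~ f, sub_rebuilt))
```

Here 'same_dim' is TRUE while 'same_values' is FALSE, and 'ncol_kept' is one larger than 'ncol_dropped'.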

Can anyone suggest how to achieve consistency-of-interpretation across
calls to 'model.matrix'? For example: are there certain types of term in
formulae that I just have to avoid? Or can I benefit somehow from
'model.frame' (which I have never understood...)?

In case you are wondering: I'm not going to directly rbind the results
together, of course. After each call to model.matrix, I pass the result
of that call into C where I compress it massively, and the compressed
version of the whole thing squeezes into memory OK.
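
Roughly, the workflow looks like the sketch below. Everything in it is a stand-in: 'compress_chunk' is a hypothetical placeholder for my C routine (here it just takes column sums), and the data frame, formula and 'chunk_size' are invented for illustration. The formula deliberately contains no data-dependent terms such as 'poly', and the factor keeps its full set of levels in every chunk, which is the situation where I would expect the pieces to agree with the whole.

```r
# Hypothetical stand-in for the C-side compression step.
compress_chunk <- function(mm) colSums(mm)

set.seed(1)
d <- data.frame(x = rnorm(20),
                f = factor(sample(c("a", "b"), 20, replace = TRUE),
                           levels = c("a", "b")))
form <- ~ x + f          # no data-dependent terms
chunk_size <- 5          # rows per call; illustrative value

# Build the model matrix one block of rows at a time and compress
# each block before moving on, so the full matrix never exists.
starts <- seq(1, nrow(d), by = chunk_size)
compressed <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(d))
  mm <- model.matrix(form, d[rows, , drop = FALSE])
  compress_chunk(mm)
})

# For this toy "compression" the pieces can be checked against the
# result of a single call on all rows.
from_pieces <- Reduce(`+`, compressed)
from_whole  <- colSums(model.matrix(form, d))
```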

Thanks for any help (preferably replying to me as well as to the list--
ta)

Mark

-- 
Mark Bravington
CSIRO Mathematical & Information Sciences
Marine Laboratory
Castray Esplanade
Hobart 7001
TAS

ph (+61) 3 6232 5118
fax (+61) 3 6232 5012
mob (+61) 438 315 623