[R] Can we do GLM on 2GB data set with R?

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Jan 21 10:13:57 CET 2007


'given sufficient hardware' and a suitable OS, 'yes'.

You will see quoted on this list from time to time:

> library(fortunes)
> fortune("Yoda")

Evelyn Hall: I would like to know how (if) I can extract some of the
information from the summary of my nlme.
Simon Blomberg: This is R. There is no if. Only how.
    -- Evelyn Hall and Simon `Yoda' Blomberg
       R-help (April 2005)

You then mention 'my PC'.  If your 'PC' is running Windows, the answer is 
'with some work', since we don't have a version of R for Win64, and Win32 
is limited to a 3GB user address space.

To be more precise, we would have to know more about your GLM (and even 
whether you mean GLM in the commonly accepted sense or the SASism with a 
redundant G), including what the variables are (I guess categorical stored 
as small integers?).

My DPhil student Fei Chen looked at ways of applying R to large GLMs with 
data stored in a MySQL database.  His tests were about 4 years ago on 
32-bit Linux, and he was able to run about 1 million cases on 30 
categorical (mainly binary) variables with (I think) up to 5-way 
interactions.  That is a very large GLM problem, and it is unusual for
it to be worth fitting a (mainly linear) model with over 10,000 cases.
(Also, there are normally problems with the homogeneity of very large 
datasets that taint the independence assumptions made by GLMs.)

My guess is that you have been considering the function glm().  There is a 
function bigglm() in package biglm (by Thomas Lumley).  I don't think you 
would be able even to load your data into 32-bit R, but it would be 
possible to use the ideas behind bigglm (which was one of the approaches 
Fei assessed) and perhaps even bigglm itself with one of the DBMS 
interfaces to R to retrieve data in chunks.  (bigglm uses chunks of rows, 
but chunks of columns may be more efficient.)
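
For illustration only (this is a rough, untested sketch: the database, 
the table name 'claims', the column names and the chunk size are all 
made up), the chunked approach might look like the following.  bigglm() 
will accept, in place of a data frame, a function that returns the next 
chunk of rows as a data frame, or NULL when the rows are exhausted, and 
that restarts from the beginning when called with reset = TRUE:

library(biglm)
library(DBI)
library(RMySQL)        # or any other DBI back end

con <- dbConnect(MySQL(), dbname = "mydb")     # hypothetical database

## A closure handing successive chunks of rows to bigglm():
## reset = TRUE restarts the query, reset = FALSE fetches the next chunk.
make.chunker <- function(con, query, chunksize = 10000) {
  res <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(res)) dbClearResult(res)
      res <<- dbSendQuery(con, query)
      return(NULL)
    }
    chunk <- fetch(res, n = chunksize)
    if (nrow(chunk) == 0) NULL else chunk
  }
}

chunks <- make.chunker(con, "SELECT y, wt, x1, x2 FROM claims")
fit <- bigglm(y ~ factor(x1) + factor(x2), data = chunks,
              family = binomial(), weights = ~ wt)
summary(fit)

Only one chunk is in memory at a time, so the memory needed is governed 
by the chunk size and the number of model terms, not by the size of the 
table.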

Another possibility is that you want to fit a log-linear model to purely 
categorical data, and could make use of loglin().  That will be more 
efficient if the contingency table is densely populated.
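
As a toy illustration (using the built-in HairEyeColor table, not your 
data), fitting the model with all two-way interactions to a three-way 
contingency table is just

tab <- HairEyeColor   # in practice e.g. xtabs(count ~ x1 + x2 + x3, data = d)
fit <- loglin(tab, margin = list(c(1, 2), c(1, 3), c(2, 3)), fit = TRUE)
fit$lrt    # likelihood-ratio statistic for the fitted model
fit$df     # residual degrees of freedom

It is the contingency table, not the raw cases, that has to fit in 
memory, which is why the density of the table matters.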

My experience suggests that the important issues here are likely to be 
statistical rather than computational, and this is more a topic for a 
consultant than volunteer help on a discussion list.


On Sat, 20 Jan 2007, WILLIE, JILL wrote:

> We are wanting to use R instead of/in addition to our existing stats
> package because of its huge assortment of stat functions.  But, we
> routinely need to fit GLM models to files that are approximately 2-4GB
> (as SQL tables, un-indexed, w/tinyint-sized fields except for the
> response & weight variables).  Is this feasible, does anybody know,
> given sufficient hardware, using R?  It appears to use a great deal of
> memory on the small files I've tested.
>
> I've read the data import, memory.limit, memory.size & general
> documentation but can't seem to find a way to tell what the boundaries
> are & roughly gauge the needed memory...other than trial & error.  I've
> started by testing the data.frame & run out of memory on my PC.  I'm new
> to R so please be forgiving if this is a poorly-worded question.
>
> Jill Willie
> Open Seas
> Safeco Insurance
> jilwil at safeco.com
> 206-545-5673


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


