[R] Reading large datasets and fitting logistic models in R

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Aug 10 08:18:43 CEST 2008


See also bigglm() in package biglm.

On Sat, 9 Aug 2008, Pradheep K E wrote:

> Hi R-experts,
>
> Does anyone have experience using R for handling large scale data (millions
> of rows, hundreds or thousands of features)?
>
> What is the largest size of data that anyone has used with glm?

I've used 700,000 rows and about 100 columns, but that was 4 years ago and 
we have more memory now.  It matters whether the 'features' are numeric or 
categorical, as the latter can expand to many columns in the model matrix.
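
To illustrate the point about categorical variables, here is a minimal 
sketch on made-up data (the data frame d and its columns are invented for 
this example).  It assumes R's default treatment contrasts, under which a 
factor with k levels contributes k - 1 columns to the model matrix.

```r
# Toy data: one numeric column and one 50-level factor (illustrative only)
set.seed(1)
lev <- sprintf("lev%02d", 1:50)
d <- data.frame(
  x = rnorm(1000),
  f = factor(sample(lev, 1000, replace = TRUE), levels = lev)
)

# Build the model matrix that glm() would construct for this formula
X <- model.matrix(~ x + f, data = d)

ncol(d)  # 2 columns in the data frame
ncol(X)  # 51 columns: intercept + x + 49 dummy columns for f
```

So two columns of raw data become 51 columns of doubles here, and it is 
the model matrix, not the data frame, that drives the memory needed.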

As a rough guide, expect glm() to need about 200 x nrows x ncols bytes of 
memory.  Calling glm.fit directly will be more efficient (I've just tested 
100,000 x 100, which used 1.2Gb against the 2Gb that guide suggests).
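
The arithmetic behind that guide, and a direct call to glm.fit, can be 
sketched as follows.  est_glm_bytes is a hypothetical helper written for 
this example, not part of R, and the simulated data are made up; the 
figure of 200 bytes per cell is only the rough guide quoted above.

```r
# Hypothetical helper: rough memory guide of ~200 bytes per cell
est_glm_bytes <- function(nrows, ncols, bytes_per_cell = 200) {
  bytes_per_cell * nrows * ncols
}
est_glm_bytes(1e5, 100) / 1e9  # about 2 (Gb) for a 100,000 x 100 problem

# Small simulated logistic regression, fitted with glm.fit directly.
# glm.fit takes a ready-made model matrix, so the intercept column
# has to be supplied by hand.
set.seed(1)
n <- 1000
X <- cbind(1, matrix(rnorm(n * 4), n, 4))
beta <- c(-0.5, 1, 0, 0.5, -1)            # made-up true coefficients
y <- rbinom(n, 1, plogis(drop(X %*% beta)))

fit <- glm.fit(X, y, family = binomial())
fit$coefficients
fit$converged
```

glm.fit skips the formula and model-frame machinery of glm(), which is 
presumably where much of the saving comes from, but it returns a plain 
list rather than a full "glm" object.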

> Also, is there a library to read data in sparse data format (like SVMlight
> format)?

You mean *store* data in a sparse format when read in?  I'm not sure of 
the relevance, but look at the method of bigglm() that takes a function 
as its data argument for a way to avoid even doing that.  If the data are 
numeric there are at least three sparse-matrix packages on CRAN.
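
A minimal sketch of that approach follows.  According to the biglm 
documentation, when 'data' is a function it is called with a single 
argument 'reset': reset = TRUE means start again from the beginning, and 
reset = FALSE means return the next chunk as a data frame, or NULL when 
no data remain.  make_chunk_reader is a hypothetical helper written for 
this example, the CSV file is simulated, and all columns are assumed to 
be numeric (factors would need consistent levels across chunks).

```r
# Hypothetical helper: returns a function with the reset-style interface
make_chunk_reader <- function(path, chunk_rows = 10000) {
  con <- NULL
  cols <- NULL
  function(reset = FALSE) {
    if (reset || is.null(con)) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      # First line of the file holds the column names
      cols <<- scan(con, what = "", sep = ",", nlines = 1, quiet = TRUE)
      if (reset) return(invisible(NULL))
    }
    # Read the next block of rows from the still-open connection
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = cols, nrows = chunk_rows),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) return(NULL)
    chunk
  }
}

# Simulated data written to a temporary CSV file (illustrative only)
set.seed(1)
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + d$x1 - 0.5 * d$x2))
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)

reader <- make_chunk_reader(path, chunk_rows = 1000)

# Fit only if biglm is installed; the full data are never held in memory
if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(y ~ x1 + x2, data = reader, family = binomial())
  print(coef(fit))
}
```

Only one chunk of chunk_rows rows is in memory at a time, at the cost of 
rereading the file on each iteration of the fitting algorithm.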

Ultimately R's code such as glm() is designed for flexibility and to do 
interesting things with the fit: for really large problems you will do 
better to write a specialized fitting routine.  bigglm() is an
intermediate position.

There's also the question of whether there are any interesting homogeneous 
datasets of this sort of size.  Often doing analyses on subsets and a 
meta-analysis is a much more insightful approach (as it was in our 
problem: we split on one of the categorical explanatory variables).
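
As a sketch of what fitting on subsets and then combining might look 
like: the data below are simulated, and the fixed-effect 
(inverse-variance) pooling is an assumption made for this example, not 
necessarily the kind of meta-analysis used in the problem described 
above.

```r
# Simulated data with a categorical variable g to split on (illustrative)
set.seed(1)
n <- 6000
d <- data.frame(
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x))

# Fit the same logistic model separately within each level of g
fits <- lapply(split(d, d$g), function(sub) {
  glm(y ~ x, family = binomial, data = sub)
})

# Slope estimate and standard error from each subset
est <- sapply(fits, function(f) coef(summary(f))["x", "Estimate"])
se  <- sapply(fits, function(f) coef(summary(f))["x", "Std. Error"])

# Fixed-effect pooling: weight each estimate by its inverse variance
w <- 1 / se^2
pooled    <- sum(w * est) / sum(w)
pooled_se <- sqrt(1 / sum(w))

cbind(est, se)               # per-subset results, worth inspecting first
c(pooled = pooled, se = pooled_se)
```

Each fit needs memory only for its own subset, and comparing the 
per-subset estimates before pooling shows whether the subsets really 
behave alike.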

> Thanks
> Pradheep

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


