[R] read large amount of data

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Jul 18 18:34:50 CEST 2005

On Mon, 18 Jul 2005, Thomas Lumley wrote:

> On Mon, 18 Jul 2005, Weiwei Shi wrote:
>> Hi,
>> I have a dataset with 2194651x135, in which all the numbers are 0,1,2,
>> and is bar-delimited.
>> I used the following approach which can handle 100,000 lines:
>> t<-scan('fv', sep='|', nlines=100000)
>> t1<-matrix(t, nrow=135, ncol=100000)
>> t2<-t(t1)
>> t3<-as.data.frame(t2)
>> I changed my plan into using stratified sampling with replacement (col
>> 2 is my class variable: 1 or 2). The class distr is like:
>> awk -F\| '{print $2}' fv | sort | uniq -c
>> 2162792 1
>>  31859 2
>> Is it possible to use R to read the whole dataset and do the
>> stratified sampling? Is it really dependent on my memory size?
> You may well not be able to read the whole data set into memory at once:
> it would take a bit more than 2GB of memory even to store it.

About 1.2GB if stored as an integer (not double) vector.
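
The two figures can be checked with a little arithmetic. This is only a sketch: the dimensions come from the question, the variable names are illustrative, and it counts the raw vector only, ignoring any copies R makes while reading or reshaping.

```r
# Dimensions taken from the original question.
n_rows <- 2194651
n_cols <- 135
n_values <- n_rows * n_cols           # 296,277,885 values in total

bytes_double  <- n_values * 8         # an R double takes 8 bytes
bytes_integer <- n_values * 4         # an R integer takes 4 bytes

round(bytes_double  / 1e9, 2)         # 2.37 (GB, stored as double)
round(bytes_integer / 1e9, 2)         # 1.19 (GB, stored as integer)
```

Note that scan() reads doubles by default (what = double()); passing what = integer() gives the smaller integer vector.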

> You can use readLines to read it in chunks of, say, 10000 lines.
> To do stratified sampling I would suggest Bernoulli sampling of slightly
> more than you want. E.g. if you want 10000 from class 1, keeping each
> element with probability 10500/2162792 will get you approximately
> Poisson(10500) elements, which will be more than 10000 elements with
> better than 99.999% probability. You can then choose 10000 at random
> from these. I can't think of an approach that is guaranteed to work in
> one pass over the data, but 99.999% is pretty close.
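
A minimal sketch of this chunked Bernoulli approach, under stated assumptions: bernoulli_sample() and all its argument names are hypothetical, the class counts are known in advance (for example from the awk tally above), and the class label sits in column 2 of each bar-delimited line. It keeps raw lines rather than parsed numbers, and samples without replacement within each class.

```r
# Hypothetical helper: one pass over a bar-delimited file, keeping each line
# with a class-specific probability, then trimming each class to its target.
# n_class, n_want, n_aim are numeric vectors named by class label:
#   n_class = lines per class in the file, n_want = lines wanted,
#   n_aim   = expected number kept before trimming (somewhat above n_want).
bernoulli_sample <- function(path, n_class, n_want, n_aim,
                             class_col = 2, chunk_size = 10000) {
  p_keep <- n_aim / n_class
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  kept_class <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Pull the class label out of each line.
    cls <- vapply(strsplit(lines, "|", fixed = TRUE),
                  function(f) f[class_col], character(1))
    p <- p_keep[cls]
    p[is.na(p)] <- 0                  # unknown or missing label: never keep
    keep <- runif(length(lines)) < p
    kept <- c(kept, lines[keep])
    kept_class <- c(kept_class, cls[keep])
  }
  # Choose exactly n_want lines at random from those kept in each class.
  out <- lapply(names(n_want), function(k) {
    pool <- kept[kept_class == k]
    if (length(pool) < n_want[[k]]) stop("too few lines kept for class ", k)
    pool[sample.int(length(pool), n_want[[k]])]
  })
  unlist(out)
}

# Possible call for the file in the question (not run here):
# s <- bernoulli_sample("fv",
#                       n_class = c("1" = 2162792, "2" = 31859),
#                       n_want  = c("1" = 10000,   "2" = 10000),
#                       n_aim   = c("1" = 10500,   "2" = 10500))
```

The result is a character vector of lines, which could then be parsed with scan(text = ...) or a similar reader. The stop() call covers the rare case where fewer lines than wanted were kept.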

Reservoir sampling methods will work in one pass.  See e.g. my 1987 book 
on Stochastic Simulation.  But Thomas' idea will be easier to implement in 
R, and I would have chosen 20000 rather than 10500, to be sure of getting 
enough.
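
For comparison, a rough sketch of a one-pass reservoir sampler. It uses the simple textbook variant often called Algorithm R, which may not be the method described in the book cited above. reservoir_sample() and its arguments are hypothetical, and the class label is again assumed to be in column 2. The per-line loop will be slow in R on two million lines, which is part of why the Bernoulli route is easier in practice.

```r
# Hypothetical helper: keep a uniform random sample of k lines per class in
# one pass, without knowing the class counts in advance.
reservoir_sample <- function(path, k, classes = c("1", "2"),
                             class_col = 2, chunk_size = 10000) {
  # One reservoir of k slots and one running count per class.
  res <- lapply(classes, function(cl) character(k))
  names(res) <- classes
  seen <- setNames(numeric(length(classes)), classes)
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    cls <- vapply(strsplit(lines, "|", fixed = TRUE),
                  function(f) f[class_col], character(1))
    for (i in seq_along(lines)) {
      cl <- cls[i]
      if (is.na(cl) || !(cl %in% classes)) next
      n <- seen[[cl]] + 1
      seen[[cl]] <- n
      if (n <= k) {
        res[[cl]][n] <- lines[i]               # fill the reservoir first
      } else {
        j <- sample.int(n, 1)                  # uniform on 1..n
        if (j <= k) res[[cl]][j] <- lines[i]   # replace with probability k/n
      }
    }
  }
  # Drop unused slots if a class had fewer than k lines.
  Map(function(r, n) r[seq_len(min(k, n))], res, seen)
}

# Possible call for the file in the question (not run here):
# r <- reservoir_sample("fv", k = 10000)
```

The return value is a list with one character vector of lines per class.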

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
