[R] read large amount of data
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon Jul 18 18:34:50 CEST 2005
On Mon, 18 Jul 2005, Thomas Lumley wrote:
> On Mon, 18 Jul 2005, Weiwei Shi wrote:
>
>> Hi,
>> I have a dataset with 2194651x135, in which all the numbers are 0,1,2,
>> and is bar-delimited.
>>
>> I used the following approach which can handle 100,000 lines:
>> t<-scan('fv', sep='|', nlines=100000)
>> t1<-matrix(t, nrow=135, ncol=100000)
>> t2<-t(t1)
>> t3<-as.data.frame(t2)
>>
>> I changed my plan into using stratified sampling with replacement (col
>> 2 is my class variable: 1 or 2). The class distr is like:
>> awk -F\| '{print $2}' fv | sort | uniq -c
>> 2162792 1
>> 31859 2
>>
>> Is it possible to use R to read the whole dataset and do the
>> stratified sampling? Is it really dependent on my memory size?
>
> You may well not be able to read the whole data set into memory at once:
> it would take a bit more than 2Gb memory even to store it.
About 1.2G if stored as an integer (not double) vector.
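
The two figures above can be checked with a little arithmetic. This is a rough sketch, assuming R's usual 4 bytes per integer and 8 bytes per double and ignoring per-object overhead; the variable names are only for illustration.

```r
# Rough storage estimate for the 2194651 x 135 data set described above.
n_rows <- 2194651
n_cols <- 135
n_values <- n_rows * n_cols        # total number of values

bytes_integer <- n_values * 4      # assumption: 4 bytes per integer
bytes_double  <- n_values * 8      # assumption: 8 bytes per double

round(bytes_integer / 1e9, 2)      # roughly 1.19 (GB), the "about 1.2G" figure
round(bytes_double  / 1e9, 2)      # roughly 2.37 (GB), "a bit more than 2Gb"
```

Since scan() reads doubles by default, passing what = integer() is one way to get the smaller representation.
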
> You can use readLines to read it in chunks of, say, 10000 lines.
>
> To do stratified sampling I would suggest Bernoulli sampling of slightly
> more than you want. E.g. if you want 10000 from class 1, keeping each
> element with probability 10500/2162792 will get you Poisson(10500)
> elements, which will be more than 10000 elements with better than 99.999%
> probability. You can then choose 10000 at random from these. I can't think
> of an approach that is guaranteed to work in one pass over the data,
> but 99.999% is pretty close.
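
The suggestion above (read in chunks with readLines, keep each row of the wanted class with a fixed probability, then trim to the exact size) might be sketched as follows. This is only one possible reading of the idea: the function name bernoulli_sample and all of its arguments are made up for illustration, and it assumes every line holds the same number of integer fields separated by "|".

```r
# Sketch: one pass over a bar-delimited file in chunks, keeping each row of
# one class with probability keep_prob, then trimming to n_wanted rows.
# All names here are hypothetical, not taken from the thread.
bernoulli_sample <- function(path, class_value, keep_prob, n_wanted,
                             chunk_lines = 10000, class_col = 2) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break
    # Extract the class column as text for each line in the chunk.
    cls <- vapply(strsplit(lines, "|", fixed = TRUE),
                  function(f) f[class_col], character(1))
    in_class <- !is.na(cls) & cls == as.character(class_value)
    # Bernoulli step: each eligible row is kept independently.
    keep <- in_class & (runif(length(lines)) < keep_prob)
    kept <- c(kept, lines[keep])
  }
  if (length(kept) < n_wanted)
    stop("too few rows kept; rerun with a larger keep_prob")
  # Choose exactly n_wanted of the kept rows at random.
  kept <- kept[sample.int(length(kept), n_wanted)]
  # Parse only the retained rows into an integer matrix, one row per record.
  vals <- scan(text = kept, what = integer(), sep = "|", quiet = TRUE)
  matrix(vals, nrow = n_wanted, byrow = TRUE)
}

# Chance of ending up with more than 10000 rows, using the Poisson
# approximation quoted above.
ppois(10000, lambda = 10500, lower.tail = FALSE)
```

For the file in this thread the call would look something like bernoulli_sample("fv", class_value = 1, keep_prob = 10500/2162792, n_wanted = 10000), and similarly for class 2 with its own probability.
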
Reservoir sampling methods will work in one pass. See e.g. my 1987 book
on Stochastic Simulation. But Thomas' idea will be easier to implement in
R, and I would have chosen 20000 rather than 10500, to be sure of getting
enough.
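
For comparison, here is a minimal sketch of the simplest reservoir scheme (commonly called Algorithm R), which draws k lines uniformly without replacement in a single pass. It is not necessarily the variant described in the book; the names reservoir_sample, keep and is_class1 are hypothetical, and the line-by-line loop will be slow in R on a file of this size.

```r
# Sketch of reservoir sampling: a uniform sample of k eligible lines in one
# pass, without knowing in advance how many eligible lines there are.
reservoir_sample <- function(path, k,
                             keep = function(lines) rep(TRUE, length(lines)),
                             chunk_lines = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  reservoir <- character(k)
  seen <- 0                              # eligible lines seen so far
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break
    lines <- lines[keep(lines)]          # restrict to the stratum of interest
    for (line in lines) {
      seen <- seen + 1
      if (seen <= k) {
        reservoir[seen] <- line          # fill the reservoir first
      } else {
        j <- sample.int(seen, 1)         # uniform on 1..seen
        if (j <= k) reservoir[j] <- line # replace a slot with probability k/seen
      }
    }
  }
  reservoir[seq_len(min(seen, k))]       # fewer than k if the stratum is small
}

# Example predicate: TRUE for lines whose second bar-delimited field is "1".
is_class1 <- function(lines) grepl("^[^|]*\\|1\\|", lines)
```

A call such as reservoir_sample("fv", 10000, keep = is_class1) would return the sampled lines as text, which can then be parsed with scan(text = ..., sep = "|").
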
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595