Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

Berton Gunter gunter.berton at gene.com
Mon Feb 14 18:41:25 CET 2005


> > ... read all 200 million rows a pipe dream no matter what
> > platform I'm using?
>
> In principle R can handle this with enough memory. However, 200 million
> rows and three columns is 4.8Gb of storage, and R usually needs a few
> times the size of the data for working space.
>
> You would likely be better off not reading the whole data set at once,
> but loading sections of it from Oracle as needed.
>
>  	-thomas
>
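
(As a rough illustration of the "load sections as needed" idea -- a minimal
sketch only, assuming the DBI/ROracle interface; the connection details,
table name "big_table", and columns x, y, z are made up:)

library(DBI)
library(ROracle)

## open a connection to the Oracle database (credentials are placeholders)
con <- dbConnect(Oracle(), username = "user", password = "pass",
                 dbname = "mydb")

## issue the query once, then pull the result back in manageable chunks
res <- dbSendQuery(con, "SELECT x, y, z FROM big_table")
repeat {
    chunk <- fetch(res, n = 1000000)   # one million rows at a time
    if (nrow(chunk) == 0) break
    ## ... update running summaries from 'chunk' here ...
}
dbClearResult(res)
dbDisconnect(con)
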

Thomas's comment raises a question:

Can someone give me an example (perhaps in a private response, since I'm off
topic here) where one actually needs all cases in a large data set ("large"
being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
say, searching for some particular characteristic like an adverse event in a
medical or customer repair database. Maybe a definition of "statistical" is:
anything that cannot be routinely done in a single-pass database query.

The reason I ask is that it seems to me that with millions of cases,
(careful, perhaps stratified or otherwise not completely random) sampling
should always suffice to reduce a data set to a size manageable for the
data analysis needs at hand. A sketch of what I mean follows below. But my
ignorance and naivete probably show here.
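
(For concreteness, a minimal sketch of the kind of stratified subsample I
have in mind, assuming the data are already in a data frame 'd' with a
factor column 'stratum' -- both names and the per-stratum size of 1000 are
made up:)

set.seed(1)
## split the row indices by stratum, then sample within each stratum
rows <- split(seq_len(nrow(d)), d$stratum)
take <- unlist(lapply(rows, function(idx) sample(idx, min(1000, length(idx)))))
dsub <- d[take, ]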

Thanks.

-- Bert



