Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

Thomas Colson tom_colson at ncsu.edu
Mon Feb 14 19:02:31 CET 2005


 The purpose of investigating the entire (200 million record) data set is to
compare several interpolation models for creating gridded elevation
data. Most models and algorithms do just that...take a manageable number of
"points" and do the math. My reasoning for using the entire dataset
(which is itself still a sample of the population of possible elevation
values) is that perhaps we can "tweak" algorithms that were developed 10
years ago from photo-derived contour lines and shuttle radar...this time
using high-resolution (1 meter average posting density) LIDAR elevation
data. A review of interpolating elevation surfaces into digital terrain
models isn't appropriate for this forum, but, needless to say, the more
points I can get into the model, the more confidence I can have in its
output.
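
To make concrete the sort of gridding I mean, here is a minimal sketch in
R -- the data frame 'pts' and its x, y, z columns are placeholders for
whatever the LIDAR table actually looks like, and akima::interp() is just
one convenient interpolator, not the method under study:

library(akima)  # Delaunay-based linear interpolation onto a regular grid

## Work on a manageable subsample of the point cloud
idx  <- sample(nrow(pts), 1e5)
surf <- with(pts[idx, ],
             interp(x, y, z,
                    xo = seq(min(x), max(x), length = 200),
                    yo = seq(min(y), max(y), length = 200),
                    duplicate = "mean"))   # average any co-located points
image(surf)  # quick look at the gridded elevation surface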


Recent studies have shown that hydrologic models using coarser-resolution
elevation data (points spaced farther apart) are often way off in their
slope and aspect calculations...errors which carry through to all manner of
hydrologic quantities (wetness index, time of concentration, peak flow,
etc.). Thus, the relationship between every single point in this particular
data set needs to be investigated so we can quantify the impact of error in
coarser-resolution datasets, or perhaps use what R tells us to come up with
a new interpolation routine.

Ambitious? Yes....
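
For what it's worth, here is a rough sketch of the resolution effect I'm
describing, in base R -- 'dem' (an elevation matrix) and the cell sizes are
made-up placeholders, and the slope formula is just simple central
differences, not any particular published algorithm:

## Slope in degrees from an elevation matrix 'dem' with cell size 'cs' (m)
slope_deg <- function(dem, cs) {
  nr <- nrow(dem); nc <- ncol(dem)
  dzdx <- (dem[, c(2:nc, nc)] - dem[, c(1, 1:(nc - 1))]) / (2 * cs)
  dzdy <- (dem[c(2:nr, nr), ] - dem[c(1, 1:(nr - 1)), ]) / (2 * cs)
  atan(sqrt(dzdx^2 + dzdy^2)) * 180 / pi
}

s_fine   <- slope_deg(dem, 1)                         # 1 m LIDAR grid
s_coarse <- slope_deg(dem[seq(1, nrow(dem), 10),      # every 10th cell,
                          seq(1, ncol(dem), 10)], 10) # i.e. ~10 m grid
summary(as.vector(s_fine))
summary(as.vector(s_coarse))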



Thanks for the replies so far. 




Tom Colson
Center for Earth Observation
North Carolina State University 
Raleigh, NC 27695
(919) 515 3434
(919) 673 8023
tom_colson at ncsu.edu

Online Calendar:
http://www4.ncsu.edu/~tpcolson



-----Original Message-----
From: Berton Gunter [mailto:gunter.berton at gene.com] 
Sent: Monday, February 14, 2005 12:41 PM
To: 'Thomas Lumley'; 'Thomas Colson'
Cc: r-help at stat.math.ethz.ch
Subject: Off topic -- large data sets. Was RE: [R] 64 Bit R Background
Question


> > read all 200 million rows a pipe dream no matter what
> platform I'm using?
> 
> In principle R can handle this with enough memory. However, 200 
> million rows and three columns is 4.8 GB of storage, and R usually 
> needs a few times the size of the data for working space.
> 
> You would likely be better off not reading the whole data set at once, 
> but loading sections of it from Oracle as needed.
> 
> 
>  	-thomas
> 
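
For concreteness, the arithmetic behind the 4.8 GB figure, and a sketch of
the chunked reading Thomas suggests -- 'con' is an assumed DBI connection to
Oracle and 'elev' a placeholder table with columns x, y, z, so treat this as
an illustration only:

200e6 * 3 * 8          # bytes for the raw doubles alone: 4.8e9, i.e. ~4.8 GB

library(DBI)
res <- dbSendQuery(con, "SELECT x, y, z FROM elev")
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 1e6)   # pull one million rows at a time
  ## ...summarise / sample / accumulate results from 'chunk' here...
}
dbClearResult(res)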

Thomas's comment raises a question:

Can someone give me an example (perhaps in a private response, since I'm off
topic here) where one actually needs all cases in a large data set ("large"
being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
say, searching for some particular characteristic like an adverse event in a
medical or customer-repair database, etc. Maybe a definition of
"statistical" is: anything that cannot be routinely done in a single-pass
database query.

The reason I ask is that it seems to me that with millions of cases, careful
sampling (perhaps stratified, or in some other not-completely-at-random way)
should always suffice to reduce a dataset to a manageable size sufficient
for the data analysis needs at hand. But my ignorance and naivete probably
show here.
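
To make that concrete, a minimal sketch of the kind of sampling I have in
mind, assuming the data are already in a data frame 'dat' with a stratum
label (say, a terrain class) in column 'strat' -- both names invented for
the example:

## Take up to n rows per stratum and analyse the subsample instead
take_per_stratum <- function(dat, strat, n = 1e4) {
  idx <- unlist(lapply(split(seq_len(nrow(dat)), strat),
                       function(i) i[sample.int(length(i),
                                                min(n, length(i)))]))
  dat[idx, ]
}
sub <- take_per_stratum(dat, dat$strat, n = 1e4)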

Thanks.

-- Bert




