[R] Performing Analysis on Subset of External data

Thomas Lumley tlumley at u.washington.edu
Wed Oct 6 20:13:17 CEST 2004


On Wed, 6 Oct 2004, Laura Quinn wrote:

> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.

It depends on the data format.  If, for example, you have free-format text 
files, it isn't possible to locate a specific chunk without reading all the 
earlier entries, because the rows have no fixed length.  You can still save 
time and space by having some other program (Perl, say) read the file and 
write out a file with just the 1500 rows you want.
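
As a rough illustration of that kind of external filter, here is a minimal 
sketch in Python rather than Perl.  The function name extract_rows and its 
arguments are made up for this example, and it assumes one record per line 
with no header line.  It still reads every earlier line, but only one at a 
time, so memory use stays small.

```python
import itertools


def extract_rows(src_path, dst_path, first_row, n_rows):
    """Copy n_rows consecutive lines, starting at the 1-based line
    first_row, from src_path to dst_path.  Returns the number of lines
    actually written (fewer than n_rows if the file ends early).

    Hypothetical helper: assumes one record per line and no header.
    """
    start = first_row - 1  # islice counts lines from 0
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # Earlier lines are read and discarded one by one; reading stops
        # as soon as the last wanted line has been copied.
        for line in itertools.islice(src, start, start + n_rows):
            dst.write(line)
            written += 1
    return written
```

The resulting small file can then be read into R in the usual way, for 
example with read.table(), and the same call can be repeated for each of 
the 20 files.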

A better strategy would be for the data to be either in a database or in a 
format such as netCDF designed for random access.
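
To make the database option concrete, here is a small sketch using SQLite 
through Python's standard sqlite3 module.  The table name obs, the column 
names, and both function names are assumptions made for this example, and 
each row is stored as one unparsed line of text for simplicity.  The file 
is read in full once when it is loaded; after that, a block of consecutive 
rows can be fetched through the primary key without scanning the rows 
before it.

```python
import sqlite3


def load_text_file(conn, src_path):
    """Load every line of src_path into a table keyed by its 1-based
    row number.  Hypothetical schema, for illustration only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obs "
        "(row_no INTEGER PRIMARY KEY, line TEXT)"
    )
    with open(src_path, "r") as src:
        conn.executemany(
            "INSERT INTO obs (row_no, line) VALUES (?, ?)",
            ((i, line.rstrip("\n")) for i, line in enumerate(src, start=1)),
        )
    conn.commit()


def fetch_rows(conn, first_row, n_rows):
    """Return n_rows consecutive lines starting at row first_row.
    The lookup goes through the primary key, not a full scan."""
    cur = conn.execute(
        "SELECT line FROM obs WHERE row_no >= ? AND row_no < ? "
        "ORDER BY row_no",
        (first_row, first_row + n_rows),
    )
    return [row[0] for row in cur]
```

In practice the columns would more likely be stored separately rather than 
as one text field, and the same table could be queried directly from R, for 
instance through the DBI and RSQLite packages.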

 	-thomas



