[R] read.table for a subset of data

Thaden, John J ThadenJohnJ at uams.edu
Mon Mar 12 16:33:23 CET 2007


Feng,
   I had the same question as you, how to read a subset of data, and the same
reaction as Wensui when I discovered that read.table could not.  Even if my
computer's memory were up to it, I am troubled by the idea of reading in 1.8
GB of data (in my case) to get just 4,000 numbers, for instance, particularly
if I'm then going to iterate through the entire dataset in 4,000-number
chunks.  
   I ended up defining a NetCDF format to hold my data using the RNetCDF
package, since that package's var.get.nc() function is perfectly able to read
subsets of a NetCDF variable.  Furthermore, NetCDF files allow data to be
matrices and even higher order arrays, from which you can then retrieve any
chunk by including var.get.nc 'start' and 'count' arguments in the form of
vectors of length equal to the number of array dimensions.  Once a NetCDF
format is defined, all else is painless.  One limitation is that the RNetCDF
package only supports version 3 of the NetCDF library, a version that puts a
2 GB limit on a variable's size.  Version 4 removes this limitation; I'm
hopeful that some day an R package will provide an interface to the NetCDF
version 4 library.
John Thaden
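The approach John describes can be sketched as follows with the RNetCDF
package.  The file path, variable name, and dimension sizes here are
hypothetical, chosen only to illustrate the 'start'/'count' subsetting;
they are not from his actual dataset.

```r
library(RNetCDF)

# -- One-time conversion: store a matrix in a NetCDF (v3) file ---------
ncpath <- tempfile(fileext = ".nc")
nc <- create.nc(ncpath)
dim.def.nc(nc, "row", 10000)   # hypothetical dimension sizes
dim.def.nc(nc, "col", 10)
var.def.nc(nc, "measurements", "NC_DOUBLE", c("row", "col"))
var.put.nc(nc, "measurements", matrix(rnorm(10000 * 10), 10000, 10))
close.nc(nc)

# -- Later: pull a 4,000-number chunk without reading the rest ---------
nc <- open.nc(ncpath)
chunk <- var.get.nc(nc, "measurements",
                    start = c(2001, 3),  # first index along each dimension
                    count = c(4000, 1))  # how many values along each
close.nc(nc)
```

Because 'start' and 'count' are vectors with one entry per array dimension,
the same call generalizes to any rectangular chunk of a higher-order array.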

Message: 22
Date: Sun, 11 Mar 2007 21:33:04 -0500
From: "jim holtman" <jholtman at gmail.com>
Subject: Re: [R] read.table for a subset of data
To: "Wensui Liu" <liuwensui at gmail.com>
Cc: r-help <r-help at stat.math.ethz.ch>
Message-ID:
	<644e1f320703111933g3e5cec0l16b485f2fc0a3dbb at mail.gmail.com>
Content-Type: text/plain

If you know which 10 rows to read, then you can 'skip' to them, but the
system still has to read each line one at a time.

I have a 200,000 line csv file of numerics that takes me 4 seconds to read
in with 'read.csv' using 'colClasses', so I would guess your 100K line file
would take half of that.  Is 2 seconds of time a waste of resources?
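A minimal sketch of the 'skip' approach Jim mentions, using base R's
read.csv; the file and column names are invented for illustration.  Note
that 'skip' also discards the header line, so column names and classes
must be supplied explicitly.

```r
# Write a small example csv: a header plus 100 data rows
csvpath <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:100, y = rnorm(100)), csvpath,
          row.names = FALSE)

# Read only data rows 51-60: skip the header plus the first 50 data rows
subset10 <- read.csv(csvpath, skip = 51, nrows = 10, header = FALSE,
                     col.names  = c("x", "y"),
                     colClasses = c("integer", "numeric"))
```

As Jim says, read.csv still scans every skipped line; 'skip' saves parsing
and memory, not the sequential read itself.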


On 3/11/07, Wensui Liu <liuwensui at gmail.com> wrote:
>
> Jim,
>
> Glad to see your reply.
>
> Referring to your email, what if I just want to read 10 rows from a csv
> table with 100,000 rows? Do you think it is a waste of resources to read
> the whole table in?
> Any thoughts?
>
> wensui
>
> On 3/11/07, jim holtman <jholtman at gmail.com> wrote:
> > Why can't you read in the whole data set and then create the
> > subsets?  This is easily done with 'split'.  If the data is too
> > large, then consider a database.
> >
> > On 3/11/07, gnv shqp <gnvshqp at gmail.com> wrote:
> > >
> > > Hi R-experts,
> > >
> > > I have data from four conditions of an experiment.  I tried to create
> > > four subsets of the data with read.table, for example,
> > > read.table("Experiment.csv", subset = (condition == "1")).  I found a
> > > similar post in the archive, but the answer to that post was no.  Any
> > > new ideas about reading subsets of data with read.table?
> > >
> > > Thanks!
> > >
> > > Feng
> > >
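Jim's 'split' suggestion above can be sketched as follows, assuming (as in
Feng's question) a data frame with a 'condition' column; the data here are
invented for illustration.

```r
# Read the whole table once, then partition it by experimental condition
dat <- data.frame(condition = rep(1:4, each = 5), value = rnorm(20))
by.condition <- split(dat, dat$condition)

# Each element of the resulting list is one condition's subset
cond1 <- by.condition[["1"]]   # rows where condition == 1
```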



