[R] Scanning only specific columns into R from a VERY large file

Sat Apr 17 00:54:51 CEST 2010

On 2010-04-16 16:21, Sharpie wrote:
>
>
> Josh B-3 wrote:
>>
>> Hi,
>>
>> I turn to you, the R Sages, once again for help. You've never let me down!
>>
>> (1) Please make the following toy files:
>>
>> x<- read.table(textConnection("var.1 var.2 var.3 var.1000
>> indv.1 1 5 9 7
>> indv.210000 2 9 3 8"), header = TRUE)
>>
>> y<- read.table(textConnection("var.3 var.1000"), header = TRUE)
>>
>> write.csv(x, file = "x.csv")
>> write.csv(y, file = "y.csv")
>>
>> (2) Pretend you are starting with the files "x.csv" and "y.csv." They come
>> from another source -- an online database. Pretend that these files are
>> much, much, much larger. Specifically:
>>      (a) Pretend that "x.csv" contains 1000 columns by 210,000 rows.
>>      (b) "y.csv" contains just header titles. Pretend that there are 90
>> header titles in "y.csv" in total. These header titles are a subset of the
>> header titles in "x.csv."
>>
>> (3) What I want to do is scan (or import, or whatever the appropriate word
>> is) only a subset of the columns from "x.csv" into R. Specifically, I
>> only want to scan the columns of data from "x.csv" into R that are
>> indicated in the file "y.csv." I still want to scan in all 210000 rows
>> from "x.csv," but only for the aforementioned columns listed in "y.csv."
>>
>> Can you guys recommend a strategy for me? I think I need to use the scan
>> command, based on the hugeness of "x.csv," but I don't know what exactly
>> to do. Specific code that gets the job done would be the most useful.
>>
>> Thank you very much in advance!
>> Josh
>>
>
> read.csv.sql() from the sqldf package looks like it may do what you want: it
> allows you to filter what gets read in from a CSV file using SQL statements,
> something like:
>
>    SELECT list,of,column,names FROM file
>
>
> Hope this helps!
>
> -Charlie

That's probably the best way. A crude way might be to
read in one row from each file (using nrows = 1), then
use the names to define a colClasses vector whose
elements are NA for columns to be read and "NULL" for
columns to be skipped, and then read x.csv with that
colClasses vector.
I have no idea how slow this would be.
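
A minimal sketch of the colClasses idea, using the toy files from
the original post. The object names (x_file, y_file, all_cols,
wanted, col_classes, x_sub) and the use of tempdir() are only for
illustration. Note that write.csv() adds an unnamed row-label
column, which read.csv() renames to "X"; files from another source
may not have that column.

```r
# Recreate the toy files from the original post in a temporary directory
# (an assumption for this sketch; the real files would already exist).
x <- read.table(textConnection("var.1 var.2 var.3 var.1000
indv.1 1 5 9 7
indv.210000 2 9 3 8"), header = TRUE)
y <- read.table(textConnection("var.3 var.1000"), header = TRUE)

x_file <- file.path(tempdir(), "x.csv")
y_file <- file.path(tempdir(), "y.csv")
write.csv(x, file = x_file)
write.csv(y, file = y_file)

# Step 1: read at most one data row from each file, just to get the
# column names. Both headers pass through the same name checking,
# so the names stay comparable.
all_cols <- names(read.csv(x_file, nrows = 1))
wanted   <- names(read.csv(y_file, nrows = 1))

# Step 2: build the colClasses vector: "NULL" skips a column,
# NA lets read.csv() work out the type of a column that is kept.
col_classes <- rep("NULL", length(all_cols))
col_classes[all_cols %in% wanted] <- NA

# Step 3: read the big file, keeping only the wanted columns.
x_sub <- read.csv(x_file, colClasses = col_classes)
str(x_sub)
```

With the toy files, the row-label column "X" is kept only because
write.csv() put it in both headers. If the labels are needed but
not listed in y.csv, one option is to add all_cols[1] to wanted
before step 2.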

  -Peter Ehlers