[R] How to import BIG csv files with separate "map"?

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue Jul 14 21:50:27 CEST 2009


Hi,

On Jul 14, 2009, at 1:53 PM, giusto wrote:

>
> Hi all,
>
> I am having problems importing a VERY large dataset in R. I have looked
> into the package ff, and that seems to suit me, but also, from all the
> examples I have seen, it either requires a manual creation of the
> database, or it needs a read.table kind of step. Being a survey kind of
> data the file is big (like 20,000 times 50,000 for a total of about
> 1.2Gb in plain text) the memory I have isn't enough to do a read.table
> and my computer freezes every time :(

Look at the documentation near the end of ?read.table:

"""Note that unless colClasses is specified, all columns are read as  
character columns and then converted. This means that quotes are  
interpreted in all fields and that a column of values like "42" will  
result in an integer column."""

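So all the data is first read in as character strings, and only then
does R try to convert each column to the appropriate data type. Holding
both the strings and the converted values takes a lot of memory.

Specifying the type of each column via the colClasses argument skips
that conversion step and may get you where you need to be.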
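Something along these lines is what I have in mind. It is only a sketch:
the small temporary file stands in for your real csv, and the column
names (id, age, income, region) are made up for illustration. It guesses
the classes from the first few rows, which can fail if a later row does
not fit the guessed type, so check the result against your codebook.
Setting a class to "NULL" makes read.csv skip that column entirely,
which helps if you only need a subset of the variables.

```r
# Hypothetical stand-in for the real survey file: a tiny CSV in a temp path.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:5,
                     age    = c(31L, 45L, 27L, 62L, 38L),
                     income = c(1200.5, 3400, 2100.25, 800, 4100.75),
                     region = c("N", "S", "S", "E", "W")),
          csv_path, row.names = FALSE)

# Read only a few rows to learn the column classes cheaply.
head_rows   <- read.csv(csv_path, nrows = 3, stringsAsFactors = FALSE)
col_classes <- vapply(head_rows, function(x) class(x)[1], character(1))

# Drop a column we do not need: "NULL" tells read.csv to skip it.
col_classes["region"] <- "NULL"

# Re-read the whole file with the classes fixed up front.
survey <- read.csv(csv_path, colClasses = col_classes, comment.char = "")
str(survey)
```

If you know the number of rows, passing nrows as well can reduce memory
use further.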

> This far I have managed to import the required subset of the data by
> using a "cheat": I used GRETL to read an equivalent Stata file (released
> by the same source that offered the csv file), manipulate it and export
> it in a format that R can read into memory.

I'm not sure if you're suggesting that R can read in the whole data
file when it is stored in some binary format (such as the Stata file).
If so, perhaps the colClasses trick above might work.

If read.table with colClasses doesn't work (and you know you can load
the entire dataset into R via some binary format), perhaps you'll have
to parse the file piece by piece: open it with a "file(.., 'r')"
command, and use "scan" (or readLines, or readChar?) to run through the
file without having to load it all into memory at once.
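A rough sketch of the piece-by-piece approach, under some assumptions:
the temporary file and its two columns (id, score) are invented for the
example, I use readLines plus read.csv(text = ...) instead of scan
because it is easier to show, and the per-chunk work here is just a
running row count and sum. In practice you would keep only the rows or
columns you need from each chunk.

```r
# Hypothetical stand-in for the big file: 10 data rows plus a header.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", paste(1:10, (1:10) * 1.5, sep = ",")), csv_path)

# Open a read connection; successive reads continue where the last stopped.
con    <- file(csv_path, "r")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

chunk_size <- 4    # lines per chunk; use something much larger for real data
total_rows <- 0
score_sum  <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break    # end of file reached

  # Parse just this chunk, with column classes fixed up front.
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    colClasses = c("integer", "numeric"))

  # Reduce the chunk to what we need, then let it be garbage collected.
  total_rows <- total_rows + nrow(chunk)
  score_sum  <- score_sum + sum(chunk$score)
}
close(con)
```

Only one chunk is in memory at a time, so the chunk size is the knob
that trades speed against memory.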

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University

Contact Info: http://cbio.mskcc.org/~lianos/contact



