[R] large dataset

Gabor Grothendieck ggrothendieck at gmail.com
Mon Mar 29 22:52:18 CEST 2010


On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <tlumley at u.washington.edu> wrote:
> On Sun, 28 Mar 2010, kMan wrote:
>
>>> This was *very* useful for me when I dealt with a 1.5Gb text file
>>>
>>> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/
>>
>> Two hours is a *very* long time to transfer a csv file to a db. The author
>> of the linked article has not documented how to use scan() arguments
>> appropriately for the task. I take particular issue with the author's
>> statement that "R is said to be slow, memory hungry and only capable of
>> handling small datasets," indicating he/she has crummy informants and has
>> not challenged the notion him/herself.
>
>
> Ahem.
>
> I believe that *I* am the author of the particular statement you take issue
> with (although not of the rest of the page).
>
> However, when I wrote it, it continued:
> ---------
> "R (and S) are accused of being slow, memory-hungry, and able to handle only
> small data sets.
>
> This is completely true.
>
> Fortunately, computers are fast and have lots of memory. Data sets with a
> few tens of thousands of observations can be handled in 256Mb of memory, and
> quite large data sets with 1Gb of memory.  Workstations with 32Gb or more to
> handle millions of observations are still expensive (but in a few years
> Moore's Law should catch up).
>
> Tools for interfacing R with databases allow very large data sets, but this
> isn't transparent to the user."

I don't think the last sentence is true if you use sqldf. Assuming
the standard type of csv file accepted by sqldf:

install.packages("sqldf")
library(sqldf)
DF <- read.csv.sql("myfile.csv")

is all you need. The install.packages statement downloads and
installs sqldf, DBI and RSQLite (which in turn installs SQLite
itself). read.csv.sql then sets up the database and table layouts,
reads the file into the database, reads the data from the database
into R (bypassing R's read routines) and finally destroys the
database, all transparently.
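
For a file too large to load whole, the sql argument of read.csv.sql
can filter rows inside the database so that only the matching subset
reaches R. What follows is only a minimal sketch: the temporary file,
the column names (id, value) and the threshold are made up for
illustration, and it assumes a plain comma-separated file without
quoted fields. Inside the SQL statement the csv file is referred to
as "file".

```r
library(sqldf)

# Hypothetical sample data, written unquoted to a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 1.5),
          tmp, row.names = FALSE, quote = FALSE)

# Only rows satisfying the WHERE clause are returned to R;
# the table is named "file" in the SQL statement
DF <- read.csv.sql(tmp, sql = "select * from file where value > 7.5")

nrow(DF)
```

The filtering happens in SQLite, so R only ever holds the selected
rows rather than the full file.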


