[R] How to Read a Large CSV into a Database with R

Gabor Grothendieck ggrothendieck at gmail.com
Mon Nov 15 16:41:42 CET 2010


On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying to
> insert a very large CSV file into a SQLite database.  I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
>
> I'm trying to work with the American Community Survey data, which is two
> 1.3GB csv files.  I have enough RAM to read one of them into memory, but not
> both at the same time.  So, in order to analyze them, I'm trying to get them
> into a SQLite database so I can use the R survey package's database-backed
> survey objects capabilities (
> http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
>
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.
>
> I've tried using read.csv.sql, but it finishes without an error and then
> only shows me the table structure when I run the final select statement.
> When I run these exact same commands on a smaller CSV file, they work fine.
> I imagine this is not working because the csv is so large, but I'm not sure
> how to confirm that or what to change if it is.  I do want to get all
> columns from the CSV into the data table, so I don't want to filter
> anything.
>
> library(sqldf)
> setwd("R:\\American Community Survey\\Data\\2009")
> sqldf("attach 'sqlite' as new")
> read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> file" , dbname="sqlite")
> sqldf("select * from ss09pusa limit 3",dbname="sqlite")
>

What the above code does, which is unlikely to be what you intended,
is to create an sqlite database called 'sqlite' and then read in the
indicated file into sqlite, read it in into R from sqlite (clearly
this step will fail if the data is too big for R but it its not then
you are ok) and then delete the table from the database so your sqldf
statement should give an error since there is no such table or else if
you have a data frame in your R workspace called ss09pusa the sqldf
statement will load that into a database table and the retrieve its
first three rows and then delete the table.

This sort of task is probably more suitable for RSQLite than sqldf
but if you wish to do it with sqldf you need to follow example 9 or
example 10 on the sqldf home page:

In example 9,

http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases

its very important to note that sqldf automatically deletes any table
that it created after the sqldf or read.csv.sql statement is done so
to not have the table dropped is to make sure you issue an sql
statement that creates the table, "create table mytab as select ..."
rather than sqldf.

In example 10,

http://code.google.com/p/sqldf/#Example_10._Persistent_Connections

persistent connections are illustrated which represents an alternate
way to do this in sqldf.


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list