[R] How to Read a Large CSV into a Database with R

Mike Marchywka marchywka at hotmail.com
Mon Nov 15 20:28:34 CET 2010

----------------------------------------
> From: ajdamico at gmail.com
> Date: Mon, 15 Nov 2010 13:28:40 -0500
> To: ggrothendieck at gmail.com; r-help at r-project.org
> Subject: Re: [R] How to Read a Large CSV into a Database with R
>
> Hi Gabor,
>
> Thank you for your willingness to help me through this. The code you sent
> works on my machine exactly the same way as it does on yours.
> Unfortunately, when I run the same code on the 1.3GB file, it creates the
> table structure but doesn't read in a single line [confirmed with
> sqldf("select * from mytab",dbname="mydb")] Though I don't expect anyone to
> download it, the file I'm using is ss09pusa.csv from
> http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of
> code on my work desktop and personal laptop, so it's not machine-specific.

>
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this giant
> CSV into a database? If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,



I think someone else suggested external approaches, and indeed I have
loaded census TIGER files into a db for making maps and mobile apps etc.
I would mention again that this may eliminate a memory limit and let you
limp along, but presumably you want a streaming source or something
similar if your analysis has predictable access patterns and this data
will not be used as part of a hotel reservation system.
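
If you do want to stay in R, one way to stream the file into SQLite
without ever holding the whole thing in memory is to read fixed-size
chunks from an open connection and append each chunk to the table. A
rough sketch (untested against your file; the chunk size, file name,
and table name are just placeholders):

library(RSQLite)   # also pulls in DBI

db  <- dbConnect(SQLite(), dbname = "mydb")
con <- file("ss09pusa.csv", open = "r")

# the first chunk carries the header and creates the table
chunk <- read.csv(con, nrows = 10000)
cols  <- names(chunk)
dbWriteTable(db, "mytab", chunk, overwrite = TRUE)

# the connection stays open, so each read.csv resumes where the
# previous one stopped; read.csv errors out at end of file
repeat {
  chunk <- tryCatch(read.csv(con, nrows = 10000, header = FALSE,
                             col.names = cols),
                    error = function(e) NULL)
  if (is.null(chunk)) break
  dbWriteTable(db, "mytab", chunk, append = TRUE)
}

close(con)
dbDisconnect(db)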

Structured input data (I think TIGER was line oriented) should be easy
to load into a db with a bash script or a Java app; the sqlite3
command-line shell will also bulk-import a CSV directly, as sketched
below.
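
For example, something along these lines drives the sqlite3 shell from
R, so the data never passes through R at all. This is only a sketch: it
assumes the sqlite3 binary is on your PATH, older shells may need the
table created beforehand and ".separator ','" instead of ".mode csv",
and depending on the version the header row may be imported as data
rather than used to name the columns:

# write the sqlite3 dot-commands to a small script ...
writeLines(c(".mode csv",
             ".import ss09pusa.csv mytab"),
           "import.sql")

# ... then have the sqlite3 shell run it against the database
system('sqlite3 mydb ".read import.sql"')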


>
> Thanks again!
> Anthony
>
>
> > read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
> sql = "create table mytab as select * from file", dbname = "mydb")
> NULL
> Warning message:
> closing unused connection 3 (R:\American Community
> Survey\Data\2009\ss09pusa.csv)
> >
> > # how many records are in table?
> > N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
> >
> > # read in chunks and display what we have read
> >
> > k <- 4 # no of records to read at once
> > for(i in seq(0, N-1, k)) {
> + s <- sprintf("select * from mytab limit %d, %d", i, k)
> + print(sqldf(s, dbname = "mydb"))
> + }
> Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> > N
> [1] 0
>
>
>
>
> On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <
> ggrothendieck at gmail.com> wrote:
>
> > On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico wrote:
> > > Hi Gabor,
> > >
> > > Thank you for the prompt reply. I definitely looked over all of the
> > > examples on the code.google.com sqldf page before sending, which is why
> > I
> > > wrote the code
> > >
> > > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> > > file" , dbname="sqlite")
> > >
> > > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > "create
> > > table mytab as select * from file",dbname = "mydb")
> > >
> > > ..but I don't understand why this helps me around the memory problem,
> > since
> > > I think it still all gets read into memory. Is there a way to do this
> > line
> > > by line?
> >
> >
> > OK. Maybe it's something else.
> >
> > The reading in of the file into the database should not be a resource
> > problem provided you have enough disk space and appropriate
> > permissions. sqldf / RSQLite are used to get sqlite to do it so that
> > the data never goes through R at that stage so R limitations can't
> > affect the reading in to the sqlite database. When you read it from
> > the sqlite database then R limitations come into effect so you just
> > have to be sure not to read too much in at a time. The use of create
> > table ... as select ... is to prevent sqldf from deleting the table:
> > sqldf is normally used in a fashion where you don't want to know
> > about the back-end database, so it tries to create it and delete it
> > behind the scenes, but here you want to use it explicitly, so you
> > have to work around that.
> >
> > Try this example. It should be reproducible so you just have to copy
> > it and paste it into your R session. Uncomment the indicated line if
> > you want to be able to remove any pre-existing mydb file in the
> > current directory. Try it in a fresh R session just to be sure that
> > nothing mucks it up.
> >
> > library(sqldf)
> >
> > # uncomment next line to make sure we are starting clean
> > # if (file.exists("mydb")) file.remove("mydb")
> >
> > # create new database
> > sqldf("attach 'mydb' as new")
> >
> > # create a new file. BOD is built into R and has 6 rows.
> > write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
> >
> > # read new file into database
> > read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
> > dbname = "mydb")
> >
> > # how many records are in table?
> > N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
> >
> > # read in chunks and display what we have read
> >
> > k <- 4 # no of records to read at once
> > for(i in seq(0, N-1, k)) {
> > s <- sprintf("select * from mytab limit %d, %d", i, k)
> > print(sqldf(s, dbname = "mydb"))
> > }
> >
> > On my machine I get this output:
> >
> > Time demand
> > 1 1 8.3
> > 2 2 10.3
> > 3 3 19.0
> > 4 4 16.0
> > Time demand
> > 1 5 15.6
> > 2 7 19.8
> >
> > showing that it read the 6-line BOD data frame in chunks of 4 as required.
> > --
> > Statistics & Software Consulting
> > GKX Group, GKX Associates Inc.
> > tel: 1-877-GKX-GROUP
> > email: ggrothendieck at gmail.com
> >
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

