[R] Advantages of using SQLite for data import in comparison to csv files

Matthew Dowle mdowle at mdowle.plus.com
Fri Jan 15 19:51:19 CET 2010


Just to comment on this bit :

> For one thing, you cannot index a csv file or a data.frame. If you have to 
> repeatedly select subsets of your large data set, creating an index on the 
> relevant column in the sqlite table is an absolute life saver.

This is one reason the data.table package was created.  It is very similar 
to a data.frame, with the addition of things like keys.  If you consider an 
index in SQLite a life saver, then data.table might be up your street.

It's really simple to do this in R.

For example, let's say you have a 10,000,000 row table with 2 columns: id (integer) and v (double). This is a tiny example at just 115MB, i.e. (4+8)*1e7/1024^2. That is many times smaller than the RAM most desktop users have available to R on their 32-bit machines, so you can run this example yourself.
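
As a quick check, that back-of-the-envelope size can be computed in R itself (4 bytes per integer id, 8 bytes per double v, over 1e7 rows):

> (4+8)*1e7/1024^2
[1] 114.4409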

We'll start off with data.frame and then show how data.table compares.

> n=1e7
> DF = data.frame(id=sample(LETTERS,n,replace=TRUE),v=rnorm(n))
> head(DF)
  id           v
1  P -0.57695495
2  A -0.33388390
3  P  0.05277701
4  K  0.49610573
5  F  0.31031501
6  C  1.26861215
> dim(DF)
[1] 10000000        2
> system.time(DF[DF$id=="G",])   # table scan, i.e. the == in R tests all 10m items; there is no index used, as you said
   user  system elapsed
   1.23    0.17    1.41
> system.time(DF[DF$id=="G",])   # good idea to repeat all timings in case the first one was slow due to a garbage collect
   user  system elapsed
   1.36    0.17    1.53

> library(data.table)       # load the package (install.packages("data.table") if needed)
> DT = data.table(DF)       # convert the data.frame to a data.table
> object.size(DF)
120001608 bytes
> object.size(DT)
120001640 bytes              # so a DT is the same structure as a data.frame, taking the same space.  data.frame is efficiently stored; you can't get lower than the 115MB calculation for this dataset.  data.table doesn't do any magic with regard to storage.
>
# Now comes the fun :

> setkey(DT,id)       # sorts the data.table by id and marks it as sorted by id. The key is now id. This is analogous to building the index that you do in SQLite. You don't have to assign the result of setkey() anywhere, unlike most R functions; it's a special function which operates on DT and changes it directly.
> tables()            # view all the tables in memory; there's just one in this example. The output assumes you view it in a fixed-width font such as Courier.
     NAME       NROW  MB COLS KEY
[1,] DT   10,000,000 115 id,v id
Total: 115MB
> # R is looking a bit like a database now.  Usually you have lots of tables returned by tables(), just like you would with sqlTables() in RODBC.

It's not difficult to use the key; it's very like using rownames:

> system.time(DT["G",mult="all"])
   user  system elapsed
   0.09    0.00    0.09
> system.time(DT["G",mult="all"])      # repeat the timing test.  confirmed 
> that 1.5s reduces to 0.1s
   user  system elapsed
    0.1     0.0     0.1
> identical(   DF[DF$id=="G",]$v,     DT["G",mult="all"]$v   )
[1] TRUE
>

The mult="all" is needed because otherwise you'd just get the first row of the "G" group (the default for mult is "first" when multiple rows match the key value).
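
For example, a sketch using the syntax above (the mult default described here reflects the data.table version at the time of this post):

DT["G"]                 # just the first row of the "G" group (mult="first" is the default)
DT["G", mult="all"]     # all rows whose key value is "G"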

It's very important to realise that in a data.table query you have the option to do table scans too, e.g.:

> system.time(DT[id=="G"])     # it's the == operator that results in a table scan, not the class of the table per se
   user  system elapsed
   1.39    0.12    1.52
>

So it's up to the user of data.table. The user has the option to use data.table badly (by not using the key), just like you can use SQL badly (by not using an index).
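
For comparison, the SQLite route Magnus describes (write the data out, then create an index on the relevant column) would look roughly like this with RSQLite; this is just a sketch using the current DBI interface, and the file, table and index names are made up for illustration:

library(RSQLite)
con <- dbConnect(SQLite(), "example.db")                   # hypothetical database file
dbWriteTable(con, "DF", DF, overwrite = TRUE)              # dump the data.frame into SQLite
dbExecute(con, "CREATE INDEX idx_id ON DF (id)")           # the index on the id column
res <- dbGetQuery(con, "SELECT * FROM DF WHERE id = 'G'")  # indexed subset, analogous to DT["G",mult="all"]
dbDisconnect(con)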

Ok, who cares? One second saved. Why all the effort? Two reasons. Firstly, try doubling the rows in DF and the time doubles to 3 seconds; double the rows in the DT, though, and that time stays at roughly 0.1 seconds. Keep on doubling and the savings get bigger and bigger. Secondly, if you're repeating that 'query' inside a loop, then the time saving adds up. There are other advantages of data.table, but it seemed appropriate just to mention the keys this time.
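
If you want to verify that scaling claim yourself, here is a rough sketch (timings and feasible sizes depend on your machine; the largest size below needs well over 1GB of free RAM):

library(data.table)
for (n in c(1e7, 2e7, 4e7)) {                  # double the number of rows each time
  DF <- data.frame(id=sample(LETTERS, n, replace=TRUE), v=rnorm(n))
  DT <- data.table(DF)
  setkey(DT, id)
  cat("n =", n, "\n")
  print(system.time(DF[DF$id=="G", ]))         # vector scan: time grows with n
  print(system.time(DT["G", mult="all"]))      # keyed lookup: stays around 0.1s
}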

HTH


"Magnus Torfason" <zulutime.net at gmail.com> wrote in message 
news:4B4F9C70.2050804 at gmail.com...
> RSQLite is actually much more than a "driver for data import". It 
> basically allows any manipulation of SQLite databases, including reading, 
> writing, or updating and munging data within the database.
>
>
> Regarding the original question of data import:
>
> I use csv and sqlite files interchangeably to store my data, and 
> converting between one and the other is generally trivial (read one format 
> into a data.frame and then write it into the other).
>
> For me, the key determinant is whether a given data set is so big that 
> reading it into a data.frame with read.csv() every time I need it is a 
> pain. I usually keep all my original data in csv files and write routines 
> to write the contents of huge files into sqlite databases.
>
> sqldf is a great package as well, but it definitely does not obviate the
> need to actually be able to manipulate database files. For one thing, you
> cannot index a csv file or a data.frame. If you have to repeatedly select 
> subsets of your large data set, creating an index on the relevant column 
> in the sqlite table is an absolute life saver.
>
> (However, with a 1000x20 data set, you will probably not have to worry 
> about the speed of selecting from the table. Unless you need to use bigger 
> data sets, the simplest way is probably to just use csv files, read the 
> contents into a data.frame with read.csv, and then use sqldf on the 
> data.frame if you need to do complicated subsetting)
>
> Best,
> Magnus
>
> On 1/14/2010 2:12 AM, Juliet Jacobson wrote:
>> But then why has the R driver for data import from a SQLite database
>> been written?
>>
>> Gabor Grothendieck wrote:
>>> You could look at read.csv.sql in sqldf (http://sqldf.googlecode.com) as 
>>> well.
>


