[R] Advantages of using SQLite for data import in comparison to csv files

Fri Jan 15 15:49:15 CET 2010

Thanks for your answer.
I've thought of the possibility of an index when using a SQLite database, but 
as you mentioned, I don't really benefit from it in regard of rather small 
data sets. 
What I am considering as a problem when using csv files is the occurrence of 
data redundancy: When I wan't to print a table in a document, it is often not  
necessary to insert all columns of a table. You might choose to leave out 
certain parameters that are aren't of interest for the reader. I feared that 
the generation of a second table with less information would be unavoidable 
in this case before including it in LaTeX with Sweave. However, after having 
tried to get started with sqldf, this powerful addition to R's data analysis 
and manipulation abilities is probably sufficient for my purposes.

Best regards,
Juliet

> RSQLite is actually much more than a "driver for data import". It
> basically allows any manipulation of SQLite databases, including
> reading, writing, or updating and munging data within the database.
>
>
> Regarding the original question of data import:
>
> I use csv and sqlite files interchangeably to store my data, and
> converting between one and the other is generally trivial (read one
> format into a data.frame and then write it into the other).
>
> For me, the key determinant is whether a given data set is so big that
> reading it into a data.frame with read.csv() every time I need it is a
> pain. I usually keep all my original data in csv files and write
> routines to write the contents of huge files into sqlite databases.
>
> sqldf is a great package as well, but it definitely does not obviate the
> need to actually be able to manipulate data base files. For one thing,
> you cannot index a csv file or a data.frame. If you have to repeatedly
> select subsets of your large data set, creating an index on the relevant
> column in the sqlite table is an absolute life saver.
>
> (However, with a 1000x20 data set, you will probably not have to worry
> about the speed of selecting from the table. Unless you need to use
> bigger data sets, the simplest way is probably to just use csv files,
> read the contents into a data.frame with read.csv, and then use sqldf on
> the data.frame if you need to do complicated subsetting)
>
> Best,
> Magnus