[Rd] RSQLite indexing

Kasper Daniel Hansen khansen at stat.Berkeley.EDU
Tue Oct 23 03:07:18 CEST 2007


On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:

>
> I am trying to use RSQLite for storing data and  I need to create  
> indexes on
> two variables in the table. It appears from searching the web that  
> the CREATE
> INDEX operation in SQLite is relatively slow for large files, and  
> this has been
> my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels  
> [yes, it's
> genetic association data]. The other will have about 4000 levels  
> eventually,
> but is up to about 100 now.   When the data were entered they were  
> already ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted  
> variable and about 12 hours on the 350,000-level unsorted  
> variable.  I'm looking for advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the  
> presorted one?
> 2/ would it be faster or slower if the index were created before  
> adding all the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data,  
> so optimizations that are risky in the case of power loss or  
> hardware failure are still acceptable.  Since Bioconductor seems to  
> use SQLite a lot I'm hoping there is some simple solution.

I have not used RSQLite, but have some experience doing this thing  
for big sqlite databases using the command line client. Every  
database in sqlite has a number of parameters associated with it. You  
want to make sure that cache_size is at _most_ 2000 (yes, I know this  
is totally counterintuitive as it tells sqlite to use as little  
memory as possible). You also tell it to be non-synchronous. In  
sqlite the commands are
sqlite> pragma default_cache_size = 2000
sqlite> pragma_synchronous = off
You can test the setting of these parameters by just doing a
sqlite> pragma default_cache_size
As far as I remember, cache size can only be set when you create the  
database. I have no idea how RSQlite handles it.

When I asked about this problem on the sqlite mailing list, the  
sqlite-creator said that this was a "locality of reference problem"  
and that it was being "worked on". And that I could search the  
archives for more info (which did not help me back then).

I don't know whether or not sorting helps.

Another thing to do is to check in what amount sqlite sits idle while  
doing I/O. It is probably impossible to avoid some idleness with such  
a thing, but it should of course be kept to a minimum.

It is true that some of the other databases are probably much faster  
at creating indices. But in the post-index analysis, sqlite is a  
really fast database, probably amongst the fastest there is. It does  
not do a good job a converting your queries into smart queries, but  
if you are doing something simple, it is blazingly fast with the  
right user options.

Kasper



More information about the R-devel mailing list