[Rd] RSQLite indexing

Thomas Lumley tlumley at u.washington.edu
Tue Oct 23 00:57:14 CEST 2007


On Mon, 22 Oct 2007, hadley wickham wrote:

> On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
>>
>> I am trying to use RSQLite for storing data and  I need to create indexes on
>> two variables in the table. It appears from searching the web that the CREATE
>> INDEX operation in SQLite is relatively slow for large files, and this has been
>> my experience as well.
>>
>> The two index variables are crossed. One has about 350,000 levels [yes, it's
>> genetic association data]. The other will have about 4000 levels eventually,
>> but is up to about 100 now.   When the data were entered they were already ordered by this second index variable.
>>
>> Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable.  I'm looking for advice on how to reduce this. Specifically
>
> How big is your dataset?  SQLite can be slow, but I didn't think it
> was that slow.

350000x100 entries on 12 variables

>> 1/ would it be faster if the variable with more levels was the presorted one?
>
> I didn't think this would matter (and can imagine situations where it
> would be worse).  Is there a particular reason you think this might
> help?

Mainly that it would be nice if it did.  Preordering might help more for the larger number of levels because the smaller number of levels gives more valid places to insert a record into the index.  I could just experiment, but since it takes so long I thought it was worth asking if anyone had actual knowledge.


>> 2/ would it be faster or slower if the index were created before adding all the data?
>
> It's generally much faster to create the index after loading all the data.

Yes. The reason I thought it might not be in this case is that the indexing operation uses remarkably little CPU (and doesn't use anywhere near the disk bandwidth) so I hoped there might be some saving in doing both at once.



>> 3/ are there any options that can be set to speed up the indexing?
>
> Have you tried 'vacuum'ing your database prior to indexing?

Yes. But it has never had anything deleted from it, so it's not surprising that it didn't help.

> Will you be indexing on both columns simultaneously?  If so, you might
> try creating a single index.

No, indexing will be on one or the other column, not on both at once.

      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



More information about the R-devel mailing list