[R] RSQLite slowness

Thu Oct 7 10:58:09 CEST 2021

On Wed,  6 Oct 2021 16:23:15 +0000
Rasmus Liland <jral using posteo.no> wrote:

> 	 "SELECT * FROM gene2refseq
> 	  LEFT JOIN gene_info ON
> 	    gene_info.GeneID = gene2refseq.GeneID
> 	  WHERE gene2refseq.`RNA_nucleotide_accession.version` 
> 	    LIKE ?"

<...>

> 	x1 <- DBI::dbGetQuery(conn=conn, 
> 	  statement=statement, 
> 	  param=list(Håkan20210914$RNANucleotideAccession))

I think that the problem here is that you pass a vector as a bound
parameter to LIKE, when parameter placeholders usually expect a scalar.
DBI transparently handles this:

>> The elements of the `params` argument do not need to be scalars,
>> vectors of arbitrary length (including length 0) are supported.  For
>> queries, calling dbFetch() binding such parameters returns
>> concatenated results, equivalent to binding and fetching for each
>> set of values and connecting via rbind().

I think this means that DBI runs a SELECT for each value in
Håkan20210914$RNANucleotideAccession, which is understandably slower
than a single query. Unfortunately, it's hard to pass vectors of values
to queries with bound parameters; the SQL engines I know don't have a
syntax for "WHERE param IN (:multi_placeholder:)". SQLite comes with
carray [1], but I don't know whether it's exposed by RSQLite (could be
hard to do in a pointer-safe way), and you're already aware of the
traditional way of doing that: create a temporary table, populate it
and JOIN with the rest of the query.

-- 
Best regards,
Ivan

[1] https://www.sqlite.org/carray.html