[R] RJDBC vs RMySQL vs ???

Wed Jun 23 23:20:11 CEST 2010

Sorry for the lack of details. Since I run the same SQL first directly
on MySQL (using the MySQL Query Browser) and then again using R
through the RJDBC interface, I assume that I won't simply have a badly
constructed SQL query. However, just to clear possible objection, here
the SQL:

# Extracts vector of data points
getData <- function(connection) {
	queryStart <- "SELECT id1, id2, x, y FROM `mytable` "
	queryEnd <- ";"
	query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(",
id2s, ") AND subtype='TYPE1'", queryEnd)
	# execute query
	data =  dbGetQuery(connection, query)
	return(data)
}

When running this method using either RGUI or the command line, I have
a runtime that reaches an incredible 10 minutes (!) for selecting
about 50k - 80k data points (which I consider not much) based on the
range of IDs I choose. The table size is about 5-8 million data points
total. The same SQL query directly executed in MySQL Query Browser
takes about 20 seconds which I would consider fine. There are no
indices created for any of the fields but since the query runs a lot
faster in the query browser I don't suspect this to be the main
reason.

Any ideas?

Best,
Ralf

On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald
<jmacdon at med.umich.edu> wrote:
> Hi Ralf,
>
> Ralf B wrote:
>>
>> I am running a simple SQL SELECT statement that involvs 50k + data
>> points using R and the RJDBC interface. I am facing very slow response
>> times in both the RGUI and the R console. When running this SQL
>> statement directly in a SQL client I have processing times that are a
>> lot lot faster (which means that the SQL statement itself is not the
>> problem).
>>
>> Did any of you compare RJDBC vs RMySQL or is there a better, more
>> efficient way to extract large data from databases using R? Would you
>> recommend dumping data out completely into flat files and working with
>> flat files instead? I expected that this would not be such a problem
>> given that businesses maintain their data in DBs and R is supposed to
>> be good in shifting around data. Am I doing something wrong?
>
> Well, if you don't show people what you have done, how can anybody tell if
> you are doing something wrong or not?
>
> I have no experience with RJDBC, so cannot say anything about that. However,
> I have always found RMySQL to be speedy enough. As an example:
>
>> library(RMySQL)
> Loading required package: DBI
>> con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user =
>> "genome", dbname = "hg18")
>> system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where
>> chrom='chr1' and chromStart between 1 and 1e8;")
> + )
>   user  system elapsed
>   7.95    0.06   38.59
>> dim(a)
> [1] 508676      2
>
> So 40 seconds to get half a million records. Since this is via the internet,
> I have to imagine things would be much faster querying a local DB.
>
> But then you never say what constitutes 'slow' for you, so maybe this is
> slow as well?
>
> Best,
>
> Jim
>
>
>>
>> Ralf
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be
> used for urgent or sensitive issues
>