[R] R & MySQL (Databases)

[Ricardo Rodriguez] Your XEN ICT Team webmaster at xen.net
Sat Oct 16 16:24:43 CEST 2010


Hi!

Santosh Srinivas wrote:
> Dear R-helpers,
> 
> Considering that a substantial part of analysis is related data
> manipulation, I'm just wondering if I should do the basic data part in a
> database server (currently I have the data in .txt file).
> For this purpose, I am planning to use MySQL. Is MySQL a good way to go
> about it? Are there any anticipated problems that I need to be aware of?

I'm afraid I have no real answers to your questions, only more questions 
and, perhaps, another point of view from a different world!

> 
> Considering that many users here use large datasets: do you typically
> store the data in databases and query relevant portions for your analysis?
> Does it speed up the entire process? Is it neater to do things in a
> database? (e.g., errors could be corrected at the data import stage itself
> by conditions defined on the data in the database, as opposed to
> discovering things when you do the analysis in R and realize something is
> wrong in the output?)

Please, what do you mean by "large datasets"?

I wouldn't consider only processing speed, but also how the global 
repository of data is constructed. I mean, the problem is not only 
access speed now, but being able to identify any set of data in the 
future. For this, an RDBMS could be as useful as a hierarchical folder 
structure plus well-designed file names.
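To make the central-repository idea concrete, here is a minimal, hypothetical sketch of loading a .txt file into MySQL from R with the RMySQL package and then pulling back only the portion an analysis needs. The connection details, file name, and table and column names are all placeholders, not anything from your setup:

```r
## Hedged sketch: push a text file into MySQL once, query subsets later.
## All names below (user, database, file, columns) are made-up examples.
library(RMySQL)

con <- dbConnect(MySQL(), user = "analyst", password = "secret",
                 host = "localhost", dbname = "research")

## Load the flat file once into a table...
raw <- read.table("data.txt", header = TRUE)
dbWriteTable(con, "raw_data", raw, overwrite = TRUE)

## ...then query only the relevant portion for each analysis,
## instead of re-reading the whole file every time.
subset <- dbGetQuery(con,
                     "SELECT * FROM raw_data WHERE value > 100")

dbDisconnect(con)
```

One nice side effect, as you suggest, is that constraints and type declarations on the table can catch bad rows at import time rather than mid-analysis.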

> 
> This is vis-à-vis using the built-in SQLite, indexing, etc. capabilities
> in R itself? Does performance work better with a database backend
> (especially for simple but large datasets)?

As you said, R itself has powerful tools for data filtering and 
rearrangement. I've only seen some problems related to genomics 
analysis where an external tool was required to manage some huge 
matrices. And that was some time ago, and some patches were on their 
way to solve this problem within R.
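For the built-in route you mention, here is a minimal sketch of using SQLite from within R via the RSQLite package, with an in-memory database. The table and column names are invented purely for illustration:

```r
## Hedged sketch: SQLite inside R, no external server required.
## "trades", "ticker" and "price" are made-up example names.
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")   # in-memory database

dbWriteTable(con, "trades",
             data.frame(ticker = c("A", "B", "A"),
                        price  = c(10, 20, 12)))

## SQL filtering/aggregation without leaving R:
res <- dbGetQuery(con,
                  "SELECT ticker, AVG(price) AS avg_price
                   FROM trades GROUP BY ticker")

dbDisconnect(con)
```

For a single user on one machine this avoids running a MySQL server at all; a file-backed database (a path instead of ":memory:") would persist between sessions.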

What I see here, with datasets coming from experimental designs 
pouring into Excel spreadsheets, and analytical facilities generating 
big (around 1 GB/day) plain-text files, is that we have such huge 
variability in model structure that it would be quite expensive to 
program interfaces for all the processes to store data in a central 
repository managed by an RDBMS.

Even worse! During these last years some of our information has been 
moved to object-oriented databases, so the problem is becoming more complex.

> 
> The financial applications that I am thinking of are not exactly realtime
> but quick response and fast performance would definitely help.
> 
> Aside info, I want to take things to a cloud environment at some point of
> time just because it will be easier and cheaper to deliver.
> 
> Kind of an open question, but any inputs will help.

As you see, there are no answers here, only more doubts, as I'm in a 
similar situation. So any ideas will be extremely welcome to us!

Thanks!



-- 
Ricardo Rodríguez
Your XEN ICT Team


