[BioC] GEOmetadb and gse organism type

Laurent Gautier laurent at cbs.dtu.dk
Fri Mar 20 21:47:56 CET 2009


Wacek Kusnierczyk wrote:
> Laurent Gautier wrote:
>> Wacek Kusnierczyk wrote:
>>> i agree with that having multiple comma-separated organism entries for a
>>> single data set or data series denormalizes the database.  but enforcing
>>> normalization is good only if it helps, not hinders using the data.  i
>>> think that having such an organism summary for a data series is
>>> desirable for the purpose of filtering the data by organism, and while a
>>> complex sql query can achieve that, it's easier and more efficient to
>>> check a field directly in the gse table.
>> I'd be on Sean's side regarding the choices made.
>>
>> The habit to represent one-to-many or many-to-many associations with
>> comma-separated strings is an unfortunate one, I think. 
> 
> yes, but only as long as it is disturbing rather than helpful.
> 
>> It will be far more efficient (in terms of usage of computing
>> ressources) to filter data series with a query. 
> 
> not sure what you mean by 'far more efficient'.  where the data does not
> change too often, denormalizing preprocessing often leads to
> considerable speedup of subsequent queries -- it all depends on the
> shape of data.  here, you have some 250000 rows in the gsm table, but
> only a few samples per data series, which means that an on-the-fly join
> may actually be slower than using cached result from a previous join,
> stored as a comma-separated multiitem field.
> 
> in an earlier post, i provided an sql statement that creates an
> additional table, xgse (for 'extended gse').  the following are timings
> for retrieving all data series that involve at least one human sample:
> 
>     time sqlite GEOmetadb.sqlite '
>        select distinct gse
>           from gse
>           natural join gse_gsm
>           natural join gsm
>           where organism_ch1 like "%homo%" or organism_ch2 like "%homo%"
>        ' > /dev/null
>     # 3.353 s
> 
>     time sqlite GEOmetadb.sqlite '
>        select distinct gse
>           from xgse
>           where sample_organism_ch1 like "%homo%" or sample_organism_ch2
> "%homo%"
>        ' > /dev/null
>     # 0.031
> 
> the 'far more efficient' query is actually 2 orders of magnitude
> slower.  (or did you mean something different?)

No. The comment is general and about a commonly found attraction for 
fields in relational databases populated with comma-separated values.

Note also that a frequent use-case is to query for an exact term, such 
as 'Homo sapiens'. When the field is indexed, this makes the searches 
faster.

For a particular query, I would also use optimization hints before
going for such denormalization (use EXPLAIN QUERY PLAN with SQLite).

Otherwise, should optimization through a static representation be 
needed, a materialized view can be used. I am not certain this is 
possible with SQLite, but a cheap workaround would be to store a table 
with IDs from the join. This would keep the possibility to perform exact 
matches.

We might now be entering less trivial aspects of querying, but this is
where canned queries (like in a function) can help make it easily 
accessible to all users.

>> When it comes to the ease with which one can can achieve it, a
>> function performing the SQL query and returning the result can always
>> be written and distributed for all to use.
> 
> 
> sure.
> 
> 
> another issue:  as far as i can see, the web interface to geometadb
> allows one to filter data series by organism, though it's not clear to
> me whether this means platform organism, sample organism, both, or any. 
> it might be good to clarify this.
> 
> but the web interface won't display the organism in the output table
> (tell me i'm wrong) -- which means, a user can filter by organism, but
> cannot see the organism of data series retrieved with whatever
> criteria.  it might be useful to extend the interface.
> 
> vQ
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list