[BioC] GEOmetadb and gse organism type

Fri Mar 20 20:42:44 CET 2009

Laurent Gautier wrote:
> Wacek Kusnierczyk wrote:
>>
>> i agree with that having multiple comma-separated organism entries for a
>> single data set or data series denormalizes the database.  but enforcing
>> normalization is good only if it helps, not hinders using the data.  i
>> think that having such an organism summary for a data series is
>> desirable for the purpose of filtering the data by organism, and while a
>> complex sql query can achieve that, it's easier and more efficient to
>> check a field directly in the gse table.
>
> I'd be on Sean's side regarding the choices made.
>
> The habit to represent one-to-many or many-to-many associations with
> comma-separated strings is an unfortunate one, I think. 

yes, but only as long as it is disturbing rather than helpful.

> It will be far more efficient (in terms of usage of computing
> ressources) to filter data series with a query. 

not sure what you mean by 'far more efficient'.  where the data does not
change too often, denormalizing preprocessing often leads to
considerable speedup of subsequent queries -- it all depends on the
shape of data.  here, you have some 250000 rows in the gsm table, but
only a few samples per data series, which means that an on-the-fly join
may actually be slower than using cached result from a previous join,
stored as a comma-separated multiitem field.

in an earlier post, i provided an sql statement that creates an
additional table, xgse (for 'extended gse').  the following are timings
for retrieving all data series that involve at least one human sample:

    time sqlite GEOmetadb.sqlite '
       select distinct gse
          from gse
          natural join gse_gsm
          natural join gsm
          where organism_ch1 like "%homo%" or organism_ch2 like "%homo%"
       ' > /dev/null
    # 3.353 s

    time sqlite GEOmetadb.sqlite '
       select distinct gse
          from xgse
          where sample_organism_ch1 like "%homo%" or sample_organism_ch2
"%homo%"
       ' > /dev/null
    # 0.031

the 'far more efficient' query is actually 2 orders of magnitude
slower.  (or did you mean something different?)

> When it comes to the ease with which one can can achieve it, a
> function performing the SQL query and returning the result can always
> be written and distributed for all to use.

sure.

another issue:  as far as i can see, the web interface to geometadb
allows one to filter data series by organism, though it's not clear to
me whether this means platform organism, sample organism, both, or any. 
it might be good to clarify this.

but the web interface won't display the organism in the output table
(tell me i'm wrong) -- which means, a user can filter by organism, but
cannot see the organism of data series retrieved with whatever
criteria.  it might be useful to extend the interface.

vQ