[BioC] GEOmetadb and gse organism type

Mon Mar 23 14:17:17 CET 2009

Hi Wacek,

Please see my answer to the web interface issue below.  Please let me know
if you have any questions.

Jack

On 3/20/09 3:42 PM, "Wacek Kusnierczyk"
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:

> Laurent Gautier wrote:
>> Wacek Kusnierczyk wrote:
>>> 
>>> i agree with that having multiple comma-separated organism entries for a
>>> single data set or data series denormalizes the database.  but enforcing
>>> normalization is good only if it helps, not hinders using the data.  i
>>> think that having such an organism summary for a data series is
>>> desirable for the purpose of filtering the data by organism, and while a
>>> complex sql query can achieve that, it's easier and more efficient to
>>> check a field directly in the gse table.
>> 
>> I'd be on Sean's side regarding the choices made.
>> 
>> The habit to represent one-to-many or many-to-many associations with
>> comma-separated strings is an unfortunate one, I think.
> 
> yes, but only as long as it is disturbing rather than helpful.
> 
>> It will be far more efficient (in terms of usage of computing
>> ressources) to filter data series with a query.
> 
> not sure what you mean by 'far more efficient'.  where the data does not
> change too often, denormalizing preprocessing often leads to
> considerable speedup of subsequent queries -- it all depends on the
> shape of data.  here, you have some 250000 rows in the gsm table, but
> only a few samples per data series, which means that an on-the-fly join
> may actually be slower than using cached result from a previous join,
> stored as a comma-separated multiitem field.
> 
> in an earlier post, i provided an sql statement that creates an
> additional table, xgse (for 'extended gse').  the following are timings
> for retrieving all data series that involve at least one human sample:
> 
>     time sqlite GEOmetadb.sqlite '
>        select distinct gse
>           from gse
>           natural join gse_gsm
>           natural join gsm
>           where organism_ch1 like "%homo%" or organism_ch2 like "%homo%"
>        ' > /dev/null
>     # 3.353 s
> 
>     time sqlite GEOmetadb.sqlite '
>        select distinct gse
>           from xgse
>           where sample_organism_ch1 like "%homo%" or sample_organism_ch2
> "%homo%"
>        ' > /dev/null
>     # 0.031
> 
> the 'far more efficient' query is actually 2 orders of magnitude
> slower.  (or did you mean something different?)
> 
>> When it comes to the ease with which one can can achieve it, a
>> function performing the SQL query and returning the result can always
>> be written and distributed for all to use.
> 
> 
> sure.
> 
> 
> another issue:  as far as i can see, the web interface to geometadb
> allows one to filter data series by organism, though it's not clear to
> me whether this means platform organism, sample organism, both, or any.
> it might be good to clarify this.

As you can see, on the web interface of the GEOmetadb application (
http://gbnci.abcc.ncifcrf.gov/geo/geo_search.php ) you can search for anyone
of the three GEO entities (GPL, GSE and GSM) with any combination of terms
from any of the three.  The design provides an effective way, in my opinion,
to do crossing GEO type search, which usually needs a complex SQL join
command to achieve from command line. It allows you search GSE with an
organism name.  It also includes MySQL feature, Fulltext serach, in it which
allows Boolean mode.   Behind the scene is a complex SQL join expression,
the same as Wacek wrote.

> 
> but the web interface won't display the organism in the output table
> (tell me i'm wrong) -- which means, a user can filter by organism, but
> cannot see the organism of data series retrieved with whatever
> criteria.  it might be useful to extend the interface.
> 
> vQ
> 
>