[BioC] GEOmetadb and gse organism type

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Mar 19 21:59:50 CET 2009


Sean Davis wrote:
> On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk <
> Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>
>   
>> the question:  is there no direct way to check the sample organism for
>> data series entries in GEO?  it would be very useful to filter gse by
>> sample organism, without having to join the table with gse_gsm and gsm.
>> you already have it for gds, it would be pretty simple to add an
>> analogous field to gse.  (and likewise for platform organism).  surely,
>> a gse may include more than one organism in the platforms or samples,
>> but that's what we have in gds already:
>>
>>    sqlite GEOmetadb.sqlite '
>>        select platform_organism as po, sample_organism as so
>>            from gds
>>            where po like "%,%" or so like "%,%"'
>>
>> so i would like to be able to
>>
>>    sqlite GEOmetadb.sqlite '
>>        select gds, sample_organism as so, platform_organism as po from gds
>>          where so like "%homo%"
>>            and po like "%homo%"'
>>
>> rather than
>>
>>    sqlite GEOmetadb.sqlite '
>>        select distinct gse.gse
>>            from gse
>>                join gse_gsm on gse.gse = gse_gsm.gse
>>                join gsm on gsm.gsm = gse_gsm.gsm
>>                join gse_gpl on gse.gse = gse_gpl.gse
>>                join gpl on gpl.gpl = gse_gpl.gpl
>>            where (gsm.organism_ch1 like "%homo%"
>>                    and gsm.organism_ch2 like "%homo%")
>>                and gpl.organism like "%homo%"'
>>
>> after all, gds are backed by gse, so if gds entries can have
>> sample_organism and platform_organism, why could not gse have them too?
>>
>>     
>
> A GSE can have samples from multiple platforms and can, therefore, represent
> multiple organisms.  That is why doing the join to platform (if you are
> willing to trust that the platform was used for the specified organism) or a
> join to samples is necessary.  A GDS, on the other hand, does not have that
> characteristic and represents a set of samples from a single platform.
>   

from a single platform, but then these:

    select gds, platform_organism from gds
        where platform_organism like "%,%";
    # 7 data sets

indicate mixed-organism platforms, and it is organisms i was originally
interested in.

and, likewise,

    select gds, sample_organism from gds where sample_organism like "%,%";
    # 8 data sets

some data sets seem to contain data from more than one organism.

if a data set can list multiple organisms in its platform_organism or
sample_organism field, why should data series have no organism fields
at all?  gds2302 lists several organisms in sample_organism, and gds2349
lists several in platform_organism; a data series record could do the
same.  it would not have to mean that every sample and every platform in
the series includes material from every species listed in the
corresponding field of the series record.

such an update to the database is a matter of a fairly simple statement,
as far as i can see. 
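
a minimal sketch of what such a statement could look like, run through
python's sqlite3 module against a toy in-memory database rather than the
real GEOmetadb.sqlite.  the gse, gsm, gpl, gse_gsm and gse_gpl tables and
their columns are the ones used in the queries above; the sample rows and
the two new gse columns (sample_organism and platform_organism, named
after the gds fields) are assumptions made for illustration, not part of
the actual schema.

```python
import sqlite3

# Toy stand-in for GEOmetadb.sqlite: only the tables and columns
# mentioned in this thread, filled with made-up rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table gse (gse text primary key);
    create table gsm (gsm text primary key,
                      organism_ch1 text, organism_ch2 text);
    create table gpl (gpl text primary key, organism text);
    create table gse_gsm (gse text, gsm text);
    create table gse_gpl (gse text, gpl text);

    insert into gse values ('GSE1'), ('GSE2');
    insert into gsm values
        ('GSM1', 'Homo sapiens', null),
        ('GSM2', 'Mus musculus', 'Homo sapiens'),
        ('GSM3', 'Mus musculus', null);
    insert into gpl values
        ('GPL1', 'Homo sapiens'), ('GPL2', 'Mus musculus');
    insert into gse_gsm values
        ('GSE1', 'GSM1'), ('GSE1', 'GSM2'), ('GSE2', 'GSM3');
    insert into gse_gpl values ('GSE1', 'GPL1'), ('GSE2', 'GPL2');

    -- hypothetical new columns, analogous to the existing gds fields
    alter table gse add column sample_organism text;
    alter table gse add column platform_organism text;

    -- fill both columns with a comma-separated list of the distinct
    -- organisms found among the series' samples and platforms
    update gse set
        sample_organism = (
            select group_concat(s.organism)
                from (select gse_gsm.gse as series,
                             gsm.organism_ch1 as organism
                          from gse_gsm
                              join gsm on gsm.gsm = gse_gsm.gsm
                      union  -- union drops duplicate (series, organism) pairs
                      select gse_gsm.gse, gsm.organism_ch2
                          from gse_gsm
                              join gsm on gsm.gsm = gse_gsm.gsm) as s
                where s.series = gse.gse),
        platform_organism = (
            select group_concat(distinct gpl.organism)
                from gse_gpl
                    join gpl on gpl.gpl = gse_gpl.gpl
                where gse_gpl.gse = gse.gse);
""")

# The filter asked for above, now without any join.
rows = con.execute("""
    select gse, sample_organism, platform_organism from gse
        where sample_organism like '%homo%'
          and platform_organism like '%homo%'
        order by gse
""").fetchall()
print(rows)
```

on the toy data only GSE1 passes the filter, and its sample_organism
lists both species even though one of its samples is human-only.  the
order of names inside a group_concat result is not guaranteed, so the
field is only suitable for "like" matching, as in gds.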

best,
vQ


