[BioC] GEOmetadb and gse organism type

Fri Mar 20 16:07:43 CET 2009

Sean Davis wrote:
> On Thu, Mar 19, 2009 at 5:30 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>   
>> On Thu, Mar 19, 2009 at 4:59 PM, Wacek Kusnierczyk <
>> Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>>
>>     
>>> if a data set can have multiple organisms in the platform_organism or
>>> sample_organism fields, what is the point in data series not having
>>> organism fields at all?  just like gds2302 has several organisms listed
>>> in the sample_organism field, and gds2349 has several organisms listed
>>> in the platform_organism field, so could data series have.  it wouldn't
>>> have to mean that all samples and all platforms involved in a data
>>> series include material from all of the species listed in the respective
>>> fields of the data series record.
>>>
>>> such an update to the database is a matter of a fairly simple statement,
>>> as far as i can see.
>>>
>>>       
>> There are a couple of reasons for not doing so:
>>
>> 1)  GEO does not provide such a field, so we do not either.  As for GDS and
>> sample_organism and platform_organism, we have parsed those directly from
>> GEO.
>>
>> 2)  Doing so violates a rule of database design (it denormalizes the
>> database).
>>
>> Of course, the power of SQL is that you can manipulate the data in your
>> local instance as you like, so you do not have to stick to our ideas of what
>> the best database design is.
>>
>>     
>
> Just to follow up, something like this:
>
> select gse.gse,platform_organism,sample_organism
>   from
>   (select gse.gse,group_concat(gpl.organism,',') as platform_organism
>       from gse
>          join gse_gpl gg on gg.gse=gse.gse
>          join gpl on gg.gpl=gpl.gpl
>       group by gse.gse) as o1
>   join gse on o1.gse=gse.gse (select gse.gse,group_concat(gsm.organism_ch1,',') || ',' ||
> group_concat(gsm.organism_ch2,',') as sample_organism
>       from gse join gse_gsm gs on gs.gse=gse.gse
>       join gsm on gsm.gsm=gs.gsm group by gse.gse) as o2 on gse.gse=o2.gse;
>
> could be used to create a table or view similar to what you are asking for.
>   

this will give redundant replicates for each series, which is not
necessarily bad, but could be avoided with a few modifications.  also,
i'm not sure if you're aware that some series do not have corresponding
samples:

    select count(*) from (select distinct gsm from gse_gsm);
    -- 288939
    select count(*) from (select distinct gsm from gse_gsm natural join
gsm);
    -- 268106

which looks like some 20000 entries in gse_gsm do not have a counterpart
in gsm (are these typos in gse_gsm?)

and likewise for platforms:

    select count(*) from (select distinct gpl from gpl);
    -- 5322
    select count(*) from (select distinct gpl from gse_gpl natural join
gpl);
    -- 4505

which looks like some 800 entries in gse_gpl do not have a counterpart
in gpl.

if anyone is interested, the following is a statement that will, as far
as i can see, construct a table of the shape i need -- with each gse
entry being a comination of entries from gse and three additional
fields, cumulating the organisms from the paltforms and samples involved
in the series:

create table xgse as
select
        gse.*, p.porg as platform_organism,
        s1.sorg1 as sample_organism_ch1,
        s2.sorg2 as sample_organism_ch2
    from gse
    natural join (
        select gse, group_concat(porg) as porg
            from (
                select distinct gse.gse as gse, gpl.organism as porg
                    from gse
                    left join gse_gpl on gse.gse = gse_gpl.gse
                    left join gpl on gse_gpl.gpl = gpl.gpl)
            group by gse) as p
    natural join (
        select gse, group_concat(sorg) as sorg1
            from (
                select distinct gse.gse as gse, gsm.organism_ch1 as sorg
                    from gse
                    left join gse_gsm on gse.gse = gse_gsm.gse
                    left join gsm on gse_gsm.gsm = gsm.gsm)
            group by gse) as s1
    natural join (
        select gse, group_concat(sorg) as sorg2
            from (
                select distinct gse.gse as gse, gsm.organism_ch2 as sorg
                    from gse
                    left join gse_gsm on gse.gse = gse_gsm.gse
                    left join gsm on gse_gsm.gsm = gsm.gsm)
            group by gse) as s2;

and now it's easy to query for series on a platform or with a sample
from a specific organism, with no joins.

vQ