[BioC] GEOmetadb and gse organism type

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Fri Mar 20 15:29:21 CET 2009

Jack Zhu wrote:
> Hi Wacek,
> Thanks for your observation and suggestion. Regarding adding a organism
> field in the gse table, actually it was considered when the package was
> designed. Since one gse could have multiple organisms and that would end up
> with delimited strings. We did not do it because: 1. delimited strings in a
> field will need to be parsed when you want to do something like counting or
> convert to other types; 

you can always use complex queries to count the original data

> 2. Keeping the things in different tables is the
> purpose of a relational database; 

yes, but the most important purpose of a database is to serve its
users;  databases do not exist just to acknowledge the relational theory ;)

> 3. GEO does not have this field and we
> intend to keep data as the same as in the GEO as possible.

that's reasonable.  however, geo is already incoherent here, since some
entries in gds.platform_organism and gds.sample_organism are already
denormalized, as are some entries in the organism fields in the gsm and
gpl tables.  if you wanted to insist on 1. and 2. above, you'd have to
parse and separate these.  and actually, to avoid redundance, the
database would best have a separate column with organism names, linked
to the other tables via an organism is (possibly the NCBI taxon id).

since there is already a little mess here (not your fault) in this
respect, i don't think adding organism columns to gse would make things
worse wrt. 1 and 2 above.  but this is only my opinion, i can obviously
have things my way locally.

> The join table sql command does not look very pretty:

it could be simplified a bit using natural joins, but you see that the
simple task of filtering gse by organism does add some ugly complexity
with the current shape of things.

>              sqlite GEOmetadb.sqlite '
>>>         select distinct gse.gse, gsm.
>>>             from gse
>>>                 join gse_gsm on gse.gse = gse_gsm.gse
>>>                 join gsm on gsm.gsm = gse_gsm.gsm
>>>                 join gse_gpl on gse.gse = gse_gpl.gse
>>>                 join gpl on gpl.gpl = gse_gpl.gpl
>>>             where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2
>>> like "%homo%")
>                 and gpl.organism like "%homo%"'
> But we could do this by steps (in R):

yes, and you can do things by steps with sql as well.  but r adds some
performance penalty, without really simplifying the matters.

thanks for your efforts.

More information about the Bioconductor mailing list