[BioC] GEOmetadb and gse organism type

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Fri Mar 20 11:42:23 CET 2009


Sean Davis wrote:
>
>> if a data set can have multiple organisms in the platform_organism or
>> sample_organism fields, what is the point in data series not having
>> organism fields at all?  just like gds2302 has several organisms listed
>> in the sample_organism field, and gds2349 has several organisms listed
>> in the platform_organism field, so could data series have.  it wouldn't
>> have to mean that all samples and all platforms involved in a data
>> series include material from all of the species listed in the respective
>> fields of the data series record.
>>
>> such an update to the database is a matter of a fairly simple statement,
>> as far as i can see.
>>
>>     
>
> There are a couple of reasons for not doing so:
>
> 1)  GEO does not provide such a field, so we do not either.  As for GDS and
> sample_organism and platform_organism, we have parsed those directly from
> GEO.
>
> 2)  Doing so violates a rule of database design (it denormalizes the
> database).
>
> Of course, the power of SQL is that you can manipulate the data in your
> local instance as you like, so you do not have to stick to our ideas of what
> the best database design is.
>   

i agree that having multiple comma-separated organism entries for a
single data set or data series denormalizes the database.  but enforcing
normalization is good only if it helps rather than hinders use of the data.  i
think that having such an organism summary for a data series is
desirable for the purpose of filtering the data by organism, and while a
complex sql query can achieve that, it's easier and more efficient to
check a field directly in the gse table. 
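
to illustrate the kind of query meant here, below is a minimal sketch of
filtering data series by organism through a join.  it runs against a toy
in-memory sqlite database, not the real GEOmetadb file; the table and
column names (gse, gsm, gse_gsm, organism_ch1) and the helper
series_for_organism are assumptions made for the example and may not
match the actual schema.

```python
import sqlite3

# Toy stand-in for a local GEOmetadb SQLite file. The schema below is an
# assumption for illustration only, not the actual GEOmetadb schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gse (gse TEXT PRIMARY KEY, title TEXT);
CREATE TABLE gsm (gsm TEXT PRIMARY KEY, organism_ch1 TEXT);
CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
INSERT INTO gse VALUES ('GSE1', 'series one'), ('GSE2', 'series two');
INSERT INTO gsm VALUES ('GSM1', 'Homo sapiens'),
                       ('GSM2', 'Mus musculus'),
                       ('GSM3', 'Mus musculus');
INSERT INTO gse_gsm VALUES ('GSE1', 'GSM1'), ('GSE1', 'GSM2'),
                           ('GSE2', 'GSM3');
""")


def series_for_organism(con, organism):
    """Return ids of series with at least one sample from `organism`."""
    rows = con.execute(
        """
        SELECT DISTINCT gse.gse
        FROM gse
        JOIN gse_gsm ON gse_gsm.gse = gse.gse
        JOIN gsm ON gsm.gsm = gse_gsm.gsm
        WHERE gsm.organism_ch1 = ?
        ORDER BY gse.gse
        """,
        (organism,),
    ).fetchall()
    return [row[0] for row in rows]


human_series = series_for_organism(con, "Homo sapiens")
mouse_series = series_for_organism(con, "Mus musculus")
```

in this toy data, GSE1 mixes human and mouse samples, so it is returned
for both organisms; every such filter has to go through the two joins.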

i can of course do the preprocessing locally, but i think it could be
useful for others to have that done in the master copy, too.
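
the local preprocessing could look roughly like the sketch below, which
adds a comma-separated organism summary column to the series table and
fills it from the samples.  it again uses a toy in-memory sqlite
database; the schema (gse, gsm, gse_gsm, organism_ch1) and the new column
name sample_organism on gse are assumptions for the example, not part of
the distributed GEOmetadb.

```python
import sqlite3

# Toy stand-in for a local GEOmetadb SQLite file. The schema and the new
# gse.sample_organism column are assumptions for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gse (gse TEXT PRIMARY KEY, title TEXT);
CREATE TABLE gsm (gsm TEXT PRIMARY KEY, organism_ch1 TEXT);
CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
INSERT INTO gse VALUES ('GSE1', 'series one'), ('GSE2', 'series two');
INSERT INTO gsm VALUES ('GSM1', 'Homo sapiens'),
                       ('GSM2', 'Mus musculus'),
                       ('GSM3', 'Mus musculus');
INSERT INTO gse_gsm VALUES ('GSE1', 'GSM1'), ('GSE1', 'GSM2'),
                           ('GSE2', 'GSM3');
""")

# Add the denormalized summary column to the local copy.
con.execute("ALTER TABLE gse ADD COLUMN sample_organism TEXT")

# Fill it with the distinct organisms of each series' samples.
# One-argument group_concat joins values with ','; their order is not
# guaranteed.
con.execute("""
    UPDATE gse SET sample_organism = (
        SELECT group_concat(DISTINCT gsm.organism_ch1)
        FROM gse_gsm
        JOIN gsm ON gsm.gsm = gse_gsm.gsm
        WHERE gse_gsm.gse = gse.gse
    )
""")
con.commit()

# Filtering by organism is now a check on a single field of gse.
human_series = [row[0] for row in con.execute(
    "SELECT gse FROM gse WHERE sample_organism LIKE ? ORDER BY gse",
    ("%Homo sapiens%",),
)]
summary = dict(con.execute("SELECT gse, sample_organism FROM gse"))
```

the summary has to be recomputed whenever the sample tables change, which
is the usual cost of denormalizing.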

take it as a hint, not as a critique -- i appreciate your work with
GEOmetadb.

vQ


