[BioC] GEOmetadb and gse organism type

Laurent Gautier laurent at cbs.dtu.dk
Fri Mar 20 12:01:30 CET 2009


Wacek Kusnierczyk wrote:
> Sean Davis wrote:
>>> if a data set can have multiple organisms in the platform_organism or
>>> sample_organism fields, what is the point in data series not having
>>> organism fields at all?  just like gds2302 has several organisms listed
>>> in the sample_organism field, and gds2349 has several organisms listed
>>> in the platform_organism field, so could data series have.  it wouldn't
>>> have to mean that all samples and all platforms involved in a data
>>> series include material from all of the species listed in the respective
>>> fields of the data series record.
>>>
>>> such an update to the database is a matter of a fairly simple statement,
>>> as far as i can see.
>>>
>>>     
>> There are a couple of reasons for not doing so:
>>
>> 1)  GEO does not provide such a field, so we do not either.  As for GDS and
>> sample_organism and platform_organism, we have parsed those directly from
>> GEO.
>>
>> 2)  Doing so violates a rule of database design (it denormalizes the
>> database).
>>
>> Of course, the power of SQL is that you can manipulate the data in your
>> local instance as you like, so you do not have to stick to our ideas of what
>> the best database design is.
>>   
> 
> i agree that having multiple comma-separated organism entries for a
> single data set or data series denormalizes the database.  but enforcing
> normalization is good only if it helps, not hinders using the data.  i
> think that having such an organism summary for a data series is
> desirable for the purpose of filtering the data by organism, and while a
> complex sql query can achieve that, it's easier and more efficient to
> check a field directly in the gse table.

I'd be on Sean's side regarding the choices made.

The habit of representing one-to-many or many-to-many associations with
comma-separated strings is an unfortunate one, I think. It will be far
more efficient (in terms of usage of computing resources) to filter
data series with a query. As for the ease with which one can
achieve it, a function performing the SQL query and returning the
result can always be written and distributed for all to use.
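
As a rough illustration of such a function, here is a minimal sketch in
Python using the standard sqlite3 module. It runs against a toy in-memory
database, not the real GEOmetadb file. The table and column names (gsm,
gse_gsm, organism_ch1) are assumed to mirror the GEOmetadb layout and
should be checked against your local copy; the function name
gse_for_organism and the accession values are made up for the example.

```python
import sqlite3


def gse_for_organism(conn, organism):
    """Return sorted GSE accessions with at least one sample of `organism`.

    Hypothetical helper: assumes a `gsm` table with an `organism_ch1`
    column and a `gse_gsm` link table, as in the toy schema below.
    """
    query = """
        SELECT DISTINCT gse_gsm.gse
        FROM gse_gsm
        JOIN gsm ON gsm.gsm = gse_gsm.gsm
        WHERE gsm.organism_ch1 = ?
        ORDER BY gse_gsm.gse
    """
    return [row[0] for row in conn.execute(query, (organism,))]


# Toy in-memory database standing in for a local GEOmetadb copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gsm (gsm TEXT PRIMARY KEY, organism_ch1 TEXT);
    CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
    INSERT INTO gsm VALUES ('GSM1', 'Homo sapiens');
    INSERT INTO gsm VALUES ('GSM2', 'Mus musculus');
    INSERT INTO gsm VALUES ('GSM3', 'Homo sapiens');
    INSERT INTO gse_gsm VALUES ('GSE1', 'GSM1');
    INSERT INTO gse_gsm VALUES ('GSE1', 'GSM2');
    INSERT INTO gse_gsm VALUES ('GSE2', 'GSM3');
""")

# GSE1 mixes two organisms, so it is returned for both of them.
human_series = gse_for_organism(conn, "Homo sapiens")
mouse_series = gse_for_organism(conn, "Mus musculus")
print(human_series, mouse_series)
```

The join keeps the organism stored once per sample, and a series that
mixes species simply matches more than one organism, with no
comma-separated field needed in the gse table.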


L.



> 
> i can of course do the preprocessing locally, but i think it could be
> useful for others to have that done in the master copy, too.
> 
> take it as a hint, not as a critique -- i appreciate your work with
> GEOmetadb.
> 
> vQ
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


