[BioC] Odds Ratio in GOstat [resolved?]

Tue Dec 12 11:27:26 CET 2006

Dear Naomi,

if I understand you correctly, the problem is that you are looking at 
the classifications of the best hits in the sequenced organism rather 
than at the classes of your actual ESTs.

In this case, the route I usually take is to transfer the ontological 
terms onto the ESTs (or better unigenes) and use these for testing. (I 
use neither GO nor GOstats though).
From a biological point of view I think this also makes sense. Suppose 
your sequenced species has one isoform of a particular enzyme (B), which 
has already expanded to two isoforms (B1 and B2) in your non-sequenced 
organism, and these are not yet completely subfunctionalized. In that 
case your non-sequenced organism really does have 
GO:molecular_function:whatever twice.
Also, I am more interested in the distribution of genes in the organism 
I am looking at than in an already sequenced one. As an extreme case, if 
you inferred GO terms by blasting plants against vertebrates, you would 
run into the problem of the greatly expanded gene families in plants 
(which are real).

So to answer your question I would say 3 out of 5.
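
To make the two ways of counting concrete, here is a small sketch in 
Python using the toy example from Naomi's message quoted below. The 
identifiers, the placeholder term "term_X" and both helper functions are 
made up for illustration; this is not what GOstats does internally, and 
the "at least one upregulated EST" rule for collapsing onto reference 
genes is just one possible assumption.

```python
# Toy data taken from the example in the quoted message; the identifiers
# and the single placeholder term are illustrative assumptions.
best_hit = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C": "C"}
upregulated = {"A1", "A2", "B1"}
reference_terms = {"A": {"term_X"}, "B": {"term_X"}, "C": {"term_X"}}


def count_per_est(term):
    """Transfer terms onto the ESTs and count every EST separately."""
    annotated = [est for est, gene in best_hit.items()
                 if term in reference_terms.get(gene, set())]
    hits = [est for est in annotated if est in upregulated]
    return len(hits), len(annotated)


def count_per_reference_gene(term):
    """Collapse ESTs onto their best hit; a gene counts as upregulated
    if at least one of its ESTs is upregulated (one possible rule)."""
    annotated = {gene for gene in best_hit.values()
                 if term in reference_terms.get(gene, set())}
    hits = {best_hit[est] for est in upregulated
            if best_hit[est] in annotated}
    return len(hits), len(annotated)


print(count_per_est("term_X"))             # (3, 5)
print(count_per_reference_gene("term_X"))  # (2, 3)
```

The first count keeps B1 and B2 as separate members of the category, 
which is what I argue for above; the second one hides the expansion of B 
in the non-sequenced organism.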

However, it is not trivial to transfer ontological terms, especially if 
the original annotations were themselves "inferred from electronic 
annotation". Also, if you are not sure about the sequence clustering 
(e.g. ESTs B1 and B2 should really represent one unigene), things start 
to get shaky.
But there are annotation packages such as Interpro2GO, blast2go, and so 
on.
So, to sum up, I think you should rely on good old sequence-based 
bioinformatics.

Just my 5 cents though....

Cheers,
Björn

Naomi Altman wrote:
> The duplicate genes problem is an interesting one.  The reason the 
> selected gene list includes duplicates is because it comes from 
> blasting an EST set from an unsequenced species against a sequenced 
> species.  The duplicates are supposed to be the nearest homolog of 
> the EST but to represent multiple genes.  How to handle this for GO 
> enrichment is an interesting question.
> 
> e.g.  Annotation has genes A B C.
> We observe that matches A1 A2 and B1 are upregulated, but  B2 and C 
> are not.  Should we say that 3 out of 5 are upregulated, or 2 out of 3?
> 
> --Naomi
> 
> At 07:43 PM 12/11/2006, Seth Falcon wrote:
>> The selected gene list contained duplicate ids.  I'm pretty sure this
>> is the problem.  The Category + GOstats code should detect such input
>> errors and give a sensible error message.  I will add such checking
>> very soon.
>>
>> + seth
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111

-- 
-+-+-+-+-+-+-+-+-+-+-+-
Björn Usadel, PhD

Max Planck Institute of Molecular Plant Physiology
System Regulation Group

Am Mühlenberg 1
D-14476 Golm
Germany

Tel    (+49 331) 567-8114

Email  usadel at mpimp-golm.mpg.de
WWW    mapman.mpimp-golm.mpg.de