[BioC] Odds Ratio in GOstat [resolved?]

Tue Dec 12 20:45:48 CET 2006

Sean Davis wrote:
> On Tuesday 12 December 2006 12:38, Robert Gentleman wrote:
>> Hi,
>>    In principle (and I think in practice too) it is straightforward to
>> modify GOstats (or any hypergeometric testing) to handle the situation
>> where you believe that different ESTs represent different isoforms.
>>
>>    Basically you need to ensure that both the universe and the
>> interesting gene list contain one value for each entity (ESTs here) of
>> interest. The standard mapping to GO terms is via EntrezGene IDs (AFAIK),
>> so you cannot use those IDs as they are. You can, however, modify them so
>> that you get a unique name for each EST (and keep the mapping to terms).
>>    E.g., if EG X had three ESTs on my array, I might rename them X_1, X_2
>> and X_3, and make sure that these are in my universe.
>>
>>    But I guess, if I think sequence is really that important, I would
>> look at some sort of groupings other than GO.  I don't know, for example
>> how well homology would work and I suspect that no one has done a
>> comparative study. I also would worry about ISS annotations (in addition
>> to IEA ones).
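
The renaming scheme described in the quoted message above (one unique ID per
EST, all of them in the universe, each inheriting the GO terms of its parent
gene) can be sketched roughly as follows. This is a minimal illustration in
Python, not the GOstats implementation: the EST names, the genes X, Y and Z,
the single GO term, and the helper functions rename_ests,
hypergeom_upper_tail and est_level_test are all hypothetical, and the test
shown is a plain one-sided hypergeometric test.

```python
from math import comb

# Hypothetical array content: EST identifier -> Entrez Gene ID.
est_to_gene = {
    "est_a": "X", "est_b": "X", "est_c": "X",
    "est_d": "Y", "est_e": "Y",
    "est_f": "Z",
}
# Hypothetical gene-level annotation for a single GO term.
genes_with_term = {"X"}
# Hypothetical list of interesting ESTs.
interesting_ests = ["est_a", "est_b"]


def rename_ests(est_to_gene):
    """Give each EST a unique ID of the form <gene>_<n>, numbered per gene."""
    counts = {}
    renamed = {}
    for est, gene in est_to_gene.items():
        counts[gene] = counts.get(gene, 0) + 1
        renamed[est] = f"{gene}_{counts[gene]}"
    return renamed


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n items are drawn from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def est_level_test(est_to_gene, genes_with_term, interesting_ests):
    """One-sided hypergeometric test with one universe entry per EST."""
    renamed = rename_ests(est_to_gene)
    universe = set(renamed.values())
    # Each renamed EST keeps the GO mapping of the gene it came from.
    annotated = {renamed[e] for e, g in est_to_gene.items()
                 if g in genes_with_term}
    selected = {renamed[e] for e in interesting_ests}
    k = len(selected & annotated)
    return hypergeom_upper_tail(k, len(universe), len(annotated),
                                len(selected))


p_value = est_level_test(est_to_gene, genes_with_term, interesting_ests)
print(p_value)
```

With these made-up inputs the universe has six entries (X_1, X_2, X_3, Y_1,
Y_2, Z_1) rather than three genes, which is the whole point of the renaming:
the counts entering the test are per EST, while the GO mapping is still
taken from the gene.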
> 
> Aren't the GO annotations typically done against a protein, and not against a 
> gene?  I think so, but someone else with more knowledge could comment?  That 

   I don't think so; the G in GO stands for Gene (and potentially gene product).

> being the case, one could certainly blast the probe sequences against the 
> proteins to determine a better sequence-based match.  However, if one 
> searches the Gene Ontology.org database for a gene like "BRCA1", for example, 
> one actually gets several hits (representing different proteins), all with 

   I don't see that. Perhaps I am doing something wrong, but using the 
search you proposed, I find three entries for human BRCA1 (plus lots of 
entries for associated genes and other species, each showing a pattern 
similar to that described next, AFAICS), each of the form:

      BRCA1_HUMAN, BRCA1, RNF53: Breast cancer type 1 susceptibility protein
protein from Homo sapiens, data from UniProt (P38398), assigned by MGI

   All use the same UniProt ID; the only difference is who provides the 
data: MGI, PINC and UniProt in this case. If you follow the link to 
UniProt for the protein ID, you see a number of transcripts associated 
with that UniProt ID. And I see only one Entrez ID, 9606.

   So I could be missing something, but I do really think GO annotation is 
essentially at the gene level (not at the sequence level).

   best wishes
     Robert

> slightly different ontology entries.  This phenomenon is likely due to a 
> mixture of important biology and varying levels of evidence, making the 
> exercise seem questionable at best.  
> 
> Sean
> 
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org