[BioC] Odds Ratio in GOstat [resolved?]

Tue Dec 12 22:36:58 CET 2006

On Tuesday 12 December 2006 14:45, Robert Gentleman wrote:
> Sean Davis wrote:
> > On Tuesday 12 December 2006 12:38, Robert Gentleman wrote:
> >> Hi,
> >>    In principle (and I think in practice too) it is straightforward to
> >> modify GOstats (or any hypergeometric testing) to handle the situation
> >> where you believe that different ESTs represent different isoforms.
> >>
> >>    Basically you need to ensure that both the universe and the
> >> interesting gene list contain one value for each entity (ESTs here) of
> >> interest. Standard mapping to GO terms is via EntrezGene IDs (AFAIK), so
> >> you cannot use them as-is; you can, however, modify them so that you get
> >> unique names for each EST (and keep the mapping to terms).
> >>    E.g., if EG X had three ESTs on my array, I might rename them X_1, X_2
> >> and X_3, and make sure that these are in my universe.
> >>
> >>    But I guess, if I think sequence is really that important, I would
> >> look at some sort of groupings other than GO.  I don't know, for example
> >> how well homology would work and I suspect that no one has done a
> >> comparative study. I also would worry about ISS annotations (in addition
> >> to IEA ones).
> >
> > Aren't the GO annotations typically done against a protein, and not
> > against a gene?  I think so, but perhaps someone with more knowledge
> > could comment.  That
>
>    I don't think so; the G in GO stands for Gene (and potentially gene
> product).
>
> > being the case, one could certainly blast the probe sequences against the
> > proteins to determine a better sequence-based match.  However, if one
> > searches the Gene Ontology.org database for a gene like "BRCA1", for
> > example, one actually gets several hits (representing different
> > proteins), all with
>
>    I don't see that, perhaps I am doing something wrong, but using the
> search you proposed, I find three entries for human BRCA1 (lots of other
> entries for associated genes, and other species, but each shows a
> pattern similar to that described next, AFAICS) each of the form:
>
>       BRCA1_HUMAN, BRCA1, RNF53: Breast cancer type 1 susceptibility
> protein protein from Homo sapiens, data from UniProt (P38398), assigned by
> MGI
>
>    All use the same UniProt ID; the differences are in who provides the
> data: MGI, PINC and UniProt in this case. If you follow the link to
> UniProt for the protein ID, you see a number of transcripts associated
> with that UniProt ID. And I see only one Entrez ID, 9606.
>
>    So I could be missing something, but I do really think it is
> essentially at the gene level (not at the sequence level).

I stand corrected.  Looks like you are right.

Sean
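
For anyone finding this thread later, here is a minimal sketch of the
renaming scheme Robert describes above (X_1, X_2, X_3), followed by a plain
hypergeometric test on the EST-level universe. It is not GOstats code: the
function names, the toy gene names and the "GO:A"/"GO:B" term labels are all
hypothetical, and it assumes each EST simply inherits the GO terms of its
parent Entrez Gene ID.

```python
from math import comb


def expand_to_ests(gene_to_terms, gene_to_est_count):
    """Give every EST a unique ID (X_1, X_2, ...) that keeps its gene's GO terms."""
    est_to_terms = {}
    for gene, n_ests in gene_to_est_count.items():
        for i in range(1, n_ests + 1):
            est_to_terms[f"{gene}_{i}"] = set(gene_to_terms.get(gene, ()))
    return est_to_terms


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N, K of them annotated."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def term_pvalue(term, selected, est_to_terms):
    """Over-representation p-value for one term, with ESTs as the unit of testing."""
    universe = set(est_to_terms)
    selected = set(selected) & universe  # the interesting list must lie inside the universe
    K = sum(1 for est in universe if term in est_to_terms[est])
    k = sum(1 for est in selected if term in est_to_terms[est])
    return hypergeom_upper_tail(k, K, len(selected), len(universe))


# Toy data (hypothetical): gene X has three ESTs on the array, Y two, Z one.
gene_to_terms = {"X": {"GO:A"}, "Y": {"GO:B"}, "Z": {"GO:A"}}
gene_to_est_count = {"X": 3, "Y": 2, "Z": 1}

est_to_terms = expand_to_ests(gene_to_terms, gene_to_est_count)
p = term_pvalue("GO:A", ["X_1", "X_2"], est_to_terms)
print(sorted(est_to_terms), p)
```

Note that counting ESTs rather than genes changes both the universe size and
the per-term counts, so the resulting p-values are only meaningful if the ESTs
really are distinct entities, which is the assumption under discussion in this
thread.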