[BioC] GOstats::geneIdUniverse() and its relation to organism annotation db's

Marc Carlson mcarlson at fhcrc.org
Wed Sep 8 19:36:25 CEST 2010


Hi Steve,

I can verify that GOstats uses "GO2ALL" mappings to do its
calculations.  Speeding this sort of comparison up is actually a lot of
why these mappings are actually around.
 

  Marc



On 09/07/2010 11:15 AM, Steve Lianoglou wrote:
> Hi Robert,
>
> In my hunt to find "the bug" I didn't stumble upon the difference
> between org.Sc.sgdGO and org.Sc.sgdGO2ALLORFS (I don't think I
> realized the latter one even existed).
>
> I'm guessing the problem I thought I saw was due to my oversight and
> is exactly as you say. Thanks for the thorough explanation and
> references.
>
> Cheers,
> -steve
>
> On Tue, Sep 7, 2010 at 9:16 AM, Robert M. Flight <rflight79 at gmail.com> wrote:
>   
>> Hi Steve,
>>
>> I wonder if you are running into the problem that generally when you
>> query for the GO terms you get only those GO terms that the ORF is
>> directly annotated with (this would correspond to the actual
>> annotations on the GO website, for instance), however when calculating
>> enrichment, not only the directly annotated terms but also the parent
>> terms (less specific) are included.
>>
>> I'm thinking of the difference between "org.Sc.sgdGO", which returns
>> "org.Sc.sgdGO is an R object that provides mappings between ORF
>> identifiers and the GO identifiers that they are directly associated
>> with. This mapping and its reverse mapping do NOT associate the child
>> terms from the GO ontology with the gene. Only the directly evidenced
>> terms are represented here.", and a mapping that returns also the
>> other GO terms in the directed acyclic graph (up the tree
>> essentially). This would be the reverse of "org.Sc.sgdGO2ALLORFS".
>>
>> For example, any gene directly annotated to "GO:0007155 : cell
>> adhesion", will also by virtue of the DAG be indirectly annotated with
>> "GO:0022610 : biological adhesion", whether or not it is directly
>> annotated with that or not. When GO enrichment using GOHyperGTest is
>> calculated, all of the indirectly annotated terms are considered as
>> well. (see this link:
>> http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi?action=plus_node&target=GO:0022610&open_1=GO:0008150,all,amigo1283864733)
>>
>> This seems to be the cause of your disparity below. You might want to
>> read a bit more about the GO and how enrichment calculations are done.
>> A good paper describing the nuts and bolts of most packages is Gavin
>> Sherlocks paper on GO::TermFinder
>> (http://www.ncbi.nlm.nih.gov/pubmed/15297299), which is the basis for
>> most other enrichment tools, including GOStats.
>>
>> If I do:
>>
>> crud <- mget("GO:0007059", envir=org.Sc.sgdGO2ALLORFS)
>> crud2 <- unlist(crud)
>> grep("YKL049C", crud2)
>>
>> I get 135 as the result.
>>
>> Cheers,
>>
>> -Robert
>>
>> Robert M. Flight, Ph.D.
>> Bioinformatics and Biomedical Computing Laboratory
>> University of Louisville
>> Louisville, KY
>>
>> PH 502-852-0467
>> EM robert.flight at louisville.edu
>> EM rflight79 at gmail.com
>>
>> Williams and Holland's Law:
>>        If enough data is collected, anything may be proven by
>> statistical methods.
>>
>>
>>
>> On Mon, Sep 6, 2010 at 05:21, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>     
>>> Hi,
>>>
>>> I've been trying to do some non-run-of-the-mill gene ontology analysis
>>> and have been messing around with the innards of the results we get
>>> from GOstats::hyperGTest.
>>>
>>> As a result, I've found some peculiarities in the GO annotations I'm
>>> retrieving from the hyperGTest and how they relate to the GO
>>> annotations in my organism's (saccharomyces cerevisiae) annotation
>>> database.
>>>
>>> In particular: for some enriched GO term X, I'm finding genes listed
>>> in the geneIdUniverse of X that do not have X as a member of their
>>> organism.db.GO map, and I'm not sure how that could be. I'm even
>>> setting condition=FALSE to my hyperGTest to make it as "vanilla" as
>>> possible (but as far as I understand the conditional test, this
>>> wouldn't effect what I'm seeing anyway since genes are removed from GO
>>> terms, and not added).
>>>
>>> I have a specific example below. You'll find that for the first
>>> significant GOID found (GO:0007059), gene "YKL049C" is listed as a
>>> member of this GOID's geneIdUniverse as it is returned from the
>>> hyperGTest, but GO:0007059 isn't listed in my annotation db GO mapping
>>> (org.Sc.sgdGO) for this gene, nor is "YKL049C" found in its GO2ORF
>>> map.
>>>
>>> === Example ===
>>>
>>> library(GOstats)
>>> library(org.Sc.sgd.db)
>>> library(annotate)
>>>
>>> orfs <- c(
>>>  "YGR084C", "YDR409W",   "YFR017C",   "YDR342C",   "YBR152W",
>>>  "YIL014W", "YGR134W",   "YDL109C",   "YDL234C",   "YDR524C",
>>>  "YBR002C", "YGR063C",   "YDR034C-A", "YIR011C",
>>>  "YHR175W", "YHR109W",   "YGR207C",   "YCR023C",   "YER039C-A",
>>>  "YGL064C", "YGR249W",   "YHR172W",   "YGL226W",   "YAL064W",
>>>  "YDR309C", "YDR473C",   "YAL058W",   "YHL030W",   "YKL101W",
>>>  "YJL122W", "YBR121C",   "YIL018W",   "YDR443C",   "YBR069C",
>>>  "YBR011C", "YHR009C",   "YGL166W",   "YIL103W",   "YDR206W",
>>>  "YBL031W", "YBR162W-A", "YFL018C",   "YBR214W",   "YIL076W",
>>>  "YBR291C", "YKL049C",   "YIL149C",   "YDR034W-B", "YDR395W")
>>>
>>> orf2go <- getAnnMap('GO', 'org.Sc.sgd')
>>> go2orf <- getAnnMap('GO2ORF', 'org.Sc.sgd')
>>> all.orfs <- keys(orf2go)
>>>
>>> ## Do GOstats test to get significant GO ontologies
>>> params <- new("GOHyperGParams", geneIds=orfs,
>>>              universeGeneIds=all.orfs, annotation='org.Sc.sgd',
>>>              ontology='BP', pvalueCutoff=0.05,
>>>              conditional=FALSE, testDirection='over')
>>> GO <- hyperGTest(params)
>>>
>>> ## Look at the genes annotated in the universe for each GOID and compare
>>> ## with annotations for each gene in org.Sc.sgdGO
>>> universes <- geneIdUniverse(GO)
>>> uid <- names(universes)[1] ## GO:0007059
>>> uid.members <- universes[[uid]]
>>> "YKL049C" %in% uid.members ## TRUE
>>>
>>> ## Is the gene listed in GOID's (uid) universe annotated with that GOID?
>>> ## Given previous result, this *should* be TRUE
>>> go.ids <- orf2go[["YKL049C"]]
>>> uid %in% names(go.ids) ## FALSE
>>>
>>>
>>> ## Does that GOID have this gene as its member?
>>> ## This also should be TRUE
>>> check.orfs <- go2orf[[uid]]
>>> "YKL049C" %in% check.orfs ## FALSE
>>>
>>> ======================================
>>>
>>> Since `"YKL049C" %in% uid.members` is TRUE, I would expect the go term
>>> `uid` (GO:0007059) to be in both `names(go.ids)` and `check.orfs`, but
>>> it's not.
>>>
>>> Can someone shed some light on this for me?
>>>
>>> sessionInfo()
>>> R version 2.11.1 (2010-05-31)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>  [1] GO.db_2.4.1          annotate_1.26.1      org.Sc.sgd.db_2.4.1
>>> GOstats_2.14.0
>>>  [5] RSQLite_0.9-2        DBI_0.2-5            graph_1.26.0
>>> Category_2.14.0
>>>  [9] AnnotationDbi_1.10.2 Biobase_2.8.0        devtools_0.1
>>> stringr_0.4
>>> [13] roxygen_0.1-2        profr_0.1.1          digest_0.4.2
>>> testthat_0.3
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] GSEABase_1.10.0   RBGL_1.24.0       XML_3.1-1
>>> evaluate_0.3      genefilter_1.30.0
>>>  [6] plyr_1.1          splines_2.11.1    survival_2.35-8
>>> tools_2.11.1      xtable_1.5-6
>>>
>>>
>>> Thanks,
>>> -steve
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>       
>>     
>
>
>



More information about the Bioconductor mailing list