[BioC] GOstats::geneIdUniverse() and its relation to organism annotation db's

Robert M. Flight rflight79 at gmail.com
Tue Sep 7 15:16:04 CEST 2010


Hi Steve,

I wonder if you are running into the problem that generally when you
query for the GO terms you get only those GO terms that the ORF is
directly annotated with (this would correspond to the actual
annotations on the GO website, for instance), however when calculating
enrichment, not only the directly annotated terms but also the parent
terms (less specific) are included.

I'm thinking of the difference between "org.Sc.sgdGO", which returns
"org.Sc.sgdGO is an R object that provides mappings between ORF
identifiers and the GO identifiers that they are directly associated
with. This mapping and its reverse mapping do NOT associate the child
terms from the GO ontology with the gene. Only the directly evidenced
terms are represented here.", and a mapping that returns also the
other GO terms in the directed acyclic graph (up the tree
essentially). This would be the reverse of "org.Sc.sgdGO2ALLORFS".

For example, any gene directly annotated to "GO:0007155 : cell
adhesion", will also by virtue of the DAG be indirectly annotated with
"GO:0022610 : biological adhesion", whether or not it is directly
annotated with that or not. When GO enrichment using GOHyperGTest is
calculated, all of the indirectly annotated terms are considered as
well. (see this link:
http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi?action=plus_node&target=GO:0022610&open_1=GO:0008150,all,amigo1283864733)

This seems to be the cause of your disparity below. You might want to
read a bit more about the GO and how enrichment calculations are done.
A good paper describing the nuts and bolts of most packages is Gavin
Sherlocks paper on GO::TermFinder
(http://www.ncbi.nlm.nih.gov/pubmed/15297299), which is the basis for
most other enrichment tools, including GOStats.

If I do:

crud <- mget("GO:0007059", envir=org.Sc.sgdGO2ALLORFS)
crud2 <- unlist(crud)
grep("YKL049C", crud2)

I get 135 as the result.

Cheers,

-Robert

Robert M. Flight, Ph.D.
Bioinformatics and Biomedical Computing Laboratory
University of Louisville
Louisville, KY

PH 502-852-0467
EM robert.flight at louisville.edu
EM rflight79 at gmail.com

Williams and Holland's Law:
       If enough data is collected, anything may be proven by
statistical methods.



On Mon, Sep 6, 2010 at 05:21, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> I've been trying to do some non-run-of-the-mill gene ontology analysis
> and have been messing around with the innards of the results we get
> from GOstats::hyperGTest.
>
> As a result, I've found some peculiarities in the GO annotations I'm
> retrieving from the hyperGTest and how they relate to the GO
> annotations in my organism's (saccharomyces cerevisiae) annotation
> database.
>
> In particular: for some enriched GO term X, I'm finding genes listed
> in the geneIdUniverse of X that do not have X as a member of their
> organism.db.GO map, and I'm not sure how that could be. I'm even
> setting condition=FALSE to my hyperGTest to make it as "vanilla" as
> possible (but as far as I understand the conditional test, this
> wouldn't effect what I'm seeing anyway since genes are removed from GO
> terms, and not added).
>
> I have a specific example below. You'll find that for the first
> significant GOID found (GO:0007059), gene "YKL049C" is listed as a
> member of this GOID's geneIdUniverse as it is returned from the
> hyperGTest, but GO:0007059 isn't listed in my annotation db GO mapping
> (org.Sc.sgdGO) for this gene, nor is "YKL049C" found in its GO2ORF
> map.
>
> === Example ===
>
> library(GOstats)
> library(org.Sc.sgd.db)
> library(annotate)
>
> orfs <- c(
>  "YGR084C", "YDR409W",   "YFR017C",   "YDR342C",   "YBR152W",
>  "YIL014W", "YGR134W",   "YDL109C",   "YDL234C",   "YDR524C",
>  "YBR002C", "YGR063C",   "YDR034C-A", "YIR011C",
>  "YHR175W", "YHR109W",   "YGR207C",   "YCR023C",   "YER039C-A",
>  "YGL064C", "YGR249W",   "YHR172W",   "YGL226W",   "YAL064W",
>  "YDR309C", "YDR473C",   "YAL058W",   "YHL030W",   "YKL101W",
>  "YJL122W", "YBR121C",   "YIL018W",   "YDR443C",   "YBR069C",
>  "YBR011C", "YHR009C",   "YGL166W",   "YIL103W",   "YDR206W",
>  "YBL031W", "YBR162W-A", "YFL018C",   "YBR214W",   "YIL076W",
>  "YBR291C", "YKL049C",   "YIL149C",   "YDR034W-B", "YDR395W")
>
> orf2go <- getAnnMap('GO', 'org.Sc.sgd')
> go2orf <- getAnnMap('GO2ORF', 'org.Sc.sgd')
> all.orfs <- keys(orf2go)
>
> ## Do GOstats test to get significant GO ontologies
> params <- new("GOHyperGParams", geneIds=orfs,
>              universeGeneIds=all.orfs, annotation='org.Sc.sgd',
>              ontology='BP', pvalueCutoff=0.05,
>              conditional=FALSE, testDirection='over')
> GO <- hyperGTest(params)
>
> ## Look at the genes annotated in the universe for each GOID and compare
> ## with annotations for each gene in org.Sc.sgdGO
> universes <- geneIdUniverse(GO)
> uid <- names(universes)[1] ## GO:0007059
> uid.members <- universes[[uid]]
> "YKL049C" %in% uid.members ## TRUE
>
> ## Is the gene listed in GOID's (uid) universe annotated with that GOID?
> ## Given previous result, this *should* be TRUE
> go.ids <- orf2go[["YKL049C"]]
> uid %in% names(go.ids) ## FALSE
>
>
> ## Does that GOID have this gene as its member?
> ## This also should be TRUE
> check.orfs <- go2orf[[uid]]
> "YKL049C" %in% check.orfs ## FALSE
>
> ======================================
>
> Since `"YKL049C" %in% uid.members` is TRUE, I would expect the go term
> `uid` (GO:0007059) to be in both `names(go.ids)` and `check.orfs`, but
> it's not.
>
> Can someone shed some light on this for me?
>
> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-apple-darwin9.8.0
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] GO.db_2.4.1          annotate_1.26.1      org.Sc.sgd.db_2.4.1
> GOstats_2.14.0
>  [5] RSQLite_0.9-2        DBI_0.2-5            graph_1.26.0
> Category_2.14.0
>  [9] AnnotationDbi_1.10.2 Biobase_2.8.0        devtools_0.1
> stringr_0.4
> [13] roxygen_0.1-2        profr_0.1.1          digest_0.4.2
> testthat_0.3
>
> loaded via a namespace (and not attached):
>  [1] GSEABase_1.10.0   RBGL_1.24.0       XML_3.1-1
> evaluate_0.3      genefilter_1.30.0
>  [6] plyr_1.1          splines_2.11.1    survival_2.35-8
> tools_2.11.1      xtable_1.5-6
>
>
> Thanks,
> -steve
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list