[BioC] GO's to gene's

Martin Morgan mtmorgan at fhcrc.org
Tue Mar 2 04:49:05 CET 2010


On 03/01/2010 06:34 PM, Loren Engrav wrote:
> Thank you
> You are clearly very good at this
> 
> So to check it all out I did it manually on Amigo. Amigo found 33 genes
> (limited to Human and omitting IEA)
> 
> The org.Hs.eg.db method found 29 of the 33 but did not find
> CST3 (1471) GO:0010711 IEP
> HIF1A (3091) GO:0032963 ISS
> IL6R (3570), GO:0032966 IDA and
> TRAM2 (9697) GO:0032964 IMP
> 
> I suppose to figure out, for example, why org.Hs.eg.db does not map 9697 to
> GO:0032964 is complex

> names(org.Hs.egGO[["9697"]])
[1] "GO:0015031" "GO:0065002" "GO:0016020" "GO:0016021"

Hmm, what are the offspring / ancestors of GO:0032964 ?

> GOBPOFFSPRING[["GO:0032964"]]
[1] "GO:0032965" "GO:0032966" "GO:0032967"
> GOBPANCESTOR[["GO:0032964"]]
 [1] "all"        "GO:0008152" "GO:0008150" "GO:0009058" "GO:0009059"
 [6] "GO:0032501" "GO:0032963" "GO:0043170" "GO:0044236" "GO:0044259"
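
The same comparison can be scripted rather than eyeballed. A minimal sketch, with the values printed above copied into plain character vectors (the variable names are illustrative only); in a live session the vectors would come from org.Hs.egGO, GOBPOFFSPRING and GOBPANCESTOR instead:

```r
## Direct GO annotations of Entrez gene 9697, as printed above
gene_terms <- c("GO:0015031", "GO:0065002", "GO:0016020", "GO:0016021")

## Offspring and ancestors of GO:0032964, as printed above
offspring <- c("GO:0032965", "GO:0032966", "GO:0032967")
ancestors <- c("all", "GO:0008152", "GO:0008150", "GO:0009058", "GO:0009059",
               "GO:0032501", "GO:0032963", "GO:0043170", "GO:0044236",
               "GO:0044259")

## Terms the gene shares with GO:0032964 or any of its relatives
related <- c("GO:0032964", offspring, ancestors)
shared <- intersect(gene_terms, related)
length(shared)
```

An empty intersection means the gene is not annotated to the term, nor to anything directly above or below it in the BP ontology.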

Nope nothing jumping out. Where's the GO data coming from?

> org.Hs.eg() ## or GO()
[snip]
Date for GO data: 20090830

Whereas AMIGO says (at the bottom of each page)

  GO database release 2010-02-27

so the two sources are roughly six months apart, which looks like a
likely cause but would need more substantial investigation to confirm.
As for the merits of a 'current' db (AmiGO) vs. a 'versioned' db
(GO.db), see the mailing list archives; the trade-off is current
state-of-knowledge vs. reproducibility (how would we redo the analysis
we did last month and get the same results with AmiGO?).
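
One way to benefit from versioned packages is to store the session details next to each result. A minimal sketch in base R; snapshotResult and the file name are hypothetical names chosen for illustration:

```r
## Bundle a result with the date and the session (R and package versions)
## that produced it. snapshotResult is a hypothetical helper name.
snapshotResult <- function(result) {
    list(result = result, date = Sys.Date(), session = sessionInfo())
}

snap <- snapshotResult(c(IEP = "1471"))
path <- file.path(tempdir(), "collagen-genes.rds")   # illustrative location
saveRDS(snap, path)

## Later: reload and see which versions were in use
old <- readRDS(path)
old$session$R.version$version.string
```

With the annotation packages attached, the stored sessionInfo() also records their versions, so the same package versions can be reinstalled to repeat the analysis.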

On the other hand

> org.Hs.egGO2EG[["GO:0010711"]]
   IEP
"1471"
> GOTERM[["GO:0010711"]]
GOID: GO:0010711
Term: negative regulation of collagen catabolic process
Ontology: BP
Definition: Any process that decreases the rate, frequency or extent of
    collagen catabolism. Collagen catabolism is the proteolytic
    chemical reactions and pathways resulting in the breakdown of
    collagen in the extracellular matrix.
Synonym: down regulation of collagen catabolic process
Synonym: down-regulation of collagen catabolic process
Synonym: downregulation of collagen catabolic process
Synonym: inhibition of collagen catabolic process
Synonym: negative regulation of collagen breakdown
Synonym: negative regulation of collagen catabolism
Synonym: negative regulation of collagen degradation

so why didn't we find that one?

> terms <- Term(GOTERM)  # or maybe Definition(GOTERM)
> "GO:0010711" %in% names(terms)
[1] TRUE
> terms[["GO:0010711"]]
[1] "negative regulation of collagen catabolic process"

yep it's there

> ontologies <- Ontology(GOTERM)
> ontologies[["GO:0010711"]]
[1] "BP"
> collagen <- terms[grepl("collagen", terms) & ("BP" == ontologies)]
> collagen[["GO:0010711"]]
[1] "negative regulation of collagen catabolic process"

yep it's there (or were we looking for MF, as below?).
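
The subsetting above relies on terms and ontologies listing the GO ids in the same order. A minimal sketch of guarding against a mismatch, on a three-term toy subset taken from this thread (the vectors stand in for Term(GOTERM) and Ontology(GOTERM)):

```r
terms <- c("GO:0004656" = "procollagen-proline 4-dioxygenase activity",
           "GO:0005518" = "collagen binding",
           "GO:0010711" = "negative regulation of collagen catabolic process")
## Deliberately in a different order, to show why the check matters
ontologies <- c("GO:0010711" = "BP", "GO:0004656" = "MF", "GO:0005518" = "MF")

## Do the two vectors list the ids in the same order?
same_order <- identical(names(terms), names(ontologies))

## Align by name before combining the two logical conditions
ontologies <- ontologies[names(terms)]
collagen <- terms[grepl("collagen", terms) & ("BP" == ontologies)]
names(collagen)
```

Indexing by name makes the result independent of the order in which either vector was returned.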

> egids[["GO:0010711"]]
   IEP
"1471"

yep it's there. So this makes me think it's a programming error or a
miscommunication. I'd suggest you write a little function

getGO <-
    function(termLike, ontology, excludeEvidence)
{
    ## a few lines of code here, representing the query you perform
}

and perhaps sharing that with the list will shed some light.
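
For illustration, one possible shape for such a function. This is a sketch under assumptions, not the query actually run in this thread. It takes the lookup tables as extra arguments (hypothetical parameters terms, ontologies and go2eg) so that it runs on the toy data below; in a real session these would be Term(GOTERM), Ontology(GOTERM) and as.list(org.Hs.egGO2EG).

```r
getGO <-
    function(termLike, ontology, excludeEvidence = character(),
             terms, ontologies, go2eg)
{
    ## GO ids whose term contains termLike, within the requested ontology
    ontologies <- ontologies[names(terms)]
    hits <- names(terms)[grepl(termLike, terms, fixed = TRUE) &
                         ontologies == ontology]
    ## keep only ids with gene mappings, then drop unwanted evidence codes
    egids <- go2eg[intersect(hits, names(go2eg))]
    egids <- lapply(egids, function(elt) elt[!names(elt) %in% excludeEvidence])
    egids[vapply(egids, length, integer(1)) != 0]
}

## Toy data copied from output shown in this thread; the term name for
## GO:0032963 does not appear in the thread and is an assumption
terms <- c("GO:0005518" = "collagen binding",
           "GO:0010711" = "negative regulation of collagen catabolic process",
           "GO:0032963" = "collagen metabolic process")
ontologies <- c("GO:0005518" = "MF", "GO:0010711" = "BP", "GO:0032963" = "BP")
go2eg <- list("GO:0010711" = c(IEP = "1471"),
              "GO:0032963" = c(IEA = "3091", IMP = "7148"))

res <- getGO("collagen", "BP", "IEA", terms, ontologies, go2eg)
res
```

Having the term pattern, ontology and excluded evidence codes as explicit arguments makes it easy to see whether a missing gene was dropped by the text match, the ontology filter or the evidence filter.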

Martin


> 
> Thank you
> 
> 
>> From: Martin Morgan <mtmorgan at fhcrc.org>
>> Date: Mon, 01 Mar 2010 05:16:48 -0800
>> To: Loren Engrav <engrav at u.washington.edu>
>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] GO's to gene's
>>
>> On 02/28/2010 09:01 PM, Loren Engrav wrote:
>>> So I checked
>>>> collagen
>>> And this list matches Amigo
>>> So then would appear the issue lies in
>>>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>> Some of the names are finding no associated genes in org.Hs.egGO2EG and so
>>> appear as NA
>>> True? Possible?
>>
>> yes. GO is not H. sapiens specific and ENTREZ ids are not 100%
>> comprehensive, so some GO terms do not map to ENTREZ ids.
>>
>>>>> Also I would like to omit the IEA group
>>
>> maybe
>>
>>   egids <- lapply(egids, function(elt)  elt[names(elt) != "IEA"])
>>   egids[sapply(egids, length) != 0]
>>
>> Martin
>>
>>> My version of org.Hs.egGO2EG is 2.3.6
>>>
>>>
>>>
>>>
>>>
>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>> Date: Sun, 28 Feb 2010 20:33:05 -0800
>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>> Conversation: [BioC] GO's to gene's
>>>> Subject: Re: [BioC] GO's to gene's
>>>>
>>>> Oopps, Amigo says there are 20 such terms, not 68 as I said before, cuz I
>>>> retrieved only BP
>>>>
>>>>
>>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>>> Date: Sun, 28 Feb 2010 20:28:17 -0800
>>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>> Conversation: [BioC] GO's to gene's
>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>
>>>>> Ok thank you
>>>>> I now show
>>>>>> sessionInfo()
>>>>> R version 2.10.1 (2009-12-14)
>>>>> i386-apple-darwin9.8.0
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] org.Hs.eg.db_2.3.6  GO.db_2.3.5         RSQLite_0.8-3
>>>>> [4] AnnotationDbi_1.8.1 DBI_0.2-5           Biobase_2.6.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.10.1
>>>>>
>>>>> And all commands pass with no errors, however I see
>>>>>
>>>>>> egids
>>>>> $`GO:0010711`
>>>>>    IEP 
>>>>> "1471" 
>>>>>
>>>>> $`GO:0030199`
>>>>>     IEA     IEA     ISS     IEA     IMP     IMP     IMP     IMP
>>>>>   "302"   "304"   "538"   "871"  "1277"  "1278"  "1280"  "1281"
>>>>>     NAS     IMP     NAS     IMP     ISS     NAS     IDA     NAS
>>>>>  "1281"  "1289"  "1289"  "1290"  "1290"  "1301"  "1302"  "1303"
>>>>>     IEA     IEA     IEA     IEA     IEA     NAS     ISS     IDA
>>>>>  "1805"  "2296"  "2303"  "4010"  "4015"  "4060"  "4763"  "7042"
>>>>>     ISS     NAS     NAS     NAS
>>>>>  "7046"  "7373"  "9508" "50509"
>>>>>
>>>>> $`GO:0030574`
>>>>>      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA
>>>>>   "4312"   "4313"   "4314"   "4316"   "4317"   "4318"   "4319"   "4320"
>>>>>      IEA      IEA      IEA      IEA      IDA      IMP      NAS      IEA
>>>>>   "4322"   "4325"   "4327"   "5184"   "5645"   "5645"   "5653"   "5657"
>>>>>      NAS      IEA      IEA      IEA      IEA
>>>>>   "9508"   "9509"  "56547"  "64066" "140766"
>>>>>
>>>>> $`GO:0032963`
>>>>>    IEA    IMP 
>>>>> "3091" "7148" 
>>>>>
>>>>> $`GO:0032964`
>>>>>    IEA    IMP    IMP    TAS    IMP
>>>>>  "871" "1277" "1281" "1281" "1289"
>>>>>
>>>>> $`GO:0032966`
>>>>>    IDA     IC 
>>>>> "3569" "4261" 
>>>>>
>>>>> $`GO:0032967`
>>>>>    ISS    IDA    IDA     IC    IMP    TAS    IMP
>>>>>  "265" "2147" "2149" "3066" "7040" "7040" "7043"
>>>>>
>>>>> $`GO:0033342`
>>>>>     IMP 
>>>>> "23560"
>>>>>
>>>>> So many GO terms containing the word "collagen" are not listed, like
>>>>> 0004656
>>>>> 0005518
>>>>> etc
>>>>> Amigo claims there are 68 such terms and the list above has only 8
>>>>> What did I do wrong?
>>>>> Also I would like to omit the IEA group
>>>>>
>>>>> Thank you
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>> Date: Sun, 28 Feb 2010 19:30:34 -0800
>>>>>> To: Loren Engrav <engrav at u.washington.edu>
>>>>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>
>>>>>> On 02/28/2010 07:17 PM, Loren Engrav wrote:
>>>>>>> Thank you both
>>>>>>> Given my skills, it might be easier/quicker to do it "manually" with
>>>>>>> Amigo
>>>>>>> But I am trying both methods
>>>>>>>
>>>>>>> For the second method I get
>>>>>>>
>>>>>>>> library(GO.db)
>>>>>>> Loading required package: AnnotationDbi
>>>>>>> Loading required package: Biobase
>>>>>>>
>>>>>>> Welcome to Bioconductor
>>>>>>>
>>>>>>>   Vignettes contain introductory material. To view, type
>>>>>>>   'openVignette()'. To cite Bioconductor, see
>>>>>>>   'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>>>>>
>>>>>>> Loading required package: DBI
>>>>>>>> terms <- Term(GOTERM)
>>>>>>> Error in function (classes, fdef, mtable)  :
>>>>>>>   unable to find an inherited method for function "Term", for signature
>>>>>>> "GOTermsAnnDbBimap"
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.9.2 Patched (2009-09-05 r49613)
>>>>>>> i386-apple-darwin9.8.0
>>>>>>>
>>>>>>> locale:
>>>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> Update to R version 2.10 and associated Bioc packages, or for a (much)
>>>>>> slower solution (you'll want to check that Term and Ontology return ids
>>>>>> in identical order)
>>>>>>
>>>>>>   terms = eapply(GOTERM, Term)
>>>>>>
>>>>>> etc. I have
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 2.10.1 Patched (2010-02-23 r51168)
>>>>>> x86_64-unknown-linux-gnu
>>>>>>
>>>>>> locale:
>>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GO.db_2.3.5         RSQLite_0.7-3       DBI_0.2-4
>>>>>> [4] AnnotationDbi_1.8.1 Biobase_2.6.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] tools_2.10.1
>>>>>>
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>>> Date: Sun, 28 Feb 2010 18:42:33 -0800
>>>>>>>> To: Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>>> Cc: Loren Engrav <engrav at u.washington.edu>,
>>>>>>>> "bioconductor at stat.math.ethz.ch"
>>>>>>>> <bioconductor at stat.math.ethz.ch>
>>>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>>>
>>>>>>>> On 02/28/2010 06:14 PM, Vincent Carey wrote:
>>>>>>>>> Perhaps there is a package with such functionality.  However, with the
>>>>>>>>> GO.db package in place, you need to do a little
>>>>>>>>> programming, perhaps along the lines of
>>>>>>>>>
>>>>>>>>> querGO = function(str, attr = "definition", ont = "MF") {
>>>>>>>>>   require(GO.db, quietly = TRUE)
>>>>>>>>>   gc = GO_dbconn()
>>>>>>>>>   quer.1 = paste("select go_id, term from go_term where",
>>>>>>>>>   attr, "like('%")
>>>>>>>>>   quer.2 = "%') and ontology = '"
>>>>>>>>>   quer.3 = "'"
>>>>>>>>>   quer = paste(quer.1, str, quer.2, ont, quer.3, collapse = "",
>>>>>>>>>   sep = "")
>>>>>>>>>   dbGetQuery(gc, quer)
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> whereby
>>>>>>>>>
>>>>>>>>>> querGO("collagen", "term")
>>>>>>>>>        go_id                                                           term
>>>>>>>>> 1 GO:0004656                     procollagen-proline 4-dioxygenase activity
>>>>>>>>> 2 GO:0005518                                               collagen binding
>>>>>>>>> 3 GO:0008475                      procollagen-lysine 5-dioxygenase activity
>>>>>>>>> 4 GO:0019797                     procollagen-proline 3-dioxygenase activity
>>>>>>>>> 5 GO:0019798                       procollagen-proline dioxygenase activity
>>>>>>>>> 6 GO:0033823                       procollagen glucosyltransferase activity
>>>>>>>>> 7 GO:0042329 structural constituent of collagen and cuticulin-based cuticle
>>>>>>>>> 8 GO:0050211                     procollagen galactosyltransferase activity
>>>>>>>>> 9 GO:0070052                                             collagen V binding
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Also
>>>>>>>>
>>>>>>>>   library(GO.db)
>>>>>>>>   terms <- Term(GOTERM)  # or maybe Definition(GOTERM) ?
>>>>>>>>   ontologies <- Ontology(GOTERM)
>>>>>>>>   collagen <- terms[grepl("collagen", terms) & ("MF" == ontologies)]
>>>>>>>>
>>>>>>>> and the next step,
>>>>>>>>
>>>>>>>>   library(org.Hs.eg.db)
>>>>>>>>   egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>>>>>>   egids <- egids[!is.na(egids)]
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Feb 28, 2010 at 8:56 PM, Loren Engrav <engrav at u.washington.edu>
>>>>>>>>> wrote:
>>>>>>>>>> Is there a BioC package that will find all the GO terms containing
>>>>>>>>>> some
>>>>>>>>>> word, like perhaps "collagen"
>>>>>>>>>> And then find all the genes contained within those found terms
>>>>>>>>>>
>>>>>>>>>> I scanned
>>>>>>>>>> GoProfiles
>>>>>>>>>> GOSemSim
>>>>>>>>>> GOstats
>>>>>>>>>> GoTools and
>>>>>>>>>> TopGO
>>>>>>>>>>
>>>>>>>>>> And could not determine that any would do that.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>        [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives:
>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Martin Morgan
>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>> Phone: (206) 667-2793
>>>>>>>
>>>>>>
>>>>>>
>>>
>>
>>
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


