[BioC] GO's to gene's

Sean Davis seandavi at gmail.com
Tue Mar 2 04:08:11 CET 2010


On Mon, Mar 1, 2010 at 9:34 PM, Loren Engrav <engrav at u.washington.edu> wrote:
> Thank you
> You are clearly very good at this
>
> So to check it all out I did it manually on Amigo. Amigo found 33 genes
> (limited to Human and omitting IEA)
>
> The org.Hs.eg.db method found 29 of the 33 but did not find
> CST3 (1471) GO:0010711 IEP
> HIF1A (3091) GO:0032963 ISS
> IL6R (3570), GO:0032966 IDA and
> TRAM2 (9697) GO:0032964 IMP
>
> I suppose that figuring out, for example, why org.Hs.eg.db does not map 9697
> to GO:0032964 is complex

Amigo is updated each month, I believe.  The org.Hs.eg.db package is
released only every 6 months (last release, October, 2009), so there
might be some justifiable differences between the two sources.
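
One way to narrow this down is to look at what the package reports for each of the four GO terms. Below is a minimal sketch in base R. It uses toy data hand-copied from the egids printout quoted further down in this thread, and the names egids_toy and amigo_missing are made up for illustration. With the real packages one would query org.Hs.egGO2EG instead.

```r
# Toy data: hand-copied from the egids printout quoted later in this thread.
# Names are evidence codes, values are Entrez Gene IDs.
egids_toy <- list(
  "GO:0010711" = c(IEP = "1471"),
  "GO:0032963" = c(IEA = "3091", IMP = "7148"),
  "GO:0032964" = c(IEA = "871", IMP = "1277", IMP = "1281",
                   TAS = "1281", IMP = "1289"),
  "GO:0032966" = c(IDA = "3569", IC = "4261")
)

# The four gene/term pairs reported by Amigo but not by the R approach.
amigo_missing <- data.frame(
  symbol = c("CST3", "HIF1A", "IL6R", "TRAM2"),
  entrez = c("1471", "3091", "3570", "9697"),
  go_id  = c("GO:0010711", "GO:0032963", "GO:0032966", "GO:0032964"),
  amigo_evidence = c("IEP", "ISS", "IDA", "IMP"),
  stringsAsFactors = FALSE
)

# For each pair, collect the evidence code(s) found in the toy mapping,
# or NA when the gene is not attached to that term at all.
amigo_missing$pkg_evidence <- vapply(seq_len(nrow(amigo_missing)), function(i) {
  hits <- egids_toy[[amigo_missing$go_id[i]]]
  codes <- names(hits)[hits == amigo_missing$entrez[i]]
  if (length(codes) == 0) NA_character_ else paste(codes, collapse = ",")
}, character(1))

print(amigo_missing)
```

In this toy copy, 3091 is attached to GO:0032963 only with evidence code IEA, so it drops out once IEA is filtered. 3570 and 9697 are not attached to their terms at all, which would be consistent with a difference between data releases. 1471 does appear under GO:0010711 with IEP, so its absence from the final list would need a closer look at the later steps.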

Sean


>
>> From: Martin Morgan <mtmorgan at fhcrc.org>
>> Date: Mon, 01 Mar 2010 05:16:48 -0800
>> To: Loren Engrav <engrav at u.washington.edu>
>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] GO's to gene's
>>
>> On 02/28/2010 09:01 PM, Loren Engrav wrote:
>>> So I checked
>>>> collagen
>>> And this list matches Amigo
>>> So then would appear the issue lies in
>>>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>> Some of the names are finding no associated genes in org.Hs.egGO2EG and so
>>> appear as NA
>>> True? Possible?
>>
>> yes. GO is not H. sapiens specific and ENTREZ ids are not 100%
>> comprehensive, so some GO terms do not map to ENTREZ ids.
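
The NA entries come from ifnotfound=NA: keys with no entry in the map are returned as NA instead of raising an error. The same behaviour can be seen with base R's mget() on a plain environment. The sketch below uses a made-up environment, toy_map, as a stand-in for org.Hs.egGO2EG. The two mapped entries are copied from the output quoted below, and GO:0004656 is deliberately left out of the toy map to play the unmapped term.

```r
# Stand-in for org.Hs.egGO2EG: an environment keyed by GO ID (toy data).
toy_map <- new.env()
assign("GO:0033342", c(IMP = "23560"), envir = toy_map)
assign("GO:0032966", c(IDA = "3569", IC = "4261"), envir = toy_map)

wanted <- c("GO:0033342", "GO:0004656", "GO:0032966")

# Keys missing from the map come back as NA instead of causing an error.
egids <- mget(wanted, envir = toy_map, ifnotfound = NA)

# is.na() on a list is TRUE only for length-one NA elements, so this keeps
# just the GO terms that mapped to at least one gene.
mapped <- egids[!is.na(egids)]
print(names(mapped))
```

Counting sum(is.na(egids)) before dropping the NA entries gives a quick tally of how many matched GO terms have no gene in the map.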
>>
>>>>> Also I would like to omit the IEA group
>>
>> maybe
>>
>>   egids <- lapply(egids, function(elt)  elt[names(elt) != "IEA"])
>>   egids[sapply(egids, length) != 0]
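
To see what these two lines do, here is a small self-contained sketch on toy data. The first two entries are copied from the egids printout quoted below; the third term and the name egids_toy are made up for illustration. Evidence codes are stored as the names of each vector, which is why the filter tests names(elt).

```r
egids_toy <- list(
  "GO:0032963" = c(IEA = "3091", IMP = "7148"),
  "GO:0032964" = c(IEA = "871", IMP = "1277", IMP = "1281",
                   TAS = "1281", IMP = "1289"),
  "GO:9999999" = c(IEA = "1", IEA = "2")  # hypothetical term with only IEA genes
)

# Drop every gene whose evidence code (the element name) is IEA.
no_iea <- lapply(egids_toy, function(elt) elt[names(elt) != "IEA"])

# Drop GO terms left with no genes, and keep the result by assigning it.
no_iea <- no_iea[sapply(no_iea, length) != 0]

# Flatten to a unique set of Entrez Gene IDs across all remaining terms.
genes <- unique(unlist(no_iea, use.names = FALSE))
print(genes)
```

Note that the second quoted line only prints the filtered list. Assign it, as in the sketch, if the result is needed later.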
>>
>> Martin
>>
>>> My version of org.Hs.eg.db is 2.3.6
>>>
>>>
>>>
>>>
>>>
>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>> Date: Sun, 28 Feb 2010 20:33:05 -0800
>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>> Conversation: [BioC] GO's to gene's
>>>> Subject: Re: [BioC] GO's to gene's
>>>>
>>>> Oops, Amigo says there are 20 such terms, not 68 as I said before, because I
>>>> retrieved only BP
>>>>
>>>>
>>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>>> Date: Sun, 28 Feb 2010 20:28:17 -0800
>>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>> Conversation: [BioC] GO's to gene's
>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>
>>>>> Ok thank you
>>>>> I now show
>>>>>> sessionInfo()
>>>>> R version 2.10.1 (2009-12-14)
>>>>> i386-apple-darwin9.8.0
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] org.Hs.eg.db_2.3.6  GO.db_2.3.5         RSQLite_0.8-3       AnnotationDbi_1.8.1 DBI_0.2-5
>>>>> [6] Biobase_2.6.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.10.1
>>>>>
>>>>> And all commands pass with no errors, however I see
>>>>>
>>>>>> egids
>>>>> $`GO:0010711`
>>>>>    IEP
>>>>> "1471"
>>>>>
>>>>> $`GO:0030199`
>>>>>     IEA     IEA     ISS     IEA     IMP     IMP     IMP     IMP     NAS     IMP     NAS     IMP     ISS
>>>>>   "302"   "304"   "538"   "871"  "1277"  "1278"  "1280"  "1281"  "1281"  "1289"  "1289"  "1290"  "1290"
>>>>>     NAS     IDA     NAS     IEA     IEA     IEA     IEA     IEA     NAS     ISS     IDA     ISS     NAS
>>>>>  "1301"  "1302"  "1303"  "1805"  "2296"  "2303"  "4010"  "4015"  "4060"  "4763"  "7042"  "7046"  "7373"
>>>>>     NAS     NAS
>>>>>  "9508" "50509"
>>>>>
>>>>> $`GO:0030574`
>>>>>      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA
>>>>>   "4312"   "4313"   "4314"   "4316"   "4317"   "4318"   "4319"   "4320"   "4322"   "4325"   "4327"
>>>>>      IEA      IDA      IMP      NAS      IEA      NAS      IEA      IEA      IEA      IEA
>>>>>   "5184"   "5645"   "5645"   "5653"   "5657"   "9508"   "9509"  "56547"  "64066" "140766"
>>>>>
>>>>> $`GO:0032963`
>>>>>    IEA    IMP
>>>>> "3091" "7148"
>>>>>
>>>>> $`GO:0032964`
>>>>>    IEA    IMP    IMP    TAS    IMP
>>>>>  "871" "1277" "1281" "1281" "1289"
>>>>>
>>>>> $`GO:0032966`
>>>>>    IDA     IC
>>>>> "3569" "4261"
>>>>>
>>>>> $`GO:0032967`
>>>>>    ISS    IDA    IDA     IC    IMP    TAS    IMP
>>>>>  "265" "2147" "2149" "3066" "7040" "7040" "7043"
>>>>>
>>>>> $`GO:0033342`
>>>>>     IMP
>>>>> "23560"
>>>>>
>>>>> So many GO terms containing the word "collagen" are not listed, like
>>>>> 0004656
>>>>> 0005518
>>>>> etc
>>>>> Amigo claims there are 68 such terms and the list above has only 8
>>>>> What did I do wrong?
>>>>> Also I would like to omit the IEA group
>>>>>
>>>>> Thank you
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>> Date: Sun, 28 Feb 2010 19:30:34 -0800
>>>>>> To: Loren Engrav <engrav at u.washington.edu>
>>>>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>
>>>>>> On 02/28/2010 07:17 PM, Loren Engrav wrote:
>>>>>>> Thank you both
>>>>>>> Given my skills, it might be easier/quicker to do it "manually" with
>>>>>>> Amigo
>>>>>>> But I am trying both methods
>>>>>>>
>>>>>>> For the second method I get
>>>>>>>
>>>>>>>> library(GO.db)
>>>>>>> Loading required package: AnnotationDbi
>>>>>>> Loading required package: Biobase
>>>>>>>
>>>>>>> Welcome to Bioconductor
>>>>>>>
>>>>>>>   Vignettes contain introductory material. To view, type
>>>>>>>   'openVignette()'. To cite Bioconductor, see
>>>>>>>   'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>>>>>
>>>>>>> Loading required package: DBI
>>>>>>>> terms <- Term(GOTERM)
>>>>>>> Error in function (classes, fdef, mtable)  :
>>>>>>>   unable to find an inherited method for function "Term", for signature
>>>>>>> "GOTermsAnnDbBimap"
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.9.2 Patched (2009-09-05 r49613)
>>>>>>> i386-apple-darwin9.8.0
>>>>>>>
>>>>>>> locale:
>>>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> Update to R version 2.10 and associated Bioc packages, or for a (much)
>>>>>> slower solution (you'll want to check that Term and Ontology return ids
>>>>>> in identical order)
>>>>>>
>>>>>>   terms = eapply(GOTERM, Term)
>>>>>>
>>>>>> etc. I have
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 2.10.1 Patched (2010-02-23 r51168)
>>>>>> x86_64-unknown-linux-gnu
>>>>>>
>>>>>> locale:
>>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GO.db_2.3.5         RSQLite_0.7-3       DBI_0.2-4
>>>>>> [4] AnnotationDbi_1.8.1 Biobase_2.6.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] tools_2.10.1
>>>>>>
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>>> Date: Sun, 28 Feb 2010 18:42:33 -0800
>>>>>>>> To: Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>>> Cc: Loren Engrav <engrav at u.washington.edu>,
>>>>>>>> "bioconductor at stat.math.ethz.ch"
>>>>>>>> <bioconductor at stat.math.ethz.ch>
>>>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>>>
>>>>>>>> On 02/28/2010 06:14 PM, Vincent Carey wrote:
>>>>>>>>> Perhaps there is a package with such functionality.  However, with the
>>>>>>>>> GO.db package in place, you need to do a little
>>>>>>>>> programming, perhaps along the lines of
>>>>>>>>>
>>>>>>>>> querGO = function(str, attr = "definition", ont = "MF") {
>>>>>>>>>   require(GO.db, quietly = TRUE)
>>>>>>>>>   gc = GO_dbconn()
>>>>>>>>>   quer.1 = paste("select go_id, term from go_term where",
>>>>>>>>>   attr, "like('%")
>>>>>>>>>   quer.2 = "%') and ontology = '"
>>>>>>>>>   quer.3 = "'"
>>>>>>>>>   quer = paste(quer.1, str, quer.2, ont, quer.3, collapse = "",
>>>>>>>>>   sep = "")
>>>>>>>>>   dbGetQuery(gc, quer)
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> whereby
>>>>>>>>>
>>>>>>>>>> querGO("collagen", "term")
>>>>>>>>>        go_id                                                           term
>>>>>>>>> 1 GO:0004656                     procollagen-proline 4-dioxygenase activity
>>>>>>>>> 2 GO:0005518                                               collagen binding
>>>>>>>>> 3 GO:0008475                      procollagen-lysine 5-dioxygenase activity
>>>>>>>>> 4 GO:0019797                     procollagen-proline 3-dioxygenase activity
>>>>>>>>> 5 GO:0019798                       procollagen-proline dioxygenase activity
>>>>>>>>> 6 GO:0033823                       procollagen glucosyltransferase activity
>>>>>>>>> 7 GO:0042329 structural constituent of collagen and cuticulin-based cuticle
>>>>>>>>> 8 GO:0050211                     procollagen galactosyltransferase activity
>>>>>>>>> 9 GO:0070052                                             collagen V binding
>>>>>>>>>>
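
For reference, the SQL text that querGO assembles can be inspected without a database connection. The helper below, build_go_query, is a hypothetical name introduced here. It mirrors the paste logic of the function above using sprintf and only returns the string. The table and column names (go_term, go_id, term, ontology) are taken from that function.

```r
# Hypothetical helper: builds the same query text as querGO, without running it.
build_go_query <- function(str, attr = "definition", ont = "MF") {
  # "%%" is a literal percent sign in sprintf, giving the SQL wildcard %str%.
  sprintf(
    "select go_id, term from go_term where %s like('%%%s%%') and ontology = '%s'",
    attr, str, ont
  )
}

query <- build_go_query("collagen", "term")
print(query)
```

The search word is pasted straight into the statement, which is fine for interactive use but would need careful quoting if the word came from untrusted input. Passing ont = "BP" or "CC" searches the other two ontologies.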
>>>>>>>>
>>>>>>>> Also
>>>>>>>>
>>>>>>>>   library(GO.db)
>>>>>>>>   terms <- Term(GOTERM)  # or maybe Definition(GOTERM) ?
>>>>>>>>   ontologies <- Ontology(GOTERM)
>>>>>>>>   collagen <- terms[grepl("collagen", terms) & ("MF" == ontologies)]
>>>>>>>>
>>>>>>>> and the next step,
>>>>>>>>
>>>>>>>>   library(org.Hs.eg.db)
>>>>>>>>   egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>>>>>>   egids <- egids[!is.na(egids)]
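
The filtering step can be tried without GO.db on a pair of named character vectors. In the sketch below, terms and ontologies are toy stand-ins for Term(GOTERM) and Ontology(GOTERM), holding four hand-picked entries chosen for illustration. The element-wise & only makes sense if both vectors list the GO IDs in the same order, so that is checked first.

```r
# Toy stand-ins for Term(GOTERM) and Ontology(GOTERM), keyed by GO ID.
terms <- c("GO:0004656" = "procollagen-proline 4-dioxygenase activity",
           "GO:0005518" = "collagen binding",
           "GO:0030199" = "collagen fibril organization",
           "GO:0003677" = "DNA binding")
ontologies <- c("GO:0004656" = "MF", "GO:0005518" = "MF",
                "GO:0030199" = "BP", "GO:0003677" = "MF")

# Both vectors must be in the same order for the element-wise & below.
stopifnot(identical(names(terms), names(ontologies)))

# Terms mentioning "collagen", restricted to molecular function.
collagen_mf <- terms[grepl("collagen", terms) & ontologies == "MF"]

# The same search with no ontology restriction.
collagen_all <- terms[grepl("collagen", terms)]

print(names(collagen_mf))
print(names(collagen_all))
```

Dropping the ontology condition returns matches from all three ontologies, which matters when comparing against a count from Amigo that was not restricted to MF.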
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Feb 28, 2010 at 8:56 PM, Loren Engrav <engrav at u.washington.edu>
>>>>>>>>> wrote:
>>>>>>>>>> Is there a BioC package that will find all the GO terms containing some
>>>>>>>>>> word, like perhaps "collagen"
>>>>>>>>>> And then find all the genes contained within those found terms
>>>>>>>>>>
>>>>>>>>>> I scanned
>>>>>>>>>> goProfiles
>>>>>>>>>> GOSemSim
>>>>>>>>>> GOstats
>>>>>>>>>> goTools and
>>>>>>>>>> topGO
>>>>>>>>>>
>>>>>>>>>> And could not determine that any would do that.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives:
>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Martin Morgan
>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>> Phone: (206) 667-2793
>>>>>>>
>>>
>
>


