[BioC] GO's to gene's

Loren Engrav engrav at u.washington.edu
Tue Mar 2 03:34:54 CET 2010


Thank you
You are clearly very good at this

To check it all, I did it manually on Amigo. Amigo found 33 genes
(limited to human and omitting IEA).

The org.Hs.eg.db method found 29 of the 33 but did not find
CST3 (1471), GO:0010711, IEP
HIF1A (3091), GO:0032963, ISS
IL6R (3570), GO:0032966, IDA and
TRAM2 (9697), GO:0032964, IMP
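
A rough cross-check against the egids output quoted further down in this
thread, as a sketch only. The toy list copies four of those entries by hand,
so it reflects org.Hs.eg.db 2.3.6 as quoted and not any current release. The
names missing_pairs, no_iea and status are made up for illustration. For each
pair listed above, it asks whether the gene appears under that GO term at all
and whether it survives the IEA filter.

```r
# Toy data copied by hand from the egids output quoted later in this thread.
# Names are GO evidence codes, values are Entrez gene IDs.
egids <- list(
  "GO:0010711" = c(IEP = "1471"),
  "GO:0032963" = c(IEA = "3091", IMP = "7148"),
  "GO:0032964" = c(IEA = "871", IMP = "1277", IMP = "1281",
                   TAS = "1281", IMP = "1289"),
  "GO:0032966" = c(IDA = "3569", IC = "4261")
)

# The four Amigo hits reported as not found: names are GO IDs, values are genes.
missing_pairs <- c("GO:0010711" = "1471", "GO:0032963" = "3091",
                   "GO:0032966" = "3570", "GO:0032964" = "9697")

# Drop IEA annotations, following the filter suggested in the quoted reply.
no_iea <- lapply(egids, function(elt) elt[names(elt) != "IEA"])

# Is a gene listed under a GO term in the given mapping?
listed <- function(mapping) {
  unname(mapply(function(go, gene) gene %in% mapping[[go]],
                names(missing_pairs), missing_pairs))
}

status <- data.frame(
  go           = names(missing_pairs),
  gene         = unname(missing_pairs),
  in_package   = listed(egids),    # present with any evidence code
  after_filter = listed(no_iea),   # still present once IEA is removed
  row.names    = NULL
)
print(status)
```

On the quoted data, 1471 is listed under GO:0010711 (IEP), 3091 is listed
under GO:0032963 only as IEA and so is dropped by the filter, and 3570 and
9697 do not appear under their terms at all.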

I suppose figuring out why, for example, org.Hs.eg.db does not map 9697 to
GO:0032964 is complex
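
One place to start, as a hedged sketch: org.Hs.eg.db also provides the
gene-to-GO map org.Hs.egGO, whose entries carry GOID, Evidence and Ontology
fields, so the annotations held for a single gene can be listed directly. The
helper name go_evidence_for is hypothetical. It returns NULL when the package
is not installed or the gene has no GO annotation in the installed release.

```r
# Sketch: list the GO annotations (with evidence codes) that the installed
# org.Hs.eg.db holds for one Entrez gene, optionally restricted to one GO term.
go_evidence_for <- function(gene, go_id = NULL) {
  if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    message("org.Hs.eg.db is not installed; nothing to look up")
    return(NULL)
  }
  # org.Hs.egGO maps an Entrez ID to a list of annotations for that gene.
  ann <- AnnotationDbi::mget(gene, org.Hs.eg.db::org.Hs.egGO,
                             ifnotfound = NA)[[1]]
  if (!is.list(ann)) return(NULL)  # no GO annotation for this gene
  out <- data.frame(
    go_id    = vapply(ann, function(a) a$GOID, character(1)),
    evidence = vapply(ann, function(a) a$Evidence, character(1)),
    ontology = vapply(ann, function(a) a$Ontology, character(1)),
    row.names = NULL
  )
  # Keep only the requested GO term, if one was given.
  if (!is.null(go_id)) out <- out[out$go_id == go_id, , drop = FALSE]
  out
}

# Example: what does the installed release say about TRAM2 and GO:0032964?
res <- go_evidence_for("9697", "GO:0032964")
print(res)
```

An empty result would mean the installed release has no direct annotation for
that pair. Note that org.Hs.egGO2EG holds direct annotations only, whereas
org.Hs.egGO2ALLEGS also includes genes annotated to child terms, which may
account for some differences from a browser that counts descendants.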

Thank you


> From: Martin Morgan <mtmorgan at fhcrc.org>
> Date: Mon, 01 Mar 2010 05:16:48 -0800
> To: Loren Engrav <engrav at u.washington.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] GO's to gene's
> 
> On 02/28/2010 09:01 PM, Loren Engrav wrote:
>> So I checked
>>> collagen
>> And this list matches Amigo
>> So then would appear the issue lies in
>>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>> Some of the names are finding no associated genes in org.Hs.egGO2EG and so
>> appear as NA
>> True? Possible?
> 
> yes. GO is not H. sapiens specific and ENTREZ ids are not 100%
> comprehensive, so some GO terms do not map to ENTREZ ids.
> 
>>>> Also I would like to omit the IEA group
> 
> maybe
> 
>   egids <- lapply(egids, function(elt)  elt[names(elt) != "IEA"])
>   egids[sapply(egids, length) != 0]
> 
> Martin
> 
>> My version of org.Hs.eg.db (which provides org.Hs.egGO2EG) is 2.3.6
>> 
>> 
>> 
>> 
>> 
>>> From: Loren Engrav <engrav at u.washington.edu>
>>> Date: Sun, 28 Feb 2010 20:33:05 -0800
>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Conversation: [BioC] GO's to gene's
>>> Subject: Re: [BioC] GO's to gene's
>>> 
>>> Oops, Amigo says there are 20 such terms, not 68 as I said before, because I
>>> retrieved only BP
>>> 
>>> 
>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>> Date: Sun, 28 Feb 2010 20:28:17 -0800
>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>> Conversation: [BioC] GO's to gene's
>>>> Subject: Re: [BioC] GO's to gene's
>>>> 
>>>> Ok thank you
>>>> I now show
>>>>> sessionInfo()
>>>> R version 2.10.1 (2009-12-14)
>>>> i386-apple-darwin9.8.0
>>>> 
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> 
>>>> other attached packages:
>>>> [1] org.Hs.eg.db_2.3.6  GO.db_2.3.5         RSQLite_0.8-3
>>>> [4] AnnotationDbi_1.8.1 DBI_0.2-5           Biobase_2.6.1
>>>> 
>>>> loaded via a namespace (and not attached):
>>>> [1] tools_2.10.1
>>>> 
>>>> And all commands pass with no errors, however I see
>>>> 
>>>>> egids
>>>> $`GO:0010711`
>>>>    IEP 
>>>> "1471" 
>>>> 
>>>> $`GO:0030199`
>>>>     IEA     IEA     ISS     IEA     IMP     IMP     IMP
>>>>   "302"   "304"   "538"   "871"  "1277"  "1278"  "1280"
>>>>     IMP     NAS     IMP     NAS     IMP     ISS     NAS
>>>>  "1281"  "1281"  "1289"  "1289"  "1290"  "1290"  "1301"
>>>>     IDA     NAS     IEA     IEA     IEA     IEA     IEA
>>>>  "1302"  "1303"  "1805"  "2296"  "2303"  "4010"  "4015"
>>>>     NAS     ISS     IDA     ISS     NAS     NAS     NAS
>>>>  "4060"  "4763"  "7042"  "7046"  "7373"  "9508" "50509"
>>>> 
>>>> $`GO:0030574`
>>>>      IEA      IEA      IEA      IEA      IEA      IEA      IEA
>>>>   "4312"   "4313"   "4314"   "4316"   "4317"   "4318"   "4319"
>>>>      IEA      IEA      IEA      IEA      IEA      IDA      IMP
>>>>   "4320"   "4322"   "4325"   "4327"   "5184"   "5645"   "5645"
>>>>      NAS      IEA      NAS      IEA      IEA      IEA      IEA
>>>>   "5653"   "5657"   "9508"   "9509"  "56547"  "64066" "140766"
>>>> 
>>>> $`GO:0032963`
>>>>    IEA    IMP 
>>>> "3091" "7148" 
>>>> 
>>>> $`GO:0032964`
>>>>    IEA    IMP    IMP    TAS    IMP
>>>>  "871" "1277" "1281" "1281" "1289"
>>>> 
>>>> $`GO:0032966`
>>>>    IDA     IC 
>>>> "3569" "4261" 
>>>> 
>>>> $`GO:0032967`
>>>>    ISS    IDA    IDA     IC    IMP    TAS    IMP
>>>>  "265" "2147" "2149" "3066" "7040" "7040" "7043"
>>>> 
>>>> $`GO:0033342`
>>>>     IMP 
>>>> "23560"
>>>> 
>>>> So many GO terms containing the word "collagen" are not listed, like
>>>> 0004656
>>>> 0005518
>>>> etc
>>>> Amigo claims there are 68 such terms and the list above has only 8
>>>> What did I do wrong?
>>>> Also I would like to omit the IEA group
>>>> 
>>>> Thank you
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>> Date: Sun, 28 Feb 2010 19:30:34 -0800
>>>>> To: Loren Engrav <engrav at u.washington.edu>
>>>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>> Subject: Re: [BioC] GO's to gene's
>>>>> 
>>>>> On 02/28/2010 07:17 PM, Loren Engrav wrote:
>>>>>> Thank you both
>>>>>> Given my skills, it might be easier/quicker to do it "manually" with
>>>>>> Amigo
>>>>>> But I am trying both methods
>>>>>> 
>>>>>> For the second method I get
>>>>>> 
>>>>>>> library(GO.db)
>>>>>> Loading required package: AnnotationDbi
>>>>>> Loading required package: Biobase
>>>>>> 
>>>>>> Welcome to Bioconductor
>>>>>> 
>>>>>>   Vignettes contain introductory material. To view, type
>>>>>>   'openVignette()'. To cite Bioconductor, see
>>>>>>   'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>>>> 
>>>>>> Loading required package: DBI
>>>>>>> terms <- Term(GOTERM)
>>>>>> Error in function (classes, fdef, mtable)  :
>>>>>>   unable to find an inherited method for function "Term", for signature
>>>>>> "GOTermsAnnDbBimap"
>>>>>> 
>>>>>>> sessionInfo()
>>>>>> R version 2.9.2 Patched (2009-09-05 r49613)
>>>>>> i386-apple-darwin9.8.0
>>>>>> 
>>>>>> locale:
>>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>> 
>>>>> Update to R version 2.10 and associated Bioc packages, or for a (much)
>>>>> slower solution (you'll want to check that Term and Ontology return ids
>>>>> in identical order)
>>>>> 
>>>>>   terms = eapply(GOTERM, Term)
>>>>> 
>>>>> etc. I have
>>>>> 
>>>>>> sessionInfo()
>>>>> R version 2.10.1 Patched (2010-02-23 r51168)
>>>>> x86_64-unknown-linux-gnu
>>>>> 
>>>>> locale:
>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>> 
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>> 
>>>>> other attached packages:
>>>>> [1] GO.db_2.3.5         RSQLite_0.7-3       DBI_0.2-4
>>>>> [4] AnnotationDbi_1.8.1 Biobase_2.6.1
>>>>> 
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.10.1
>>>>> 
>>>>> 
>>>>> Martin
>>>>> 
>>>>>> 
>>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>> Date: Sun, 28 Feb 2010 18:42:33 -0800
>>>>>>> To: Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>> Cc: Loren Engrav <engrav at u.washington.edu>,
>>>>>>> "bioconductor at stat.math.ethz.ch"
>>>>>>> <bioconductor at stat.math.ethz.ch>
>>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>> 
>>>>>>> On 02/28/2010 06:14 PM, Vincent Carey wrote:
>>>>>>>> Perhaps there is a package with such functionality.  However, with the
>>>>>>>> GO.db package in place, you need to do a little
>>>>>>>> programming, perhaps along the lines of
>>>>>>>> 
>>>>>>>> querGO = function(str, attr = "definition", ont = "MF") {
>>>>>>>>   require(GO.db, quietly = TRUE)
>>>>>>>>   gc = GO_dbconn()
>>>>>>>>   quer.1 = paste("select go_id, term from go_term where",
>>>>>>>>   attr, "like('%")
>>>>>>>>   quer.2 = "%') and ontology = '"
>>>>>>>>   quer.3 = "'"
>>>>>>>>   quer = paste(quer.1, str, quer.2, ont, quer.3, collapse = "",
>>>>>>>>   sep = "")
>>>>>>>>   dbGetQuery(gc, quer)
>>>>>>>> }
>>>>>>>> 
>>>>>>>> whereby
>>>>>>>> 
>>>>>>>>> querGO("collagen", "term")
>>>>>>>>        go_id term
>>>>>>>> 1 GO:0004656 procollagen-proline 4-dioxygenase activity
>>>>>>>> 2 GO:0005518 collagen binding
>>>>>>>> 3 GO:0008475 procollagen-lysine 5-dioxygenase activity
>>>>>>>> 4 GO:0019797 procollagen-proline 3-dioxygenase activity
>>>>>>>> 5 GO:0019798 procollagen-proline dioxygenase activity
>>>>>>>> 6 GO:0033823 procollagen glucosyltransferase activity
>>>>>>>> 7 GO:0042329 structural constituent of collagen and cuticulin-based cuticle
>>>>>>>> 8 GO:0050211 procollagen galactosyltransferase activity
>>>>>>>> 9 GO:0070052 collagen V binding
>>>>>>>>> 
>>>>>>> 
>>>>>>> Also
>>>>>>> 
>>>>>>>   library(GO.db)
>>>>>>>   terms <- Term(GOTERM)  # or maybe Definition(GOTERM) ?
>>>>>>>   ontologies <- Ontology(GOTERM)
>>>>>>>   collagen <- terms[grepl("collagen", terms) & ("MF" == ontologies)]
>>>>>>> 
>>>>>>> and the next step,
>>>>>>> 
>>>>>>>   library(org.Hs.eg.db)
>>>>>>>   egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>>>>>   egids <- egids[!is.na(egids)]
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Feb 28, 2010 at 8:56 PM, Loren Engrav <engrav at u.washington.edu>
>>>>>>>> wrote:
>>>>>>>>> Is there a BioC package that will find all the GO terms containing
>>>>>>>>> some
>>>>>>>>> word, like perhaps "collagen"
>>>>>>>>> And then find all the genes contained within those found terms
>>>>>>>>> 
>>>>>>>>> I scanned
>>>>>>>>> goProfiles
>>>>>>>>> GOSemSim
>>>>>>>>> GOstats
>>>>>>>>> goTools and
>>>>>>>>> topGO
>>>>>>>>> 
>>>>>>>>> And could not determine that any would do that.
>>>>>>>>> 
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Martin Morgan
>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>> 
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793


