[BioC] GO's to gene's

Martin Morgan mtmorgan at fhcrc.org
Mon Mar 1 14:16:48 CET 2010


On 02/28/2010 09:01 PM, Loren Engrav wrote:
> So I checked
>> collagen
> And this list matches Amigo
> So then would appear the issue lies in
>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
> Some of the names are finding no associated genes in org.Hs.egGO2EG and so
> appear as NA
> True? Possible?

Yes. GO is not specific to H. sapiens, and the Entrez Gene annotations are
not 100% comprehensive, so some GO terms have no Entrez Gene ids mapped to
them. With ifnotfound=NA, mget() returns NA for those terms.
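
As a rough illustration of how such NA entries can be spotted, here is a
small sketch on a toy list shaped like the result of
mget(..., ifnotfound=NA). The list contents are made up for the example, and
GO:9999999 is a placeholder id, not a term taken from org.Hs.egGO2EG:

```r
## Toy stand-in for mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA);
## "GO:9999999" is a made-up id standing for a term with no human genes.
egids_toy <- list(
    "GO:0010711" = c(IEP = "1471"),
    "GO:9999999" = NA,
    "GO:0032963" = c(IEA = "3091", IMP = "7148")
)

## An unmapped term shows up as a single NA
unmapped <- vapply(egids_toy,
                   function(elt) length(elt) == 1L && is.na(elt),
                   logical(1))
unmapped_terms <- names(egids_toy)[unmapped]
unmapped_terms
```

Comparing unmapped_terms against the Amigo list should show which terms
simply have no human Entrez Gene annotation.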

>>> Also I would like to omit the IEA group

maybe

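  ## drop annotations whose evidence code is IEA
  egids <- lapply(egids, function(elt)  elt[names(elt) != "IEA"])
  ## then drop GO terms that have no genes left
  egids <- egids[sapply(egids, length) != 0]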
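
To see what these two steps do, here is a small self-contained sketch on a
toy list. The entries are a hand-copied subset of the egids output quoted
below, and drop_evidence() is a hypothetical helper name, not a function from
AnnotationDbi:

```r
## Toy list: a subset of the egids output shown in the quoted message
egids_toy <- list(
    "GO:0010711" = c(IEP = "1471"),
    "GO:0030574" = c(IEA = "4312", IEA = "4313"),
    "GO:0032963" = c(IEA = "3091", IMP = "7148")
)

## Hypothetical helper: remove annotations with the given evidence codes,
## then drop GO terms that are left without any gene
drop_evidence <- function(x, codes = "IEA") {
    kept <- lapply(x, function(elt) elt[!(names(elt) %in% codes)])
    kept[sapply(kept, length) != 0]
}

filtered <- drop_evidence(egids_toy)
names(filtered)                              # "GO:0010711" "GO:0032963"
unique(unlist(filtered, use.names = FALSE))  # "1471" "7148"
```

Using %in% rather than != is only a convenience so that several evidence
codes can be excluded in one call; NA entries have no names and so drop out
in the same step.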

Martin

> My version of org.Hs.egGO2EG is 2.3.6
> 
> 
> 
> 
> 
>> From: Loren Engrav <engrav at u.washington.edu>
>> Date: Sun, 28 Feb 2010 20:33:05 -0800
>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Conversation: [BioC] GO's to gene's
>> Subject: Re: [BioC] GO's to gene's
>>
>> Oops, Amigo says there are 20 such terms, not 68 as I said before, because I
>> retrieved only BP
>>
>>
>>> From: Loren Engrav <engrav at u.washington.edu>
>>> Date: Sun, 28 Feb 2010 20:28:17 -0800
>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Conversation: [BioC] GO's to gene's
>>> Subject: Re: [BioC] GO's to gene's
>>>
>>> Ok thank you
>>> I now show
>>>> sessionInfo()
>>> R version 2.10.1 (2009-12-14)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] org.Hs.eg.db_2.3.6  GO.db_2.3.5         RSQLite_0.8-3       AnnotationDbi_1.8.1 DBI_0.2-5
>>> [6] Biobase_2.6.1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.10.1
>>>
>>> And all commands pass with no errors, however I see
>>>
>>>> egids
>>> $`GO:0010711`
>>>    IEP 
>>> "1471" 
>>>
>>> $`GO:0030199`
>>>     IEA     IEA     ISS     IEA     IMP     IMP     IMP     IMP     NAS     IMP     NAS     IMP     ISS
>>>   "302"   "304"   "538"   "871"  "1277"  "1278"  "1280"  "1281"  "1281"  "1289"  "1289"  "1290"  "1290"
>>>     NAS     IDA     NAS     IEA     IEA     IEA     IEA     IEA     NAS     ISS     IDA     ISS     NAS
>>>  "1301"  "1302"  "1303"  "1805"  "2296"  "2303"  "4010"  "4015"  "4060"  "4763"  "7042"  "7046"  "7373"
>>>     NAS     NAS 
>>>  "9508" "50509" 
>>>
>>> $`GO:0030574`
>>>      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA
>>>   "4312"   "4313"   "4314"   "4316"   "4317"   "4318"   "4319"   "4320"   "4322"   "4325"   "4327"
>>>      IEA      IDA      IMP      NAS      IEA      NAS      IEA      IEA      IEA      IEA
>>>   "5184"   "5645"   "5645"   "5653"   "5657"   "9508"   "9509"  "56547"  "64066" "140766"
>>>
>>> $`GO:0032963`
>>>    IEA    IMP 
>>> "3091" "7148" 
>>>
>>> $`GO:0032964`
>>>    IEA    IMP    IMP    TAS    IMP
>>>  "871" "1277" "1281" "1281" "1289"
>>>
>>> $`GO:0032966`
>>>    IDA     IC 
>>> "3569" "4261" 
>>>
>>> $`GO:0032967`
>>>    ISS    IDA    IDA     IC    IMP    TAS    IMP
>>>  "265" "2147" "2149" "3066" "7040" "7040" "7043"
>>>
>>> $`GO:0033342`
>>>     IMP 
>>> "23560"
>>>
>>> So many GO terms containing the word "collagen" are not listed, like
>>> 0004656
>>> 0005518
>>> etc
>>> Amigo claims there are 68 such terms and the list above has only 8
>>> What did I do wrong?
>>> Also I would like to omit the IEA group
>>>
>>> Thank you
>>>
>>>
>>>
>>>
>>>
>>>
>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>> Date: Sun, 28 Feb 2010 19:30:34 -0800
>>>> To: Loren Engrav <engrav at u.washington.edu>
>>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>> Subject: Re: [BioC] GO's to gene's
>>>>
>>>> On 02/28/2010 07:17 PM, Loren Engrav wrote:
>>>>> Thank you both
>>>>> Given my skills, it might be easier/quicker to do it "manually" with Amigo
>>>>> But I am trying both methods
>>>>>
>>>>> For the second method I get
>>>>>
>>>>>> library(GO.db)
>>>>> Loading required package: AnnotationDbi
>>>>> Loading required package: Biobase
>>>>>
>>>>> Welcome to Bioconductor
>>>>>
>>>>>   Vignettes contain introductory material. To view, type
>>>>>   'openVignette()'. To cite Bioconductor, see
>>>>>   'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>>>
>>>>> Loading required package: DBI
>>>>>> terms <- Term(GOTERM)
>>>>> Error in function (classes, fdef, mtable)  :
>>>>>   unable to find an inherited method for function "Term", for signature "GOTermsAnnDbBimap"
>>>>>
>>>>>> sessionInfo()
>>>>> R version 2.9.2 Patched (2009-09-05 r49613)
>>>>> i386-apple-darwin9.8.0
>>>>>
>>>>> locale:
>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> Update to R version 2.10 and associated Bioc packages, or for a (much)
>>>> slower solution (you'll want to check that Term and Ontology return ids
>>>> in identical order)
>>>>
>>>>   terms = eapply(GOTERM, Term)
>>>>
>>>> etc. I have
>>>>
>>>>> sessionInfo()
>>>> R version 2.10.1 Patched (2010-02-23 r51168)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GO.db_2.3.5         RSQLite_0.7-3       DBI_0.2-4
>>>> [4] AnnotationDbi_1.8.1 Biobase_2.6.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] tools_2.10.1
>>>>
>>>>
>>>> Martin
>>>>
>>>>>
>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>> Date: Sun, 28 Feb 2010 18:42:33 -0800
>>>>>> To: Vincent Carey <stvjc at channing.harvard.edu>
>>>>>> Cc: Loren Engrav <engrav at u.washington.edu>,
>>>>>> "bioconductor at stat.math.ethz.ch"
>>>>>> <bioconductor at stat.math.ethz.ch>
>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>
>>>>>> On 02/28/2010 06:14 PM, Vincent Carey wrote:
>>>>>>> Perhaps there is a package with such functionality.  However, with the
>>>>>>> GO.db package in place, you need to do a little
>>>>>>> programming, perhaps along the lines of
>>>>>>>
>>>>>>> querGO = function(str, attr = "definition", ont = "MF") {
>>>>>>>   require(GO.db, quietly = TRUE)
>>>>>>>   gc = GO_dbconn()
>>>>>>>   quer.1 = paste("select go_id, term from go_term where",
>>>>>>>   attr, "like('%")
>>>>>>>   quer.2 = "%') and ontology = '"
>>>>>>>   quer.3 = "'"
>>>>>>>   quer = paste(quer.1, str, quer.2, ont, quer.3, collapse = "",
>>>>>>>   sep = "")
>>>>>>>   dbGetQuery(gc, quer)
>>>>>>> }
>>>>>>>
>>>>>>> whereby
>>>>>>>
>>>>>>>> querGO("collagen", "term")
>>>>>>>        go_id                                                           term
>>>>>>> 1 GO:0004656                     procollagen-proline 4-dioxygenase activity
>>>>>>> 2 GO:0005518                                               collagen binding
>>>>>>> 3 GO:0008475                      procollagen-lysine 5-dioxygenase activity
>>>>>>> 4 GO:0019797                     procollagen-proline 3-dioxygenase activity
>>>>>>> 5 GO:0019798                       procollagen-proline dioxygenase activity
>>>>>>> 6 GO:0033823                       procollagen glucosyltransferase activity
>>>>>>> 7 GO:0042329 structural constituent of collagen and cuticulin-based cuticle
>>>>>>> 8 GO:0050211                     procollagen galactosyltransferase activity
>>>>>>> 9 GO:0070052                                             collagen V binding
>>>>>>>>
>>>>>>
>>>>>> Also
>>>>>>
>>>>>>   library(GO.db)
>>>>>>   terms <- Term(GOTERM)  # or maybe Definition(GOTERM) ?
>>>>>>   ontologies <- Ontology(GOTERM)
>>>>>>   collagen <- terms[grepl("collagen", terms) & ("MF" == ontologies)]
>>>>>>
>>>>>> and the next step,
>>>>>>
>>>>>>   library(org.Hs.eg.db)
>>>>>>   egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>>>>   egids <- egids[!is.na(egids)]
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 28, 2010 at 8:56 PM, Loren Engrav <engrav at u.washington.edu>
>>>>>>> wrote:
>>>>>>>> Is there a BioC package that will find all the GO terms containing some
>>>>>>>> word, like perhaps "collagen"
>>>>>>>> And then find all the genes contained within those found terms
>>>>>>>>
>>>>>>>> I scanned
>>>>>>>> GoProfiles
>>>>>>>> GOSemSim
>>>>>>>> GOstats
>>>>>>>> GoTools and
>>>>>>>> TopGO
>>>>>>>>
>>>>>>>> And could not determine that any would do that.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Martin Morgan
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>
>>>>
>>>>
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


