[BioC] Mapping NCBI accession numbers to GO terms

Martin Morgan mtmorgan at fhcrc.org
Fri May 21 14:56:41 CEST 2010


On 05/21/2010 04:45 AM, James F. Reid wrote:
> Hi Steve,
> 
> Term(names(get(get("NM_172496", org.Mm.egREFSEQ2EG), org.Mm.egGO)))
>            GO:0001843            GO:0005515
> "neural tube closure"     "protein binding"

I'm partial to

  library(org.Mm.eg.db)  # organism-specific library
  library(GO.db)         # GO ontology

  ## A vector of REFSEQ ids. In org.*.eg.db packages the 'Lkey' is the
  ## 'eg' part of the package name, i.e., the ENTREZ gene id, while the
  ## 'Rkey' is the thing being mapped to (here, the REFSEQ id).
  ## 'mappedRkeys' returns the Rkeys that have at least one mapping in
  ## the given map, so here we get the first three REFSEQ ids, to be
  ## used as an example
  rids <- mappedRkeys(head(org.Mm.egREFSEQ2EG, 3))

Then the maps

  egids <- org.Mm.egREFSEQ2EG[rids]        # REFSEQ to ENTREZ id
  goids <- org.Mm.egGO[mappedLkeys(egids)] # ENTREZ to GO id
  terms <- GOTERM[mappedRkeys(goids)]      # GO to TERM

we could see what we've got, e.g.,

  toTable(terms)

or maybe

  unique(toTable(terms)[,c("go_id", "Term")])

or more explicitly

  r2eg <- toTable(egids)
  eg2go <- toTable(goids)
  go2term <- unique(toTable(terms)[,c('go_id', 'Term')])
  merge(merge(r2eg, eg2go), go2term)

The first few lines of which are

> head(merge(merge(r2eg, eg2go), go2term))
       go_id gene_id    accession Evidence Ontology                Term
1 GO:0001666  235623 NM_001001144      IMP       BP response to hypoxia
2 GO:0003674   19783    NG_005612       ND       MF  molecular_function
3 GO:0003674   22746 NM_001001130       ND       MF  molecular_function
4 GO:0005515  235623 NM_001001144      IPI       MF     protein binding
5 GO:0005575   19783    NG_005612       ND       CC  cellular_component
6 GO:0005575   22746 NM_001001130       ND       CC  cellular_component
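
To make the join logic of the two merge() calls explicit, here is a minimal base-R sketch. It does not touch the annotation packages: the toy_* data frames are hypothetical stand-ins for toTable(egids), toTable(goids) and the go_id / Term table, filled with a few of the rows printed above. merge() joins on the column names the two tables share, which is why no 'by' argument was needed in the original; the sketch spells the keys out.

```r
## Toy stand-ins for the three tables (hypothetical names; values copied
## from the output shown above).
toy_r2eg <- data.frame(
  gene_id   = c("235623", "19783", "22746"),
  accession = c("NM_001001144", "NG_005612", "NM_001001130"),
  stringsAsFactors = FALSE)
toy_eg2go <- data.frame(
  gene_id  = c("235623", "235623", "19783", "22746"),
  go_id    = c("GO:0001666", "GO:0005515", "GO:0003674", "GO:0003674"),
  Evidence = c("IMP", "IPI", "ND", "ND"),
  Ontology = c("BP", "MF", "MF", "MF"),
  stringsAsFactors = FALSE)
toy_go2term <- data.frame(
  go_id = c("GO:0001666", "GO:0003674", "GO:0005515"),
  Term  = c("response to hypoxia", "molecular_function", "protein binding"),
  stringsAsFactors = FALSE)

## The default join key is the set of shared column names.
shared1 <- intersect(names(toy_r2eg), names(toy_eg2go))   # "gene_id"
step1   <- merge(toy_r2eg, toy_eg2go, by = shared1)
shared2 <- intersect(names(step1), names(toy_go2term))    # "go_id"
result  <- merge(step1, toy_go2term, by = shared2)
result
```

The key column of each merge is moved to the front of the result, which explains why go_id ends up as the first column in the output above.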

An alternative for mapping a single key might be

  Term(names(org.Mm.egGO[[ org.Mm.egREFSEQ2EG[["NM_172496"]] ]]))
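
The names() call in that one-liner is needed because each element of the ENTREZ-to-GO map is a list named by GO id, with GOID, Evidence and Ontology entries. A rough base-R sketch of the same chained lookup follows. It uses plain named lists as stand-ins, not the real map objects; the ids and terms are the ones quoted in this thread, the evidence codes are placeholders, and refseqToTerms is a hypothetical helper. The real maps may treat unknown keys differently (for example by signalling an error rather than returning NULL), so this only illustrates the shape of the data.

```r
## Plain-list stand-ins for the three maps (assumption: not Bimap objects).
toy_refseq2eg <- list(NM_172496 = "12808")
toy_eg2go <- list("12808" = list(
  "GO:0001843" = list(GOID = "GO:0001843", Evidence = NA_character_,
                      Ontology = "BP"),   # Evidence is a placeholder
  "GO:0005515" = list(GOID = "GO:0005515", Evidence = NA_character_,
                      Ontology = "MF")))  # Evidence is a placeholder
toy_go2term <- c("GO:0001843" = "neural tube closure",
                 "GO:0005515" = "protein binding")

## Hypothetical helper: accession -> gene id -> GO ids -> terms.
## Returns character(0) when the accession or gene id has no entry.
refseqToTerms <- function(acc) {
  eg <- toy_refseq2eg[[acc]]       # NULL for an unknown name in a list
  if (is.null(eg)) return(character(0))
  go <- toy_eg2go[[eg]]
  if (is.null(go)) return(character(0))
  toy_go2term[names(go)]           # the GO ids are the names of the list
}

refseqToTerms("NM_172496")
refseqToTerms("NM_000000")         # made-up accession, no entry
```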

Martin

> 
> HTH,
> J.
> 
> On 05/21/2010 12:31 PM, Steve Taylor wrote:
>> Hi,
>>
>> I too would like a simple way of getting from Refseq to GOTERM(s).
>>
>> What's the best package (and an example if possible) for getting the
>> actual term information (rather than the GO ID as below) from a Refseq
>> ID?
>>
>> Thanks,
>>
>> Steve
>>
>>>
>>>> Hello,
>>>>
>>>> I'm not sure how to retrieve GO terms associated with the NCBI
>>>> accession numbers (such as "NM_172496").
>>>>
>>>> I have found references to GOLOCUSID, but I cannot find this
>>>> environment. I have GOstats and I can access GOTERM, but not
>>>> GOLOCUSID.
>>>>
>>>>
>>> Perhaps this will get you going:
>>>
>>>> library(org.Mm.eg.db)
>>>> get("NM_172496", org.Mm.egREFSEQ2EG)
>>> [1] "12808"
>>>> names(get("12808", org.Mm.egGO))
>>> [1] "GO:0001843" "GO:0005515"
>>>
>>>> sessionInfo()
>>> R version 2.12.0 Under development (unstable) (2010-05-03 r51901)
>>> x86_64-apple-darwin10.3.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices datasets tools utils methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] org.Mm.eg.db_2.4.1 org.Hs.eg.db_2.4.1 RSQLite_0.9-0
>>> [4] DBI_0.2-5 AnnotationDbi_1.11.1 Biobase_2.9.0
>>> [7] weaver_1.15.0 codetools_0.2-2 digest_0.4.2
>>>
>>>
>>>
>>>> Anyways, I also failed to map NCBI accession numbers to Entrez IDs
>>>> using BioIDMapper:
>>>>
>>>
>>> Not bioconductor; please contact the author of that package for concerns
>>> about it.
>>>
>>>
>>>>
>>>> library(BioIDMapper)
>>>> data(glist)
>>>>> head( bio.convert( glist, 1, 24 ) )
>>>> Parsing data from UniProt
>>>> 200 IDs have been processed
>>>> 159 IDs have been processed
>>>> Parsing data from UniProt
>>>> 22 IDs have been processed
>>>> No ID found in database. 0 IDs have been processed
>>>> Done...
>>>> P_GI ACC P_ENTREZGENEID
>>>> 1 "54125119" "A6YK35\r" NA
>>>> 2 "54125311" "A6YK35\r" NA
>>>> 3 "54125051" "A6YK35\r" NA
>>>> 4 "54125369" "A6YK35\r" NA
>>>> 5 "54125435" "A7J4K5\r" NA
>>>> 6 "54125083" "A6YK35\r" NA
>>>>>
>>>>
>>>> Best regards,
>>>>
>>>> confused January
>>>>
>>>> -- 
>>>> -------- Dr. January Weiner 3 --------------------------------------
>>>> Max Planck Institute for Infection Biology
>>>> Charitéplatz 1
>>>> D-10117 Berlin, Germany
>>>> Web : www.mpiib-berlin.mpg.de
>>>> Tel : +49-30-28460514
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


