[BioC] How to match Locus IDs with Gene Ontology IDs?

sdurinck@ebi.ac.uk sdurinck at ebi.ac.uk
Wed Nov 23 09:32:25 CET 2005


Hi,

There are two ways to do this in biomaRt.  The result will be the same but
the second way is a very general one and will allow you to extract
anything you want that is available from the BioMart databases.

1)  Using the simple getGO function:

> mart=martConnect()
connected to:  ensembl_mart_35

#Now retrieve the GO annotation for your id, note that you can also give a
vector of id's

> getGO(id="NM_001533",type="refseq",species="hsapiens",mart=mart)
         id       GOID                                     description
evidence
1 NM_001533 GO:0000166                              nucleotide binding    
 IEA
2 NM_001533 GO:0003723                                     RNA binding    
 TAS
3 NM_001533 GO:0006397                                 mRNA processing    
 IEA
4 NM_001533 GO:0005654                                     nucleoplasm    
 TAS
5 NM_001533 GO:0030530 heterogeneous nuclear ribonucleoprotein complex    
 TAS
6 NM_001533 GO:0005634                                         nucleus    
 IEA
           martID
1 ENSG00000104824
2 ENSG00000104824
3 ENSG00000104824
4 ENSG00000104824
5 ENSG00000104824
6 ENSG00000104824

>martDisconnect(mart)


2) Using the more powerful getBM function:


#List the available BioMart databases

> listMarts()
[1] "ensembl_mart_35" "vega_mart_35" "snp_mart_35"     "msd_mart_4"  
"uniprot_mart_17"

#Select a BioMart to use with the function useMart() and verify database
connection.

> mart=useMart("ensembl_mart_35")
connected to:  ensembl_mart_35

#List the available datasets in the selected BioMart database.

> listDatasets(mart)
                      dataset    version
1   ptroglodytes_gene_ensembl     CHIMP1
2        ggallus_gene_ensembl    WASHUC1
3    rnorvegicus_gene_ensembl    RGSC3.4
4    scerevisiae_gene_ensembl       SGD1
5  tnigroviridis_gene_ensembl TETRAODON7
6    xtropicalis_gene_ensembl       JGI3
7      frubripes_gene_ensembl      FUGU2
8  cintestinalis_gene_ensembl   CINT1.95
9       agambiae_gene_ensembl      MOZ2a
10    amellifera_gene_ensembl    AMEL2.0
11       btaurus_gene_ensembl      BDGP4
12      celegans_gene_ensembl     CEL140
13     mmusculus_gene_ensembl    NCBIM34
14   cfamiliaris_gene_ensembl    BROADD1
15 dmelanogaster_gene_ensembl      BDGP4
16        drerio_gene_ensembl     ZFISH5
17      hsapiens_gene_ensembl     NCBI35
18    mdomestica_gene_ensembl       JGI3

#Update the mart object by selecting a dataset

> mart = useDataset(dataset = "hsapiens_gene_ensembl",mart = mart)
Reading database configuration of: hsapiens_gene_ensembl
Checking attributes ... ok
Checking filters ... ok
Checking main tables ... ok

#List the available attributes for this dataset. This shows you all the
attributes that can be retrieved.

> listAttributes(mart)
  [1] "chr_name"
  [2] "chrom_start"
  [3] "chrom_end"
  [4] "chrom_strand"
  
.

#List the available filters for this dataset.  This shows all possible
filters that can be used on this dataset.

> listFilters(mart)
  [1] "chr_name"
  [2] "gene_chrom_start"
  [3] "gene_chrom_end"
  [4] "in_encode"
  
.

#Now you can do a query e.g. Get the GO id, evidence code, and description
for refseq id as filter and with values NM_001533.  Note that you can also
give a vector of identifiers as values.


> getBM(attributes=c("go_id", "go_description","evidence_code"),
filter="refseq_dna",values="NM_001533",mart=mart)
  refseq_dna      go_id                                  go_description
1  NM_001533 GO:0000166                              nucleotide binding
2  NM_001533 GO:0003723                                     RNA binding
3  NM_001533 GO:0006397                                 mRNA processing
4  NM_001533 GO:0005654                                     nucleoplasm
5  NM_001533 GO:0030530 heterogeneous nuclear ribonucleoprotein complex
6  NM_001533 GO:0005634                                         nucleus
  evidence_code
1           IEA
2           TAS
3           IEA
4           TAS
5           TAS
6           IEA


best,
Steffen


> Hi
>
> Earl F. Glynn wrote:
>> I looked at several Bioconductor packages that deal with Gene Ontology
>> (GO,
>> goTools, ontoTools), and I don't seem to find functionality that does
>> the
>> following:
>>
>>
>>
>> Given Locus ID NM_001533 I can go to NCBI
>
>   I think that is a RefSeq ID and I am also pretty sure that LocusLink
> has been retired in favor of Entrez Gene (although we are a bit slow in
> moving).
>
>>
>> http://www.ncbi.nlm.nih.gov/
>>
>> and search "Nucleotide" for "NM_001533"
>>
>>
>>
>> I can click on the NM_0015333 hit returned, and about 2/3rds of the way
>> down
>> the page under the CDS section, the go_component, go_function, and
>> go_process subsections give Gene Ontology info for NM_0015333.
>>
>
>   biomaRt might be your best choice
>
>>
>>
>> Likewise, if I do the same thing with Locus ID BC001721, I see a hit and
>> a
>> CDS section, but no gene ontology information.  That's OK, I'm not
>> expecting
>> everything to have GO information.  (E.g, of the 45,101 probesets on the
>> Mouse430_2 Affy chip, only about 4693 have GO Biological process
>> information, 2573 have celleular info, and 4875 have molecular function
>> info.  I'm not working with Affy data, but I know many IDs won't have GO
>> info, but some will.)
>
>   Again I do not believe that BC001721 is an Entrez Gene ID, and it does
> matter a bit.
>
>    You can of course always use AnnBuilder to build your own annotation
> for a microarray (if that is what you are working off).
>
>   Robert
>
>>
>>
>>
>> If I have a long list of Locus IDs, e.g., NM_001533, BC001721, ., are
>> there
>> any Bioconductor packages that "connect" these identifiers to gene
>> ontology
>> identifiers, or perhaps some other identifier (say LocusLink, aka
>> Enterez
>> Gene) that is mapped to the Gene Ontology information?
>>
>>
>>
>> Thanks for any suggestions on how this might be automated using
>> Bioconductor
>> and R.
>>
>>
>>
>> Earl F. Glynn
>>
>> Scientific Programmer
>>
>> Bioinformatics Department
>>
>> Stowers Institute for Medical Research
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
>
> --
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem at fhcrc.org
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>



More information about the Bioconductor mailing list