[BioC] Search queries with biomaRt does not align with online queries via ensembl

Hotz, Hans-Rudolf hrh at fmi.ch
Mon Mar 1 16:47:10 CET 2010




On 3/1/10 4:07 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:

> Thanks Hans,
> 
> That worked much better. Quick follow up question then (I guess for anyone
> who might know the answer), when would we use the hgnc gene names rather than
> the symbols? It would appear that ATF4 is a valid hgnc gene name



As far as I understand, 'hgnc_symbol' should always work (provided the symbol
exists). The HGNC assigns (or rather approves) 'symbols', while 'names'
refer to the written-out gene names, see:

http://www.genenames.org/data/hgnc_data.php?hgnc_id=786


Ensembl uses the HGNC symbol as 'Name', see:

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000128272

 => notice the label 'curated'
  
Hence, for this particular symbol, you can also use the biomart filter
"hgnc_curated_gene_name", eg:

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="ATF4",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000384587        468 ENSG00000128272                       NA
2    ENSP00000336790        468 ENSG00000128272                       NA
3    ENSP00000379912        468 ENSG00000128272                       NA
> 

However, if you look at 'IGHA2', see:

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000211890

 -> notice the label 'automatic'

Hence, the biomart filter "hgnc_curated_gene_name" will not work, but the
biomart filter "hgnc_automatic_gene_name" will work, eg:


> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="IGHA2",
mart=ensembl)
[1] ensembl_peptide_id       entrezgene               ensembl_gene_id
[4] hgnc_automatic_gene_name
<0 rows> (or 0-length row.names)
> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_automatic_gene_name", values="IGHA2",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2
> 

and 'hgnc_symbol' always works, eg:



> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_symbol", values="IGHA2",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2
> 
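
For a whole list of symbols (such as the roughly 150 genes in the original
question), a single query with 'hgnc_symbol' followed by a check for symbols
that returned nothing may be enough. This is only a minimal sketch: it assumes
that 'hgnc_symbol' is also available as an attribute in your mart and that
'ensembl' is the Mart object from useMart(). The function names map_symbols
and unmapped_symbols are made up for illustration, they are not part of
biomaRt.

```r
# Hypothetical helper: map a vector of HGNC symbols in one getBM() call.
# `mart` is a Mart object as returned by biomaRt::useMart().
# Assumes "hgnc_symbol" exists both as a filter and as an attribute.
map_symbols <- function(symbols, mart) {
  biomaRt::getBM(
    attributes = c("hgnc_symbol", "ensembl_gene_id", "ensembl_peptide_id"),
    filters    = "hgnc_symbol",
    values     = symbols,
    mart       = mart
  )
}

# Hypothetical helper: report the input symbols that are missing from a
# getBM()-style result (a data frame with a symbol column).
unmapped_symbols <- function(symbols, result, column = "hgnc_symbol") {
  setdiff(symbols, unique(result[[column]]))
}

# Usage (needs network access to the BioMart server, so left commented out):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# res     <- map_symbols(c("ATF4", "IGHA2"), ensembl)
# unmapped_symbols(c("ATF4", "IGHA2"), res)
```

Including the symbol itself among the attributes is what makes the check
possible, since getBM() silently drops values that match nothing.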




Now, the follow-up question is: how does Ensembl distinguish between
'curated' and 'automatic'? I am no longer fully familiar with Ensembl,
but I assume that the entry for IGHA2 has no support (yet) from their
manual curators. There is also no link back to Vega on the HGNC web page
for 'IGHA2', whereas there is one for 'ATF4'.
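
Since the right filter depends on how a name was assigned, it can help to
look at which HGNC-related filters the mart actually offers before choosing
one. A small sketch, assuming the table returned by listFilters() has the
columns 'name' and 'description'; the helper name find_filters is made up
for illustration:

```r
# Hypothetical helper: keep the rows of a listFilters()-style table whose
# name or description matches a pattern (case-insensitive).
find_filters <- function(filter_table, pattern) {
  hits <- grepl(pattern, filter_table$name, ignore.case = TRUE) |
    grepl(pattern, filter_table$description, ignore.case = TRUE)
  filter_table[hits, , drop = FALSE]
}

# Usage (needs network access to the BioMart server, so left commented out):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# find_filters(biomaRt::listFilters(ensembl), "hgnc")
```

The filter names offered can differ between Ensembl releases, so the output
of this lookup is a better guide than any fixed list.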


I hope this clarifies the situation

Hans




> so I
> thought that the obvious choice would have been to filter based on
> hgnc_automatic_gene_name but this is obviously not the case. I guess what I
> am trying to ask is how do I know what to use as the filter when it would
> seem like there is an obvious candidate to chose but is not the correct one?
> 
> Cheers,
> --Tony
> 
> 
> On Mon, Mar 1, 2010 at 12:31 AM, Hotz, Hans-Rudolf <hrh at fmi.ch> wrote:
> 
>> 
>> 
>> 
>> On 2/28/10 7:16 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:
>> 
>>> Hi Steffen et al,
>>> 
>>> Quick question about a search query via biomaRt. Here is the code that I
>>> am using:
>>> 
>>> *****
>>> library(biomaRt)
>>> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
>>> filters = listFilters(ensembl)
>>> attributes = listAttributes(ensembl)
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene",
>>>                "ensembl_gene_id", "hgnc_automatic_gene_name"),
>>>                filters="hgnc_automatic_gene_name", values="ATF4",
>>>                mart=ensembl)
>>> *****
>> 
>> try ' filters="hgnc_symbol" ', eg:
>> 
>> 
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
>> "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="ATF4",
>> mart=ensembl)
>>   ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
>> 1    ENSP00000384587        468 ENSG00000128272                       NA
>> 2    ENSP00000336790        468 ENSG00000128272                       NA
>> 3    ENSP00000379912        468 ENSG00000128272                       NA
>>> 
>> 
>> 
>> 
>> Hans
>> 
>>> For me, this returns an empty data frame. But when I query ATF4 online at
>>> ensembl, I find what I need. I also looked up ATF4 at genenames.org (HUGO)
>>> and it seems that ATF4 is a valid hgnc gene name, so the filter should be
>>> fine. I guess the only other reason that I can see is which dataset I use
>>> in the useMart function. I am guessing that the online API will search
>>> through all datasets while I am only specifying a single one? If this is
>>> true, do you know of a sensible workaround? I have about 150 genes that I
>>> would like mapped to the EBML ID names but using the code above with a
>>> vector of gene names, I can only map around 25...but if I manually query
>>> for some of the non-mapped gene names, I get what I am after. If I am
>>> wrong about my guess in the dataset, can you let me know what you think
>>> might be going on?
>>> 
>>> Tony
>>> 
>>>> sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-01-16 r50993)
>>> i386-apple-darwin10.2.0
>>> 
>>> locale:
>>> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
>>> 
>>> attached base packages:
>>> [1] grid      stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> 
>>> other attached packages:
>>>  [1] hgu133plus2.db_2.3.5 org.Hs.eg.db_2.3.6   Rgraphviz_1.25.1
>>>  [4] biomaRt_2.3.0        GOstats_2.13.0       RSQLite_0.8-1
>>>  [7] DBI_0.2-5            Category_2.13.0      AnnotationDbi_1.9.4
>>> [10] Biobase_2.7.3        RBGL_1.23.0          graph_1.25.5
>>> 
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.25.1   genefilter_1.29.5 GO.db_2.3.5       GSEABase_1.9.0
>>>  [5] RCurl_1.3-1       splines_2.11.0    survival_2.35-8   tools_2.11.0
>>>  [9] XML_2.6-0         xtable_1.5-6
>>> 
>> 
>>


