[BioC] Search queries with biomaRt does not align with online queries via ensembl

James W. MacDonald jmacdon at med.umich.edu
Mon Mar 1 16:55:27 CET 2010


Hi Tony,

ATF4 isn't a valid gene name; it's a HUGO gene symbol. The gene name can 
be retrieved using the 'description' attribute. So you have to know that 
ATF4 is a gene symbol, and that Ensembl calls these things hgnc_symbols.

But your question still remains: how does one decide which of the often 
inscrutable filters/attributes to use to get a set of results? 
This is compounded by the fact that Ensembl will sometimes change what 
they call things. For instance, hgnc_symbol was once simply symbol. And 
for a while there, one had to know that for humans you used symbol, but 
for mice you used mgi_symbol...
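
To make the species point concrete, here is a small lookup sketch. The 
helper name symbol_field is made up for illustration, and the attribute 
names are simply the ones mentioned in this thread, so they may well have 
changed since.

```r
# Hypothetical lookup: which attribute holds the gene symbol for a dataset.
# Attribute names are those mentioned in this thread and may change.
symbol_fields <- c(
  hsapiens_gene_ensembl  = "hgnc_symbol",
  mmusculus_gene_ensembl = "mgi_symbol"
)

symbol_field <- function(dataset) {
  field <- symbol_fields[dataset]
  # An unknown dataset indexes to NA, so fail loudly rather than guess.
  if (is.na(field)) stop("no symbol attribute recorded for ", dataset)
  unname(field)
}

symbol_field("mmusculus_gene_ensembl")
```

Keeping the mapping in one place means only one line needs to change when 
Ensembl renames a field.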

There isn't a quick answer to this question. Steffen added a second 
column to the output of both listFilters() and listAttributes() that may 
help (although often it is the same as the first, minus the 
underscores). What it often comes down to is trial and error, choosing 
different attributes that might plausibly return what you want.
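
The trial and error can be narrowed a little by searching both columns 
for a keyword. What follows is only a sketch: find_fields is a 
hypothetical helper, and the toy data frame stands in for the real output 
of listAttributes(), whose columns are assumed to be named name and 
description.

```r
# Hypothetical helper: keep rows whose name or description matches a pattern.
# `tbl` is assumed to have columns `name` and `description`, as returned by
# listFilters() and listAttributes().
find_fields <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
         grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Toy stand-in for listAttributes(mart); the descriptions are illustrative.
toy <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "description"),
  description = c("Ensembl Gene ID", "HGNC symbol", "Description"),
  stringsAsFactors = FALSE
)
find_fields(toy, "hgnc")

# Against a live mart (needs network access):
# find_fields(listAttributes(mart), "symbol")
# find_fields(listFilters(mart), "symbol")
```

Searching the description as well as the name helps on the occasions 
when the two columns actually differ.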

One strategy I use is to try the shortest possible attribute name that 
might describe what I want. It seems the more descriptors are added to a 
given attribute, the less data there is on the back end. So for instance, 
something like hgnc_automatic_gene_name would be quite low on a list of 
attributes that I would explore. OTOH, "curated" might be more useful, 
so hgnc_curated_gene_name to me is more likely to bear fruit.

 > getBM(c("hgnc_symbol","description","hgnc_curated_gene_name"),
         "hgnc_symbol", "ATF4", mart)
   hgnc_symbol
 1        ATF4
   description
 1 Cyclic AMP-dependent transcription factor ATF-4 (cAMP-dependent
   transcription factor ATF-4)(Activating transcription factor
   4)(DNA-binding protein TAXREB67)(Cyclic AMP-responsive element-binding
   protein 2)(cAMP-responsive element-binding protein 2)(CREB-2)
   [Source:UniProtKB/Swiss-Prot;Acc:P18848]
   hgnc_curated_gene_name
 1                   ATF4
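
When querying a whole vector of symbols, it can help to list the values 
that returned nothing so they can be retried with another filter. This 
is a sketch only: unmapped is a hypothetical helper, NOT_A_GENE is a 
made-up value, and the toy data frame stands in for a real getBM() 
result.

```r
# Hypothetical helper: which query values are absent from a getBM() result?
unmapped <- function(values, result, column) {
  setdiff(values, result[[column]])
}

# Toy result standing in for
#   getBM(c("hgnc_symbol", "ensembl_gene_id"), "hgnc_symbol", symbols, mart)
symbols <- c("ATF4", "NOT_A_GENE")
res <- data.frame(
  hgnc_symbol     = "ATF4",
  ensembl_gene_id = "ENSG00000128272",
  stringsAsFactors = FALSE
)
unmapped(symbols, res, "hgnc_symbol")
```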

Best,

Jim





Tony Chiang wrote:
> Thanks Hans,
> 
> That worked much better. Quick follow up question then (I guess for anyone
> who might know the answer), when would we use the hgnc gene names rather than
> the symbols? It would appear that ATF4 is a valid hgnc gene name so I
> thought that the obvious choice would have been to filter based on
> hgnc_automatic_gene_name but this is obviously not the case. I guess what I
> am trying to ask is how do I know what to use as the filter when it would
> seem like there is an obvious candidate to choose but is not the correct one?
> 
> Cheers,
> --Tony
> 
> 
> On Mon, Mar 1, 2010 at 12:31 AM, Hotz, Hans-Rudolf <hrh at fmi.ch> wrote:
> 
>>
>>
>> On 2/28/10 7:16 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:
>>
>>> Hi Steffen et al,
>>>
>>> Quick question about a search query via biomaRt. Here is the code that I am
>>> using:
>>>
>>> *****
>>> library(biomaRt)
>>> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
>>> filters = listFilters(ensembl)
>>> attributes = listAttributes(ensembl)
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene",
>>>                "ensembl_gene_id", "hgnc_automatic_gene_name"),
>>>                filters="hgnc_automatic_gene_name", values="ATF4",
>>>                mart=ensembl)
>>> *****
>> try ' filters="hgnc_symbol" ', eg:
>>
>>
>> > getBM(attributes=c("ensembl_peptide_id", "entrezgene", "ensembl_gene_id",
>>     "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="ATF4",
>>     mart=ensembl)
>>   ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
>> 1    ENSP00000384587        468 ENSG00000128272                       NA
>> 2    ENSP00000336790        468 ENSG00000128272                       NA
>> 3    ENSP00000379912        468 ENSG00000128272                       NA
>>
>>
>> Hans
>>
>>> For me, this returns an empty data frame. But when I query ATF4 online at
>>> ensembl, I find what I need. I also looked up ATF4 at genenames.org (HUGO)
>>> and it seems that ATF4 is a valid hgnc gene name, so the filter should be
>>> fine. I guess the only other reason that I can see is which dataset I use
>>> in the useMart function. I am guessing that the online API will search
>>> through all datasets while I am only specifying a single one? If this is
>>> true, do you know of a sensible work around? I have about 150 genes that I
>>> would like mapped to the EBML ID names but using the code above with a
>>> vector of gene names, I can only map around 25...but if I manually query
>>> for some of the non-mapped gene names, I get what I am after. If I am wrong
>>> about my guess in the dataset, can you let me know what you think might be
>>> going on?
>>>
>>> Tony
>>>
>>>> sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-01-16 r50993)
>>> i386-apple-darwin10.2.0
>>>
>>> locale:
>>> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
>>>
>>> attached base packages:
>>> [1] grid      stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>>  [1] hgu133plus2.db_2.3.5 org.Hs.eg.db_2.3.6   Rgraphviz_1.25.1
>>>  [4] biomaRt_2.3.0        GOstats_2.13.0       RSQLite_0.8-1
>>>  [7] DBI_0.2-5            Category_2.13.0      AnnotationDbi_1.9.4
>>> [10] Biobase_2.7.3        RBGL_1.23.0          graph_1.25.5
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.25.1   genefilter_1.29.5 GO.db_2.3.5       GSEABase_1.9.0
>>>  [5] RCurl_1.3-1       splines_2.11.0    survival_2.35-8   tools_2.11.0
>>>  [9] XML_2.6-0         xtable_1.5-6
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826