[BioC] RefSeq coordinates from biomaRt

Dave Tang davetingpongtang at gmail.com
Mon Nov 25 09:47:23 CET 2013


Hello,

I've been using biomaRt to fetch the genomic coordinates of RefSeq
transcripts (perhaps incorrectly?). The coordinates returned don't match
those shown in the UCSC Genome Browser, where NM_033453 is at
chr20:3190006-3204516:

library("biomaRt")
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
getBM(attributes = c('refseq_mrna', 'chromosome_name', 'start_position',
                     'end_position', 'strand'),
      filters = 'refseq_mrna', values = 'NM_033453', mart = ensembl)

     refseq_mrna chromosome_name start_position end_position strand
1   NM_033453              20        3189514      3204516      1

The coordinates seem to match this Ensembl transcript (ENST00000483354)  
instead:

getBM(attributes = c('ensembl_transcript_id', 'chromosome_name',
                     'start_position', 'end_position', 'strand'),
      filters = 'ensembl_transcript_id', values = 'ENST00000483354',
      mart = ensembl)

  ensembl_transcript_id chromosome_name start_position end_position strand
1       ENST00000483354              20        3189514      3204516      1

Here's another RefSeq model, NM_181493, which should be mapped to  
chr20:3190134-3204516:

getBM(attributes = c('refseq_mrna', 'chromosome_name', 'start_position',
                     'end_position', 'strand'),
      filters = 'refseq_mrna', values = 'NM_181493', mart = ensembl)

     refseq_mrna chromosome_name start_position end_position strand
1   NM_181493              20        3189514      3204516      1
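
To make the mismatch explicit, here is a small sketch that tabulates the two sets of coordinates and their offsets. The numbers are copied by hand from this message (nothing is fetched), and the column names are only illustrative:

```r
# Coordinates copied from this message; no query is run here.
coords <- data.frame(
  refseq_mrna   = c("NM_033453", "NM_181493"),
  ucsc_start    = c(3190006, 3190134),
  ucsc_end      = c(3204516, 3204516),
  biomart_start = c(3189514, 3189514),
  biomart_end   = c(3204516, 3204516),
  stringsAsFactors = FALSE
)

# Offsets between the UCSC coordinates and the biomaRt results.
coords$start_offset <- coords$ucsc_start - coords$biomart_start
coords$end_offset   <- coords$ucsc_end - coords$biomart_end
coords
```

Both RefSeq IDs come back with the same start and end from biomaRt, so only the start offsets differ between them; the end positions agree with UCSC in both cases.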

So it seems that the RefSeq IDs are mapped to the longest Ensembl
transcript model that covers the RefSeq model. I searched the web and
looked through the other available marts, but nothing obvious stood out.
How should I go about obtaining RefSeq coordinates using biomaRt? Or is
biomaRt Ensembl-centric?
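
One thing I have not ruled out: as far as I can tell, start_position and end_position are gene-level attributes in the Ensembl mart, while transcript_start and transcript_end are reported per transcript. Below is a minimal sketch that requests both side by side. The helper name fetch_refseq_coords is hypothetical, and I am only assuming the transcript-level values will sit closer to the UCSC RefSeq track, since Ensembl and RefSeq models are annotated independently and may still differ.

```r
# Gene-level (start_position/end_position) and transcript-level
# (transcript_start/transcript_end) coordinates in one query.
coord_attributes <- c("refseq_mrna", "ensembl_transcript_id",
                      "chromosome_name",
                      "start_position", "end_position",
                      "transcript_start", "transcript_end",
                      "strand")

# Hypothetical helper: query a mart for the given RefSeq mRNA IDs.
fetch_refseq_coords <- function(ids, mart) {
  biomaRt::getBM(attributes = coord_attributes,
                 filters = "refseq_mrna",
                 values = ids,
                 mart = mart)
}

# Usage (needs biomaRt installed and network access to Ensembl BioMart):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# fetch_refseq_coords(c("NM_033453", "NM_181493"), ensembl)
```

If the transcript-level columns differ between the rows returned for one RefSeq ID, that would at least show which Ensembl transcripts each RefSeq ID is cross-referenced to.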

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.16.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,


-- 
Dave


