[BioC] RefSeq coordinates from biomaRt

Dave Tang davetingpongtang at gmail.com
Mon Nov 25 13:39:12 CET 2013


On Mon, 25 Nov 2013 19:31:22 +0900, Sean Davis <sdavis2 at mail.nih.gov>  
wrote:

> Hi, Dave.
>
> There may be multiple issues going on here, so you'll have to do some  
> digging yourself when discrepancies arise like you see here. Working  
> through your first example, keep in mind that neither Ensembl nor UCSC  
> are the actual curators of the RefSeq transcripts. NCBI is the source of  
> that annotation. So, if you go to NCBI gene and search for NM_033453 and  
> then play a bit with the Genomic Sequence Viewer, you'll note that the  
> Gene (protein NP_258412.1) is mapped with the coordinates given at UCSC  
> while the mRNA is mapped with the coordinates given by Ensembl. Add to  
> this complication that UCSC does its own mapping of the transcripts  
> (even RefSeq) and you could even have a "unique" set of coordinates  
> given by UCSC (i.e., not the same as NCBI or Ensembl).

Hi Sean,

Thank you for the prompt reply.

My aim is to have a set of transcript annotations as opposed to gene
annotations; I don't really mind whether they are RefSeq or Ensembl
transcript models. But I keep running into the same problem: the
coordinates returned for either Ensembl or RefSeq transcripts are those of
the Ensembl gene, i.e. the full span that encompasses all of its
transcripts. Here's another example:

library("biomaRt")
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# ENST00000398344 is on chr22:24,313,554-24,316,773
getBM(attributes = c('chromosome_name',
                     'start_position',
                     'end_position',
                     'strand'),
      filters = 'ensembl_transcript_id',
      values = 'ENST00000398344',
      mart = ensembl)
   chromosome_name start_position end_position strand
1              22       24313554     24322660     -1

# ENST00000430101 is on chr22:24,315,293-24,316,648
getBM(attributes = c('chromosome_name',
                     'start_position',
                     'end_position',
                     'strand'),
      filters = 'ensembl_transcript_id',
      values = 'ENST00000430101',
      mart = ensembl)
   chromosome_name start_position end_position strand
1              22       24313554     24322660     -1

Is it possible to obtain the genomic coordinates of individual Ensembl
transcripts via biomaRt?
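
One thing I could try, as a rough sketch only: the queries above ask for
'start_position' and 'end_position', which return the gene span. The sketch
below assumes the mart also exposes transcript-level attributes named
'transcript_start' and 'transcript_end'; I have not confirmed this against
this release, and listAttributes(ensembl) should show whether they exist.
The helper name transcript_coords is made up for illustration.

```r
library("biomaRt")

# Hypothetical helper: ask for transcript-level rather than gene-level
# coordinates. Assumes 'transcript_start' and 'transcript_end' are valid
# attribute names for the mart in use; check with listAttributes(mart).
transcript_coords <- function(ids, mart) {
  getBM(attributes = c("ensembl_transcript_id",
                       "chromosome_name",
                       "transcript_start",
                       "transcript_end",
                       "strand"),
        filters = "ensembl_transcript_id",
        values = ids,
        mart = mart)
}
```

With a mart object created by useMart() as in the example above, the call
would be transcript_coords(c("ENST00000398344", "ENST00000430101"), ensembl).
The transcript ID is included in the attributes so that each returned row
can be matched to its transcript when several IDs are queried at once.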

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.18.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,


-- 
Dave


