[BioC] converting position from '-' strand to '+' strand

Hervé Pagès hpages at fhcrc.org
Mon Apr 5 21:57:37 CEST 2010


Hi Tim,

Tim Smith wrote:
> Apologies if this seems like a trivial question.
> 
> I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand?
> For example:
> -----------------------------
> library(biomaRt)
> 
> mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
> 
> atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand')
> 
> mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj)
> mir.locs[1:5,]
>>  ensembl_gene_id chromosome_name start_position end_position strand
> 1 ENSG00000222732               5      171706206    171706319      1
> 2 ENSG00000207864               9       97847727     97847823      1
> 3 ENSG00000221173               9      129338809    129338909     -1
> 4 ENSG00000222961               5       32379501     32379581     -1
> 5 ENSG00000221058              18       51612956     51613026     -1
> 
> 
> ----------------------------
> Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand?
> many thanks!

Note that Ensembl here is using the widely adopted convention (at
UCSC, Ensembl, in GFF files, in Bioconductor, etc...) to report
feature coordinates relatively to the 5' end of the plus strand
(and to call "start position" the position of their leftmost base)
even for features that belong to the minus strand. There are a lot
of advantages to keep that representation all the way down your
analysis.

Anyway, if you really want to do that conversion, then you need to
know the chromosome lengths for the current Hsapiens genome at
Ensembl. You can easily get them with the devel version of the
GenomicFeatures package (you need R-2.11 + BioC devel):

Make a TranscriptDb object from the 'hsapiens_gene_ensembl' dataset:

   library(GenomicFeatures)
   txdb <- makeTranscriptDbFromBiomart()

Then:

   > seqlengths(txdb)[1:6]
           5        19        10         4         8        20
   180915260  59128983 135534747 191154276 146364022  63025520

Then use something like (untested):

   idx <- which(mir.locs$strand == -1)
   idx2seqlengths <- seqlengths(txdb)[mir.locs$chromosome_name[idx]]
   tmp <- mir.locs$start_position[idx]
   mir.locs$start_position[idx] <-
     idx2seqlengths - mir.locs$end_position[idx] + 1L
   mir.locs$end_position[idx] <- idx2seqlengths - tmp + 1L

Cheers,
H.

> 
> 
>       
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list