[BioC] Mapping genomic coordinates to transcript coordinates? (revived)

Pages, Herve hpages at fhcrc.org
Thu Mar 17 02:13:21 CET 2011


Hi Michael,

The naming of the function and its arguments makes it sound restricted to
transcripts and exons, but, currently, the function can use any set of
compound ranges. For example, if those ranges are in a GRangesList
object 'grl', then use 'start(grl)' for the exons starts and 'end(grl)'
for the exon ends. One restriction is that only one strand value can
be assigned for any given group of ranges (i.e. any top-level element in
the GRangesList). Assuming 'grl' satisfies this (and I guess that would
be the case for spliced reads), then you can extract the value to pass
to the strand argument with something like

  sapply(grl, function(gr) strand(gr)[1L])

(there are more efficient ways to do this)

As I said previously transcriptLocs2refLocs() is low-level and would need
to be improved or wrapped into some higher-level tool (generic function?)
that would work on GRangesList, GappedAlignments, and maybe other high-level
representations of sets of compound ranges.

Cheers,
H.


----- Original Message -----
From: "Michael Lawrence" <lawrence.michael at gene.com>
To: "Herve Pages" <hpages at fhcrc.org>
Cc: "Chris Fields" <cjfields at illinois.edu>, bioconductor at r-project.org
Sent: Wednesday, March 16, 2011 4:16:56 PM
Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived)




On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve < hpages at fhcrc.org > wrote: 


Hi Chris, Malcolm, 

There is the transcriptLocs2refLocs() function in Biostrings that 
does the reverse mapping i.e. it maps transcript coordinates to 
genomic coordinates. There is no doubt that the GenomicFeatures 
package would be a better place for this function so we should move 
it there. 

Here is how it works: 

Let's say we have 3 transcripts, 2 on the + strand and 1 on the - 
strand, with the following exons (1-based genomic coordinates): 

exonStarts <- list( 
c(2401L, 3922L, 4806L), 
c(2401L, 4806L), 
c(6291L, 5278L) 
) 

exonEnds <- list( 
c(3344L, 4421L, 5200L), 
c(3344L, 5200L), 
c(6500L, 5899L) 
) 

strand <- c("+", "+", "-") 

The lengths of the transcripts are: 

> sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) 
[1] 1839 1339 832 

Let's say we have positions on each transcript that we want to 
map to the genome. Those positions are stored in a list of 3 
integer vectors: 

tlocs <- list( 
c(1L, 200L, 1000L, 1839L), 
940:950, 
c(1L, 832L) 
) 

All those positions can be converted into genomic coordinates with 
transcriptLocs2refLocs(): 




This is a cool function but can't it be generalized for any compound set of ranges, like any GRangesList? For example, one could want to translate read-relative positions in a spliced read alignment to global positions, or the other way around. Especially the other way around. Returning an IntegerList would be one way to handle multiple possibilities. 




> library(Biostrings) 
> transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) 
[[1]] 
[1] 2401 2600 3977 5200 

[[2]] 
[1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 

[[3]] 
[1] 6500 5278 

It's vectorized and fast (implemented in C). 

Unfortunately we don't have a refLocs2transcriptLocs() function at 
the moment for going the other way around but, yes, that's something 
we should definitely have. When called on the previous result and with 
the same 'exonStarts', 'exonEnds' and 'strand' values, it should return 
the original 'tlocs'. 

There would be 2 complications for such a refLocs2transcriptLocs though: 

1. If the genomic location doesn't hit the transcript. Not a big deal, 
NA could be used for this. 

2. Sometimes (very rarely) the genomic location hits an ambiguous 
location on the transcript (e.g. for a small number of transcripts 
in UCSC knownGene track, some exons overlap). What to do then? 

Also those 2 functions should really be in GenomicFeatures, not 
in Biostrings, and their interface should be modernized to accept 
a GRangesList object instead of exonStarts, exonEnds and strand 
(the transcriptLocs2refLocs() function predates the GenomicRanges 
era). 

Here in Seattle we didn't work on this yet because of lack of time 
and also because there was apparently no demand for it so far. For 
now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures 
so it's more visible and it will also make it easier for someone 
interested to contribute. 

H. 





----- Original Message ----- 
From: "Chris Fields" < cjfields at illinois.edu > 
To: "Malcolm Cook" < MEC at stowers.org > 
Cc: " lawrence.michael at gene.com " < lawrence.michael at gene.com >, " amackey at virginia.edu " < amackey at virginia.edu >, " bioconductor at r-project.org " < bioconductor at r-project.org > 
Sent: Friday, February 25, 2011 7:45:11 AM 
Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) 

+1 (from the bioperl end). Would be nice to have a straightforward BioC-based way of doing this. Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex). 

chris 

On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: 

> Hiya, 
> 
> I am reviving this thread 
> 
> https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html 
> 
> and am poised to follow the lead proposed by Marc Carlson 
> 
> but before I do..... has anyone beaten me to it? 
> 
> And/or does anyone have a suggestion as to how to best contribute such an effort back .... 
> 
> ... as an addition to GenomicFeatures perhaps 
> 
> Best, 
> 
> Malcolm Cook 
> Stowers Institute for Medical Research - Bioinformatics 
> Kansas City, Missouri USA 
> 
> 
> 
> 
> [[alternative HTML version deleted]] 
> 
> _______________________________________________ 
> Bioconductor mailing list 
> Bioconductor at r-project.org 
> https://stat.ethz.ch/mailman/listinfo/bioconductor 
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor 

_______________________________________________ 
Bioconductor mailing list 
Bioconductor at r-project.org 
https://stat.ethz.ch/mailman/listinfo/bioconductor 
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor 

_______________________________________________ 
Bioconductor mailing list 
Bioconductor at r-project.org 
https://stat.ethz.ch/mailman/listinfo/bioconductor 
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list