[BioC] Mapping genomic coordinates to transcript coordinates? (revived)

Pages, Herve hpages at fhcrc.org
Thu Mar 3 08:58:11 CET 2011


Hi Chris, Malcolm,

There is the transcriptLocs2refLocs() function in Biostrings that
does the reverse mapping i.e. it maps transcript coordinates to
genomic coordinates. There is no doubt that the GenomicFeatures
package would be a better place for this function so we should move
it there.

Here is how it works:

Let's say we have 3 transcripts, 2 on the + strand and 1 on the -
strand, with the following exons (1-based genomic coordinates):

exonStarts <- list(
    c(2401L, 3922L, 4806L),
    c(2401L, 4806L),
    c(6291L, 5278L)
)

exonEnds <- list(
    c(3344L, 4421L, 5200L),
    c(3344L, 5200L),
    c(6500L, 5899L)
)

strand <- c("+", "+", "-")

The lengths of the transcripts are:

> sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L))
[1] 1839 1339  832

Let's say we have positions on each transcript that we want to
map to the genome. Those positions are stored in a list of 3
integer vectors:

tlocs <- list(
    c(1L, 200L, 1000L, 1839L),
    940:950,
    c(1L, 832L)
)

All those positions can be converted into genomic coordinates with
transcriptLocs2refLocs():

> library(Biostrings)
> transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand)
[[1]]
[1] 2401 2600 3977 5200

[[2]]
 [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811

[[3]]
[1] 6500 5278

It's vectorized and fast (implemented in C).

Unfortunately we don't have a refLocs2transcriptLocs() function at
the moment for going the other way around but, yes, that's something
we should definitely have. When called on the previous result and with
the same 'exonStarts', 'exonEnds' and 'strand' values, it should return
the original 'tlocs'.

There would be 2 complications for such a refLocs2transcriptLocs though:

  1. If the genomic location doesn't hit the transcript. Not a big deal,
     NA could be used for this.

  2. Sometimes (very rarely) the genomic location hits an ambiguous
     location on the transcript (e.g. for a small number of transcripts
     in UCSC knownGene track, some exons overlap). What to do then?

Also those 2 functions should really be in GenomicFeatures, not
in Biostrings, and their interface should be modernized to accept
a GRangesList object instead of exonStarts, exonEnds and strand
(the transcriptLocs2refLocs() function predates the GenomicRanges
era).

Here in Seattle we didn't work on this yet because of lack of time
and also because there was apparently no demand for it so far. For
now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures
so it's more visible and it will also make it easier for someone
interested to contribute.

H.


----- Original Message -----
From: "Chris Fields" <cjfields at illinois.edu>
To: "Malcolm Cook" <MEC at stowers.org>
Cc: "lawrence.michael at gene.com" <lawrence.michael at gene.com>, "amackey at virginia.edu" <amackey at virginia.edu>, "bioconductor at r-project.org" <bioconductor at r-project.org>
Sent: Friday, February 25, 2011 7:45:11 AM
Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates?	(revived)

+1 (from the bioperl end).  Would be nice to have a straightforward BioC-based way of doing this.  Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex).

chris

On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote:

> Hiya,
> 
> I am reviving this thread
> 
>        https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html
> 
> and am poised to follow the lead proposed by Marc Carlson
> 
> but before I do..... has anyone beaten me to it?
> 
> And/or does anyone have a suggestion as to how to best contribute such an effort back ....
> 
> ... as an addition to GenomicFeatures perhaps
> 
> Best,
> 
> Malcolm Cook
> Stowers Institute for Medical Research -  Bioinformatics
> Kansas City, Missouri  USA
> 
> 
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list