[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

Thu Jul 24 17:24:18 CEST 2014

Dear Valerie,

sorry for my long delay in reacting to your message below. This addition 
to VariantAnnotation by Michael and you is great. I just would like to 
make a comment and report what looks to me as a bug related to the 
computation of cDNA positions, just in case you have suggestions and 
comments, specially with respect to what at the moment seems like a bug 
to me.

the way in which mapCoords() works is by finding all possible overlaps 
of each GRanges entry, corresponding to a variant. This is strictly not 
necessary once you have gone through locateVariants() and/or 
predictCoding() since these functions already find all possible overlaps 
of each variant to different transcripts and regions. So, to get the 
corresponding annotation of cDNA positions I had to write the following 
function which is slightly more elaborated from what you suggest below:

## this function assumes that locateVariants() or predictCoding()
## have been called so that there is a TXID metadata column
.cDNAloc <- function(gr, txdb) {
   if (!"TXID" %in% colnames(mcols(gr)))
     stop("cDNApos: metadata column 'TXID' not found in input GRanges 
object.")

   exonsbytx <- exonsBy(txdb, by="tx")
   map <- mapCoords(gr, exonsbytx, eltHits=TRUE)
   eolap <- map$eltHits
   qolap <- map$queryHits
   txids <- rep(names(exonsbytx), elementLengths(exonsbytx))[eolap]
   res <- gr[qolap]
   mcols(res) <- append(values(res),
                        DataFrame(cDNALOC=ranges(map),
                                  TXID2=as.integer(txids)))
   ## restrict results to cDNA positions of query TXID
   res <- res[res$TXID == res$TXID2]

   start(res$cDNALOC)
}

i'm going to run it with a toy example of a variant that overlaps a 
coding region of 7 transcripts in the positive strand and the three UTR 
region of one transcript in the negative strand.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
                                width=c(1, 1)), strand=c("+", "-"))
gr$TXID <- c(72816L, 73335L)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
loc <- locateVariants(gr, txdb, AllVariants())
loc$cDNALOC <- .cDNAloc(loc, txdb)

now let's look to the cDNA position of the transcript in the negative 
strand (the last entry of the resulting GRanges object):

GRanges with 1 range and 2 metadata columns:
       seqnames               ranges strand |      TXID   cDNALOC
          <Rle>            <IRanges>  <Rle> | <integer> <integer>
   [1]    chr21 [43522349, 43522349]      - |     73335      2036

and the exonic structure of this transcript

exonsbytx <- exonsBy(txdb, by="tx")
exonsbytx[["73335"]]
GRanges with 2 ranges and 3 metadata columns:
       seqnames               ranges strand |   exon_id   exon_name 
exon_rank
          <Rle>            <IRanges>  <Rle> | <integer> <character> 
<integer>
   [1]    chr21 [43528406, 43528644]      - |    260152        <NA> 
     1
   [2]    chr21 [43522244, 43524145]      - |    260151        <NA> 
     2

from this structure one would expect that the reported cDNA position 
would be:

43524145 - 43522349 + 1 = 1797

instead of 2036.

after investigating a bit where this value may come from, it seems that 
while the strand is being taken into account for the relative position 
within the exon, the widths of upstream exons is wrongly selected (in 
this case there are no upstream exons):

width(exonsbytx[["73335"]])
[1]  239 1902

239 + 1796 + 1 = 2036

so, my guess is that this must be something to be fixed in mapCoords().

best regards!!

robert.

On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
> Hi,
>
> Following up on this thread:
>
> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>
> These changes are available in VariantAnnotation 1.11.15:
>
> 1) LOCSTART, LOCEND
>
> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
> These are LOCATION-centric coordinates and can be different for each row
> so I thought these names were more descriptive than REFLOCS (discussed
> in thread). We have 2 values (start/end) instead of a single column of
> IRanges() because we can't make an IRanges() with missing values.
> Technically 'missing' ranges are represented by zero-width ranges but we
> still need a position; there is no position because there was no overlap.
>
> 2) mapCoords(), pmapCoords()
>
> These functions are courtesy of Michael. mapCoords() maps ranges onto
> another set of coordinates. You can map to cds-centric, exon-centric or
> any other type of coordinate. See ?mapCoords in both GenomicRanges and
> GenomicAlignments.
>
>
> In the previous thread we discussed added cDNA locations to
> predictCoding(). I've decided against this because it adds the
> additional overhead of the exonsBy() extraction and a findOverlaps()
> call. Not all users want the cDNA locations and those that do can now
> easily get them with mapCoords().
>
> ## The usual predictCoding setup:
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> fl <- system.file("extdata", "chr22.vcf.gz",
> package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> vcf <- renameSeqlevels(vcf, "chr22")
> coding <- predictCoding(vcf, txdb, Hsapiens)
>
> ## Exon-centric or cDNA locations:
> exonsbytx <- exonsBy(txdb, "tx")
> cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx)
> coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>
> Let me know if you run into problems or if the docs need more detail.
>
> Valerie
>

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550