[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

Mon Jul 28 19:25:06 CEST 2014

Hi Robert,

Now fixed in GenomicRanges 1.17.28.

I introduced this bug when trying to be too clever with ignore.strand. 
Thanks for catching it and, as always, providing a clear reproducible 
example.

Valerie

On 07/24/2014 08:24 AM, Robert Castelo wrote:
> Dear Valerie,
>
> sorry for my long delay in reacting to your message below. This addition
> to VariantAnnotation by Michael and you is great. I just would like to
> make a comment and report what looks to me as a bug related to the
> computation of cDNA positions, just in case you have suggestions and
> comments, specially with respect to what at the moment seems like a bug
> to me.
>
> the way in which mapCoords() works is by finding all possible overlaps
> of each GRanges entry, corresponding to a variant. This is strictly not
> necessary once you have gone through locateVariants() and/or
> predictCoding() since these functions already find all possible overlaps
> of each variant to different transcripts and regions. So, to get the
> corresponding annotation of cDNA positions I had to write the following
> function which is slightly more elaborated from what you suggest below:
>
> ## this function assumes that locateVariants() or predictCoding()
> ## have been called so that there is a TXID metadata column
> .cDNAloc <- function(gr, txdb) {
>    if (!"TXID" %in% colnames(mcols(gr)))
>      stop("cDNApos: metadata column 'TXID' not found in input GRanges
> object.")
>
>    exonsbytx <- exonsBy(txdb, by="tx")
>    map <- mapCoords(gr, exonsbytx, eltHits=TRUE)
>    eolap <- map$eltHits
>    qolap <- map$queryHits
>    txids <- rep(names(exonsbytx), elementLengths(exonsbytx))[eolap]
>    res <- gr[qolap]
>    mcols(res) <- append(values(res),
>                         DataFrame(cDNALOC=ranges(map),
>                                   TXID2=as.integer(txids)))
>    ## restrict results to cDNA positions of query TXID
>    res <- res[res$TXID == res$TXID2]
>
>    start(res$cDNALOC)
> }
>
> i'm going to run it with a toy example of a variant that overlaps a
> coding region of 7 transcripts in the positive strand and the three UTR
> region of one transcript in the negative strand.
>
> library(VariantAnnotation)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
>                                 width=c(1, 1)), strand=c("+", "-"))
> gr$TXID <- c(72816L, 73335L)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> loc <- locateVariants(gr, txdb, AllVariants())
> loc$cDNALOC <- .cDNAloc(loc, txdb)
>
> now let's look to the cDNA position of the transcript in the negative
> strand (the last entry of the resulting GRanges object):
>
> GRanges with 1 range and 2 metadata columns:
>        seqnames               ranges strand |      TXID   cDNALOC
>           <Rle>            <IRanges>  <Rle> | <integer> <integer>
>    [1]    chr21 [43522349, 43522349]      - |     73335      2036
>
> and the exonic structure of this transcript
>
> exonsbytx <- exonsBy(txdb, by="tx")
> exonsbytx[["73335"]]
> GRanges with 2 ranges and 3 metadata columns:
>        seqnames               ranges strand |   exon_id   exon_name
> exon_rank
>           <Rle>            <IRanges>  <Rle> | <integer> <character>
> <integer>
>    [1]    chr21 [43528406, 43528644]      - |    260152        <NA>     1
>    [2]    chr21 [43522244, 43524145]      - |    260151        <NA>     2
>
> from this structure one would expect that the reported cDNA position
> would be:
>
> 43524145 - 43522349 + 1 = 1797
>
> instead of 2036.
>
> after investigating a bit where this value may come from, it seems that
> while the strand is being taken into account for the relative position
> within the exon, the widths of upstream exons is wrongly selected (in
> this case there are no upstream exons):
>
> width(exonsbytx[["73335"]])
> [1]  239 1902
>
> 239 + 1796 + 1 = 2036
>
>
> so, my guess is that this must be something to be fixed in mapCoords().
>
>
> best regards!!
>
> robert.
>
>
> On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
>> Hi,
>>
>> Following up on this thread:
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>>
>> These changes are available in VariantAnnotation 1.11.15:
>>
>> 1) LOCSTART, LOCEND
>>
>> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
>> These are LOCATION-centric coordinates and can be different for each row
>> so I thought these names were more descriptive than REFLOCS (discussed
>> in thread). We have 2 values (start/end) instead of a single column of
>> IRanges() because we can't make an IRanges() with missing values.
>> Technically 'missing' ranges are represented by zero-width ranges but we
>> still need a position; there is no position because there was no overlap.
>>
>> 2) mapCoords(), pmapCoords()
>>
>> These functions are courtesy of Michael. mapCoords() maps ranges onto
>> another set of coordinates. You can map to cds-centric, exon-centric or
>> any other type of coordinate. See ?mapCoords in both GenomicRanges and
>> GenomicAlignments.
>>
>>
>> In the previous thread we discussed added cDNA locations to
>> predictCoding(). I've decided against this because it adds the
>> additional overhead of the exonsBy() extraction and a findOverlaps()
>> call. Not all users want the cDNA locations and those that do can now
>> easily get them with mapCoords().
>>
>> ## The usual predictCoding setup:
>> library(BSgenome.Hsapiens.UCSC.hg19)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> fl <- system.file("extdata", "chr22.vcf.gz",
>> package="VariantAnnotation")
>> vcf <- readVcf(fl, "hg19")
>> vcf <- renameSeqlevels(vcf, "chr22")
>> coding <- predictCoding(vcf, txdb, Hsapiens)
>>
>> ## Exon-centric or cDNA locations:
>> exonsbytx <- exonsBy(txdb, "tx")
>> cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx)
>> coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>>
>> Let me know if you run into problems or if the docs need more detail.
>>
>> Valerie
>>
>