[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

Tue Jul 29 05:30:44 CEST 2014

OK. We'll try it again. See 1.17.29.

It looks like I did not introduce this with ignore.strand. The bug was 
deeper and could be handled in either .orderElementsByTranscription() or 
the pair of .listCumsum* functions. I chose to add an arg to 
.orderElementsByTranscription() that controls whether negative strand 
ranges are ordered large-to-small (3' to 5') or small-to-large (5' to 3').

Valerie

On 07/28/2014 01:23 PM, Robert Castelo wrote:
> hi Valerie,
>
> i'm still not getting the right answer, i've looked at the unit test you
> added and i'm afraid it does not recreate the situation i described
> below, which involved one transcript with two exons, while the unit test
> forms two transcripts of one exon. so as it is now, doing:
>
>   mapCoords(gr, grl, ignore.strand=TRUE)
>  > map
> GRanges with 2 ranges and 2 metadata columns:
>        seqnames       ranges strand | queryHits subjectHits
>           <Rle>    <IRanges>  <Rle> | <integer>   <integer>
>    [1]     chrA [1797, 1797]      - |         1           2
>    [2]     chrA [1797, 1797]      - |         2           2
>
> you get the right answer, but doing:
>
> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE)
>  > map
> GRanges with 2 ranges and 2 metadata columns:
>        seqnames       ranges strand | queryHits subjectHits
>           <Rle>    <IRanges>  <Rle> | <integer>   <integer>
>    [1]     chrA [2036, 2036]      - |         1           1
>    [2]     chrA [2036, 2036]      - |         2           1
>
> you get the previous wrong answer. you'll find my sessionInfo() below.
>
> cheers,
> robert.
> ps: sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets methods
> base
>
> other attached packages:
>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13
> AnnotationDbi_1.27.9
>   [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31
>   [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28
> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2
> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3
> [16] colorout_1.0-2
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.7.9
> biomaRt_2.21.1
>   [5] bitops_1.0-6             brew_1.0-6 BSgenome_1.33.8
> checkmate_1.2
>   [9] codetools_0.2-8          DBI_0.2-7 digest_0.6.4             fail_1.2
> [13] foreach_1.4.2            GenomicAlignments_1.1.22
> iterators_1.0.7          Rcpp_0.11.2
> [17] RCurl_1.95-4.1           RSQLite_0.11.4 rtracklayer_1.25.13
> sendmailR_1.1-2
> [21] stats4_3.1.1             stringr_0.6.2 tools_3.1.1
> XML_3.98-1.1
> [25] zlibbioc_1.11.1
>
>
> On 7/28/14 1:25 PM, Valerie Obenchain wrote:
>> Hi Robert,
>  >
>  > Now fixed in GenomicRanges 1.17.28.
>  >
>  > I introduced this bug when trying to be too clever with
>  > ignore.strand. Thanks for catching it and, as always, providing a
>  > clear reproducible example.
>  >
>  > Valerie
>  >
>  > On 07/24/2014 08:24 AM, Robert Castelo wrote:
>  >> Dear Valerie,
>  >>
>  >> sorry for my long delay in reacting to your message below. This
>  >> addition to VariantAnnotation by Michael and you is great. I just
>  >> would like to make a comment and report what looks to me as a bug
>  >> related to the computation of cDNA positions, just in case you have
>  >> suggestions and comments, specially with respect to what at the
>  >> moment seems like a bug to me.
>  >>
>  >> the way in which mapCoords() works is by finding all possible
>  >> overlaps of each GRanges entry, corresponding to a variant. This is
>  >> strictly not necessary once you have gone through locateVariants()
>  >> and/or predictCoding() since these functions already find all
>  >> possible overlaps of each variant to different transcripts and
>  >> regions. So, to get the corresponding annotation of cDNA positions
>  >> I had to write the following function which is slightly more
>  >> elaborated from what you suggest below:
>  >>
>  >> ## this function assumes that locateVariants() or predictCoding()
>  >> ## have been called so that there is a TXID metadata column
>  >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in%
>  >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not
>  >> found in input GRanges object.")
>  >>
>  >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx,
>  >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <-
>  >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <-
>  >> gr[qolap] mcols(res) <- append(values(res),
>  >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ##
>  >> restrict results to cDNA positions of query TXID res <-
>  >> res[res$TXID == res$TXID2]
>  >>
>  >> start(res$cDNALOC) }
>  >>
>  >> i'm going to run it with a toy example of a variant that overlaps
>  >> a coding region of 7 transcripts in the positive strand and the
>  >> three UTR region of one transcript in the negative strand.
>  >>
>  >> library(VariantAnnotation)
>  >> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>  >>
>  >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
>  >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L)
>  >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr,
>  >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb)
>  >>
>  >> now let's look to the cDNA position of the transcript in the
>  >> negative strand (the last entry of the resulting GRanges object):
>  >>
>  >> GRanges with 1 range and 2 metadata columns: seqnames
>  >> ranges strand |      TXID   cDNALOC <Rle>            <IRanges>
>  >> <Rle> | <integer> <integer> [1] chr21 [43522349, 43522349]
>  >> - |     73335      2036
>  >>
>  >> and the exonic structure of this transcript
>  >>
>  >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges
>  >> with 2 ranges and 3 metadata columns: seqnames               ranges
>  >> strand |   exon_id   exon_name exon_rank <Rle>            <IRanges>
>  >> <Rle> | <integer> <character> <integer> [1]    chr21 [43528406,
>  >> 43528644]      - |    260152        <NA>     1 [2]    chr21
>  >> [43522244, 43524145]      - |    260151 <NA>     2
>  >>
>  >> from this structure one would expect that the reported cDNA
>  >> position would be:
>  >>
>  >> 43524145 - 43522349 + 1 = 1797
>  >>
>  >> instead of 2036.
>  >>
>  >> after investigating a bit where this value may come from, it seems
>  >> that while the strand is being taken into account for the relative
>  >> position within the exon, the widths of upstream exons is wrongly
>  >> selected (in this case there are no upstream exons):
>  >>
>  >> width(exonsbytx[["73335"]]) [1]  239 1902
>  >>
>  >> 239 + 1796 + 1 = 2036
>  >>
>  >>
>  >> so, my guess is that this must be something to be fixed in
>  >> mapCoords().
>  >>
>  >>
>  >> best regards!!
>  >>
>  >> robert.
>  >>
>  >>
>  >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
>  >>> Hi,
>  >>>
>  >>> Following up on this thread:
>  >>>
>  >>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>  >>>
>  >>>
>  >>>
> These changes are available in VariantAnnotation 1.11.15:
>>>>
>  >>> 1) LOCSTART, LOCEND
>  >>>
>  >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
>  >>> These are LOCATION-centric coordinates and can be different for
>  >>> each row so I thought these names were more descriptive than
>  >>> REFLOCS (discussed in thread). We have 2 values (start/end)
>  >>> instead of a single column of IRanges() because we can't make an
>  >>> IRanges() with missing values. Technically 'missing' ranges are
>  >>> represented by zero-width ranges but we still need a position;
>  >>> there is no position because there was no overlap.
>  >>>
>  >>> 2) mapCoords(), pmapCoords()
>  >>>
>  >>> These functions are courtesy of Michael. mapCoords() maps ranges
>  >>> onto another set of coordinates. You can map to cds-centric,
>  >>> exon-centric or any other type of coordinate. See ?mapCoords in
>  >>> both GenomicRanges and GenomicAlignments.
>  >>>
>  >>>
>  >>> In the previous thread we discussed added cDNA locations to
>  >>> predictCoding(). I've decided against this because it adds the
>  >>> additional overhead of the exonsBy() extraction and a
>  >>> findOverlaps() call. Not all users want the cDNA locations and
>  >>> those that do can now easily get them with mapCoords().
>  >>>
>  >>> ## The usual predictCoding setup:
>  >>> library(BSgenome.Hsapiens.UCSC.hg19)
>  >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <-
>  >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata",
>  >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl,
>  >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <-
>  >>> predictCoding(vcf, txdb, Hsapiens)
>  >>>
>  >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb,
>  >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))],
>  >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>  >>>
>  >>> Let me know if you run into problems or if the docs need more
>  >>> detail.
>  >>>
>  >>> Valerie
>  >>>
>  >>
>  >
>
>