[BioC] subset a RleList using GRanges object?

Hervé Pagès hpages at fhcrc.org
Thu Dec 5 02:18:31 CET 2013


Hi Malcolm,

On 12/04/2013 11:43 AM, Cook, Malcolm wrote:
> Janet, et al
>
> I believe there are problems with all approaches mentioned so far on this thread.
>
> When you index something, length of the result should equal the length of the index,

Not if the index is a logical vector or a Ranges object though.

> AND, the order of the elements of the result should correspond to the order of the indices.
>
> The approaches mentioned all depend upon coersion from GRanges to RangesList, but this does not preserve the order of the GRanges.

True but it's guaranteed to preserve the order within each
space/chromosome. I think this is good enough to make subsetting
a named list-like object (e.g. named RleList, named DNAStringSet,
etc...) by a GRanges object a reasonable thing to support.
The order of the seqlevels in the GRanges object doesn't matter
and won't affect the result of the subsetting because they're
matched to the names of the list-like object to subset.

Note that subsetting needs to behave like an "endomorphism" i.e.
the result must be of the same type as the original object. So
when subsetting an RleList by a GRanges index (like in Janet's
example), the result will always be an RleList object, even if
the index has more than 1 range per chromosome (in that case
the extracted ranges are concatenated together). I'm still not
convinced this is what Janet really needs. Like Michael mentioned
earlier, maybe a more useful approach is to create an RleViewsList
object with something like:

   vlist <- Views(rleList=z, rangesList=as(myRange, "RangesList"))
   viewMeans(vlist)

More below.

>
> Look:
>
>> example(GRangesList)
>> as(gr3,'RangesList')
>   IRangesList of length 2
>   $chr1
>   IRanges of length 1
>       start end width
>   [1]     1   3     3
>
>   $chr2
>   IRanges of length 1
>       start end width
>   [1]     4   9     6
>
> !> as(gr3[2:1],'RangesList')
>   IRangesList of length 2
>   $chr1
>   IRanges of length 1
>       start end width
>   [1]     1   3     3
>
>   $chr2
>   IRanges of length 1
>       start end width
>   [1]     4   9     6
>
> I think if you contrive for your GRanges to be in 'canonical order' (as defined by the RangesList coersion), you can proceed with this approach.  Look:
>
>> stopifnot(all.equal(as(sort(gr3[2:1]),'RangesList'),as(sort(gr3[1:2]),'RangesList')))
>
> Janet, you brought this up in
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/36247/focus=36263
>
> with more or less the same outcome.
>
> Here is excerpt from that exchange:
>
>>> An "order" method would be great. Note though that the coercion to
>>> RangesList only sorts by seqlevels, so we would need something that gave
>>> out that order vector. The point here is to avoid forcing the user to
>>> modify the order of the original GRanges.
>>>
>>
>> With order() the user will still be able to achieve this with something
>> like:
>>
>>   regionMeans <- function(regions, cvg)
>>   {
>>       seqlevels(regions) <- names(cvg)
>>       oo <- order(regions)
>>       regions <- regions[oo]
>>       ans <- unlist(
>>                viewMeans(
>>                  Views(cvg, as(regions, "RangesList"))
>>                ), use.names=FALSE)
>>       ans[oo] <- ans  # restore original order
>>       ans
>>   }
>>
>>   values(myRegions)$meanCov <- regionMeans(myRegions, myRleList)
>>
>> Tricky :-/

Very related indeed. Thanks for refreshing my memory. Note that this
discussion happened more than 2 years ago at a time when we didn't have
order() for GRanges yet. Now we have it so my regionMeans() proposal
above should work. FWIW in ?tileGenome (GenomicRanges package), I show
how to implement a similar function (binnedAverage) without the need
for ordering the GRanges object (it uses a split/unsplit aproach
instead). The design of binnedAverage() is also a little bit cleaner.

>
> Let's not forget this lesson!
>
> However, I agree, some syntactic sugar that allows indexing an RleList by GRanges with the proposed semantics would be great.
>
> In the mean time, I offer this function, which I found to be faster than Herve's offering above, and also provides a few more useful (to me) abstractions.
>
> YMMV:
>
>
> applyAt <- function(cvg # an RleList (quite possibly representing a genomic coverage)
>                      ,gr # a GRanges at each element of which you want to evaluate a function
>                      ,FUN=identity # by default, just return individual rleLists for each element of gr
>                      ,...          # additional options to FUN
>                      ,.MAPPLY=mapply #optionally parallize with mcmapply or your favorite variant
>                      ,USE.NAMES=TRUE
>                      ,SIMPLIFY=TRUE #'array'#TRUE
>                      ,MoreArgs=NULL # passed to .MAPPLY
>                      ,ignore.strand=TRUE # else, reverse the coverages
>                      ) {
>    revif<-function(s,x) {
>      if(ignore.strand==TRUE) {x}
>      else if (s=='+')  {x}
>      else {rev(x)}
>    }
>    ans<-.MAPPLY(function(a,b,c,s) {
>        FUN(revif(s,cvg[[a]][IRanges(b,c),drop=FALSE]),...)
>    }
>                 ,as.vector(seqnames(gr)),start(gr),end(gr),as.vector(strand(gr))
>                 ,MoreArgs=MoreArgs
>                 ,SIMPLIFY=SIMPLIFY
>                 ,USE.NAMES=FALSE         #USE.NAMES
>                 )
>    if (!USE.NAMES) {}
>    else if (is.null(dim(ans))) {names(ans)<-names(gr)}
>    else if (SIMPLIFY=='array') {colnames(ans)<-names(gr)}
>    else  rownames(ans)<-names(gr)
>    ans
> }
>
> Finally, what I think we REALLY want is a version of the above which currys a FUN (possibly `identity`) down to rle levels that can be applied to a bigwig file without having to load the whole thing into memory.
>
> In the mean time, I offer:
>
> applyAt.bw<-function(x,y,...) {
>      applyAt(import.bw(x,asRle=TRUE
>                        ,selection=BigWigSelection(y)
>                        )
>              ,y
>              ,...)
> }

Interesting. It sounds to me like maybe one way to facilitate this
(and also avoid the proliferation of apply-like functions) would be
to finally bite the bullet and come up with a new kind of view objects
where the views are *genomic* ranges instead of just ranges (like they
are right now). Something that was already mentioned in this
more-than-2-year-old discussion. We could start with RleListGViews
and BigWigFileGViews objects, both concrete subclasses of virtual
class GViews.

Thanks for you feedback,
H.

>
> ~Malcolm
>
>
>   >-----Original Message-----
>   >From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Janet Young
>   >Sent: Wednesday, December 04, 2013 12:59 PM
>   >To: Hervé Pagès
>   >Cc: Michael Lawrence; bioconductor at r-project.org
>   >Subject: Re: [BioC] subset a RleList using GRanges object?
>   >
>   >Thanks, all.  Yes, that coercion will do it for me:   z[as(myRange, "RangesList")].   It's always nice for the end user when you take care of
>   >those kinds of things behind the scenes for us - sounds like what you're considering implementing might take care of it.
>   >
>   >Robert: thanks for the ideas!  But my needs are a little different, though - I'm keeping the individual values in the region I'm interested
>   >in (I'm plotting RNA-seq coverage on just a single gene, but my coverage object is for the whole genome).
>   >
>   >Janet
>   >
>   >
>   >
>   >On Dec 4, 2013, at 10:09 AM, Hervé Pagès wrote:
>   >
>   >> Hi Michael,
>   >>
>   >> On 12/03/2013 07:13 PM, Michael Lawrence wrote:
>   >>> For now, just coerce to a RangesList.
>   >>>
>   >>> z [ as(myRange, "RangesList") ]
>   >>>
>   >>> Herve might consider adding a [,List,GenomicRanges method... kind of a
>   >>> logical leap though from seqnames to list element names.
>   >>
>   >> This is something I've been considering for a while. I think we should
>   >> do it. The mapping between the seqnames of a GRanges and the names of a
>   >> RangesList is well established at this point e.g. the coercion (back and
>   >> forth) between the two already does that.
>   >>
>   >> Cheers,
>   >> H.
>   >>
>   >>>
>   >>> extractCoverageForPositions is different; it extracts an integer vector
>   >>> given a vector of width-one positions.
>   >>>
>   >>>
>   >>>
>   >>>
>   >>>
>   >>>
>   >>> On Tue, Dec 3, 2013 at 4:53 PM, Janet Young <jayoung at fhcrc.org> wrote:
>   >>>
>   >>>> Hi there,
>   >>>>
>   >>>> I'm playing around with coverage data generated outside of R, planning to
>   >>>> plot RNA-seq coverage for some genes we're interested in.
>   >>>>
>   >>>> I have a request - it'd be nice from my point of view if it were possible
>   >>>> to look at a small region an RleList (or CompressedRleList) using a GRanges
>   >>>> object to focus on that smaller region.  Looks like I can subset a single
>   >>>> Rle object with an IRanges object, but I wonder if this nice feature could
>   >>>> be extended to GRanges objects?
>   >>>>
>   >>>> Some code is below to show what I'm trying to do.
>   >>>>
>   >>>> thanks veruy much,
>   >>>>
>   >>>> Janet
>   >>>>
>   >>>>
>   >>>>
>   >>>> library(GenomicRanges)
>   >>>>
>   >>>> ## an example RleList
>   >>>> x <- Rle(10:1, 1:10)
>   >>>> y <- Rle(10:1, 1:10)
>   >>>> z <- RleList( chr1=x, chr2=y)
>   >>>>
>   >>>> ## an example GRanges object
>   >>>> myRange <- GRanges( seqnames="chr1", ranges=IRanges(start=10,end=15) )
>   >>>>
>   >>>> ## subsetting an Rle using an IRanges object works, as expected:
>   >>>> z[["chr1"]] [ ranges(myRange) ]
>   >>>>
>   >>>> ## but subsetting an RleList by GRanges object doesn't work
>   >>>> z [ myRange ]
>   >>>> # Error in normalizeSingleBracketSubscript(i, x) : invalid subscript type
>   >>>>
>   >>>> sessionInfo()
>   >>>>
>   >>>> R Under development (unstable) (2013-11-06 r64163)
>   >>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>   >>>>
>   >>>> locale:
>   >>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   >>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   >>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   >>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   >>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>   >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>   >>>>
>   >>>> attached base packages:
>   >>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>   >>>> [8] base
>   >>>>
>   >>>> other attached packages:
>   >>>> [1] GenomicRanges_1.15.10 XVector_0.3.2         IRanges_1.21.13
>   >>>> [4] BiocGenerics_0.9.1
>   >>>>
>   >>>> loaded via a namespace (and not attached):
>   >>>> [1] stats4_3.1.0
>   >>>>
>   >>>>
>   >>>>
>   >>>>
>   >>>> -------------------------------------------------------------------
>   >>>>
>   >>>> Dr. Janet Young
>   >>>>
>   >>>> Malik lab
>   >>>> http://research.fhcrc.org/malik/en.html
>   >>>>
>   >>>> Fred Hutchinson Cancer Research Center
>   >>>> 1100 Fairview Avenue N., A2-025,
>   >>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>   >>>>
>   >>>> tel: (206) 667 4512
>   >>>> email: jayoung  ...at...  fhcrc.org
>   >>>>
>   >>>> -------------------------------------------------------------------
>   >>>>
>   >>>> _______________________________________________
>   >>>> Bioconductor mailing list
>   >>>> Bioconductor at r-project.org
>   >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>   >>>> Search the archives:
>   >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>   >>>>
>   >>>
>   >>> 	[[alternative HTML version deleted]]
>   >>>
>   >>> _______________________________________________
>   >>> Bioconductor mailing list
>   >>> Bioconductor at r-project.org
>   >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>   >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>   >>>
>   >>
>   >> --
>   >> Hervé Pagès
>   >>
>   >> Program in Computational Biology
>   >> Division of Public Health Sciences
>   >> Fred Hutchinson Cancer Research Center
>   >> 1100 Fairview Ave. N, M1-B514
>   >> P.O. Box 19024
>   >> Seattle, WA 98109-1024
>   >>
>   >> E-mail: hpages at fhcrc.org
>   >> Phone:  (206) 667-5791
>   >> Fax:    (206) 667-1319
>   >
>   >_______________________________________________
>   >Bioconductor mailing list
>   >Bioconductor at r-project.org
>   >https://stat.ethz.ch/mailman/listinfo/bioconductor
>   >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list