[BioC] IRanges: Request for a "step" argument in runsum

Hervé Pagès hpages at fhcrc.org
Fri Jul 8 21:11:58 CEST 2011


On 11-07-08 09:58 AM, Michael Lawrence wrote:
>
>
> 2011/7/8 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Michael,
>
>     On 11-06-29 03:29 PM, Michael Lawrence wrote:
>
>
>
>         2011/6/28 Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>>>
>
>
>             Hi Michael, Arnaud,
>
>
>             On 11-05-25 08:21 AM, Arnaud Amzallag wrote:
>
>                 Thank you Michael,
>
>                 for some email filters reasons I saw your reply only now.
>
>                 I recon that in my case that would have been much
>         smoother if
>                 sum() would
>                 call viewSums() by default and I agree that "It's more
>         intuitive
>                 to think of
>                 an RleViews like an RleList rather than an IntegerList.". I
>                 would support
>                 that change.
>
>
>             I've recently made some changes in that direction in IRanges
>         devel.
>             This subtle shift in the nature of an RleViews (from IntegerList
>             to RleList) has some deep consequences on the hierarchy of
>         classes
>             in IRanges. The most important one is that the Views class
>         doesn't
>             extend the IRanges class anymore. Views is now a direct subclass
>             of List. This means that you cannot directly manipulate a
>         Views object
>             as if it was an IRanges object but you must extract its
>         ranges (with
>             ranges()) first.
>
>
>         This is pretty unfortunate. We lose the ability to e.g. store the
>         coverage inside RangedData (because a ViewsList is a
>         RangesList). Is it
>         really necessary for what was proposed?
>
>
>     Because of the inheritance tree. When Views was a subclass of IRanges we
>     had:
>
>       RleViews -> Views -> IRanges -> IntegerList -> List
>
>     but [[ on a Views object would not return an integer vector like it does
>     on an IRanges object (and the reason IRanges is a child of IntegerList
>     is precisely because of what [[ does on an IRanges object).
>
>     So a Views object should not be seen as an IntegerList object, but just
>     as a List object with an elementType that depends on the particular
>     type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on
>     RleViews can return an Rle without contradiction.
>
>
> What about breaking the IRanges -> IntegerList inheritance?
> Conceptually, it's nice to think of Ranges as being an IntegerList, but
> practically it's much nicer having a Views be a Ranges. It's convenient
> to think of a Views being a set of ranges with a subject vector, and
> being able to treat it transparently as such was a really nice feature
> of IRanges. This change will break a lot of code.

Most of the times we will still be able to treat a Views object as an
IRanges object because I re-implemented most of the methods in the
IRanges interface to make them work on Views. It's a very light kind
of re-implementation that basically delegates to the method for
IRanges e.g.:

   setMethod("width", "Views", function(x) width(ranges(x)))

I did this for a small subset of methods in the IRanges interface and
so far it didn't seem to break anything (except the chipseq package,
but it's easy to fix, will do).

I found the price to pay for having a cleaner class hierarchy
acceptable.

>
> I could see someone wanting to implement an IntegerList with a Ranges
> (when dealing with integer sequences), but what sort of practical use
> cases are there for treating a Ranges as an IntegerList?

For the sake of clarity I simplified the inheritance diagram above
but IRanges doesn't inherit directly from IntegerList, it does so via
the Ranges class:

   IRanges -> Ranges -> IntegerList -> List

Problem is there are other classes like PartitioningByEnd that inherit
from Ranges that also benefit from being seen (and treated) as
IntegerList objects (because of what [[ does on them).

>
> I admit it's not so nice to modify Ranges to accommodate Views. Can't we
> just live with the contradiction? At least as far as this thread is
> concerned, add a smarter sum() on Views was all that was needed.

I actually started to modify Views before I saw this thread. Here is the
original motivation:

hpages at latitude:~$ svn log -r 55988 
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges
------------------------------------------------------------------------
r55988 | hpages at fhcrc.org | 2011-06-04 00:20:43 -0700 (Sat, 04 Jun 2011) 
| 10 lines

Views is now a direct subclass of List and not a direct subclass of IRanges
anymore (the ranges of the views are now stored in a new "ranges" slot 
which is
of type IRanges). Before this change we had: RleViews -> Views -> IRanges ->
Ranges -> IntegerList -> AtomicList -> List. So any RleViews object (e.g.
'character'-RleViews) was regarded as an IntegerList object which was
conceptually wrong (and leaded to some schizophrenic decisions about 
what the
elementType of the various Views objects should be). After this change 
we just
have: RleViews -> Views -> List so the problem is gone.
This change should be mostly transparent to the end user.

------------------------------------------------------------------------

H.

>
> Michael
>
>     H.
>
>
>
>             This is a work-in-progress and a few things still need to be
>         polished.
>
>             Also I agree that we should just have "min", "max", "sum",
>         "mean" etc
>             methods for Views objects. No need for viewMins, viewMaxs,
>         viewSums etc
>             I'll change this.
>
>             Cheers,
>             H.
>
>
>                 Also it is possible that before I was summing the values
>         of the
>                 Rle and did
>                 not notice the difference because my Rle was made of a
>         lot of
>                 very short Rle
>                 lengths.
>
>                 Arnaud
>
>                 On Tue, May 10, 2011 at 8:44 AM, Michael
>                 Lawrence<lawrence.michael at __ge__ne.com <http://gene.com>
>         <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>
>
>
>                     wrote:
>
>
>                     Good to hear that helped. One might expect sum() to
>         simply
>                     call viewSums(),
>                     but the semantics are a bit strange here. The reason
>         sum()
>                     works on Views is
>                     that a Views is a Ranges and thus an IntegerList
>         (where each
>                     range encodes a
>                     sequence of integers). The weird thing is that the
>         elements
>                     of a Views are
>                     not the sequence of integers covered but rather the
>         values
>                     in the Rle. That
>                     everything works as you expected is just a
>         coincidence of
>                     dispatch.
>
>                     For usability we should probably have max(), min(), and
>                     sum() just use
>                     viewMaxs, viewMins and viewSums. It's more intuitive to
>                     think of an RleViews
>                     like an RleList rather than an IntegerList.
>
>
>                     On Sun, May 8, 2011 at 1:44 PM, Arnaud
>                     Amzallag<arnaud.amzallag at __gma__il.com
>         <http://gmail.com>
>         <mailto:arnaud.amzallag at gmail.__com
>         <mailto:arnaud.amzallag at gmail.com>>
>
>
>                         wrote:
>
>
>                         Thank you Michael, the function viewSums was exactly
>                         what I needed !
>
>                         0.014 seconds for viewSums(Views(myrle, ir)) vs 54
>                         seconds for
>                         sum(Views(myrle, ir)) on chr22, one sample. I
>         use this
>                         now instead of of
>                         runsum, no problem of memory, and probably even
>         faster.
>                         for full the genome
>                         on many samples that will surely help. Maybe I
>         should
>                         have read a bit more
>                         about the Views.
>
>                         About the result of runsum, I did see a lot of
>         memory
>                         usage when I split
>                         the process with mclapply. The result is indeed
>         a Rle.
>                         After looking closer,
>                         the resuting Rle has much more runs that the
>         original
>                         one. That makes sense,
>                         because runsum is a kind of smoothing function,
>         and the
>                         resulting signal has
>                         much more levels than the original one.
>
>                         Kind regards,
>
>                         Arnaud
>
>                         On May 6, 2011, at 10:42 PM, Michael Lawrence wrote:
>
>
>
>                         On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag<
>         arnaud.amzallag at gmail.com <mailto:arnaud.amzallag at gmail.com>
>         <mailto:arnaud.amzallag at gmail.__com
>         <mailto:arnaud.amzallag at gmail.com>>>  wrote:
>
>
>                             Dear IRanges developers,
>
>                             runsum is a very fast and convenient function to
>                             compute on Rle
>                             coverages, for instance. However when it is
>         run on
>                             several chromosomes and
>                             several samples, it can get very memory
>         intensive.
>                             For instance on human
>                             chromosome 1, it outputs a vector of length 250
>                             millions, so for several
>                             full genomes it is quickly billions of
>         numbers in
>                             memory.
>
>
>                         I would have expected the result to be an Rle, which
>                         would be fairly
>                         memory efficient.
>
>
>                             However, often you don't need a single base
>                             resolution. I wanted to
>                             suggest, if it is possible, to add a
>         parameter by
>                             which one could have the
>                             sliding window to slide by a user defined step,
>                             rather than always "step=1",
>                             as it is now. Such that runsum(myRle, k=1e4,
>         step =
>                             1000) would return the
>                             equivalent of a wig file, for each 10
>         kilobases of
>                             the genome, without
>                             saturating the memory of the server.
>
>                             I tried with sum(Views(myRle, ir)), it is less
>                             memory intensive but it is
>                             much slower. So that amelioration would give the
>                             best of both worlds, fast
>                             and memory efficient.
>
>
>                         Have you tried viewSums(Views(myRle, ir))?
>
>
>                             kind regards,
>
>                             Arnaud Amzallag
>                             Research Fellow
>                             Mass general Cancer Center / Harvard Medical
>         school
>
>           ___________________________________________________
>                             Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                             Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>
>
>                         [[alternative HTML version deleted]]
>
>
>                 ___________________________________________________
>                 Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                 Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>             --
>             Hervé Pagès
>
>             Program in Computational Biology
>             Division of Public Health Sciences
>             Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N, M1-B514
>             P.O. Box 19024
>             Seattle, WA 98109-1024
>
>             E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>             Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>             Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list