[BioC] IRanges: Request for a "step" argument in runsum

Hervé Pagès hpages at fhcrc.org
Fri Jul 8 10:47:54 CEST 2011


Hi Michael,

On 11-06-29 03:29 PM, Michael Lawrence wrote:
>
>
> 2011/6/28 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Michael, Arnaud,
>
>
>     On 11-05-25 08:21 AM, Arnaud Amzallag wrote:
>
>         Thank you Michael,
>
>         for some email filters reasons I saw your reply only now.
>
>         I recon that in my case that would have been much smoother if
>         sum() would
>         call viewSums() by default and I agree that "It's more intuitive
>         to think of
>         an RleViews like an RleList rather than an IntegerList.". I
>         would support
>         that change.
>
>
>     I've recently made some changes in that direction in IRanges devel.
>     This subtle shift in the nature of an RleViews (from IntegerList
>     to RleList) has some deep consequences on the hierarchy of classes
>     in IRanges. The most important one is that the Views class doesn't
>     extend the IRanges class anymore. Views is now a direct subclass
>     of List. This means that you cannot directly manipulate a Views object
>     as if it was an IRanges object but you must extract its ranges (with
>     ranges()) first.
>
>
> This is pretty unfortunate. We lose the ability to e.g. store the
> coverage inside RangedData (because a ViewsList is a RangesList). Is it
> really necessary for what was proposed?

Because of the inheritance tree. When Views was a subclass of IRanges we
had:

   RleViews -> Views -> IRanges -> IntegerList -> List

but [[ on a Views object would not return an integer vector like it does
on an IRanges object (and the reason IRanges is a child of IntegerList
is precisely because of what [[ does on an IRanges object).

So a Views object should not be seen as an IntegerList object, but just
as a List object with an elementType that depends on the particular
type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on
RleViews can return an Rle without contradiction.

H.

>
>
>     This is a work-in-progress and a few things still need to be polished.
>
>     Also I agree that we should just have "min", "max", "sum", "mean" etc
>     methods for Views objects. No need for viewMins, viewMaxs, viewSums etc
>     I'll change this.
>
>     Cheers,
>     H.
>
>
>         Also it is possible that before I was summing the values of the
>         Rle and did
>         not notice the difference because my Rle was made of a lot of
>         very short Rle
>         lengths.
>
>         Arnaud
>
>         On Tue, May 10, 2011 at 8:44 AM, Michael
>         Lawrence<lawrence.michael at __gene.com
>         <mailto:lawrence.michael at gene.com>
>
>             wrote:
>
>
>             Good to hear that helped. One might expect sum() to simply
>             call viewSums(),
>             but the semantics are a bit strange here. The reason sum()
>             works on Views is
>             that a Views is a Ranges and thus an IntegerList (where each
>             range encodes a
>             sequence of integers). The weird thing is that the elements
>             of a Views are
>             not the sequence of integers covered but rather the values
>             in the Rle. That
>             everything works as you expected is just a coincidence of
>             dispatch.
>
>             For usability we should probably have max(), min(), and
>             sum() just use
>             viewMaxs, viewMins and viewSums. It's more intuitive to
>             think of an RleViews
>             like an RleList rather than an IntegerList.
>
>
>             On Sun, May 8, 2011 at 1:44 PM, Arnaud
>             Amzallag<arnaud.amzallag at __gmail.com
>             <mailto:arnaud.amzallag at gmail.com>
>
>                 wrote:
>
>
>                 Thank you Michael, the function viewSums was exactly
>                 what I needed !
>
>                 0.014 seconds for viewSums(Views(myrle, ir)) vs 54
>                 seconds for
>                 sum(Views(myrle, ir)) on chr22, one sample. I use this
>                 now instead of of
>                 runsum, no problem of memory, and probably even faster.
>                 for full the genome
>                 on many samples that will surely help. Maybe I should
>                 have read a bit more
>                 about the Views.
>
>                 About the result of runsum, I did see a lot of memory
>                 usage when I split
>                 the process with mclapply. The result is indeed a Rle.
>                 After looking closer,
>                 the resuting Rle has much more runs that the original
>                 one. That makes sense,
>                 because runsum is a kind of smoothing function, and the
>                 resulting signal has
>                 much more levels than the original one.
>
>                 Kind regards,
>
>                 Arnaud
>
>                 On May 6, 2011, at 10:42 PM, Michael Lawrence wrote:
>
>
>
>                 On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag<
>                 arnaud.amzallag at gmail.com
>                 <mailto:arnaud.amzallag at gmail.com>>  wrote:
>
>                     Dear IRanges developers,
>
>                     runsum is a very fast and convenient function to
>                     compute on Rle
>                     coverages, for instance. However when it is run on
>                     several chromosomes and
>                     several samples, it can get very memory intensive.
>                     For instance on human
>                     chromosome 1, it outputs a vector of length 250
>                     millions, so for several
>                     full genomes it is quickly billions of numbers in
>                     memory.
>
>
>                 I would have expected the result to be an Rle, which
>                 would be fairly
>                 memory efficient.
>
>
>                     However, often you don't need a single base
>                     resolution. I wanted to
>                     suggest, if it is possible, to add a parameter by
>                     which one could have the
>                     sliding window to slide by a user defined step,
>                     rather than always "step=1",
>                     as it is now. Such that runsum(myRle, k=1e4, step =
>                     1000) would return the
>                     equivalent of a wig file, for each 10 kilobases of
>                     the genome, without
>                     saturating the memory of the server.
>
>                     I tried with sum(Views(myRle, ir)), it is less
>                     memory intensive but it is
>                     much slower. So that amelioration would give the
>                     best of both worlds, fast
>                     and memory efficient.
>
>
>                 Have you tried viewSums(Views(myRle, ir))?
>
>
>                     kind regards,
>
>                     Arnaud Amzallag
>                     Research Fellow
>                     Mass general Cancer Center / Harvard Medical school
>                     _________________________________________________
>                     Bioconductor mailing list
>                     Bioconductor at r-project.org
>                     <mailto:Bioconductor at r-project.org>
>                     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>                     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>                     Search the archives:
>                     http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>                     <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
>
>                 [[alternative HTML version deleted]]
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list