[BioC] IRanges: Request for a "step" argument in runsum

Hervé Pagès hpages at fhcrc.org
Wed Jun 29 01:06:13 CEST 2011


Hi Michael, Arnaud,

On 11-05-25 08:21 AM, Arnaud Amzallag wrote:
> Thank you Michael,
>
> for some email filters reasons I saw your reply only now.
>
> I recon that in my case that would have been much smoother if sum() would
> call viewSums() by default and I agree that "It's more intuitive to think of
> an RleViews like an RleList rather than an IntegerList.". I would support
> that change.

I've recently made some changes in that direction in IRanges devel.
This subtle shift in the nature of an RleViews (from IntegerList
to RleList) has some deep consequences on the hierarchy of classes
in IRanges. The most important one is that the Views class doesn't
extend the IRanges class anymore. Views is now a direct subclass
of List. This means that you cannot directly manipulate a Views object
as if it was an IRanges object but you must extract its ranges (with
ranges()) first.

This is a work-in-progress and a few things still need to be polished.

Also I agree that we should just have "min", "max", "sum", "mean" etc
methods for Views objects. No need for viewMins, viewMaxs, viewSums etc
I'll change this.

Cheers,
H.

>
> Also it is possible that before I was summing the values of the Rle and did
> not notice the difference because my Rle was made of a lot of very short Rle
> lengths.
>
> Arnaud
>
> On Tue, May 10, 2011 at 8:44 AM, Michael Lawrence<lawrence.michael at gene.com
>> wrote:
>
>> Good to hear that helped. One might expect sum() to simply call viewSums(),
>> but the semantics are a bit strange here. The reason sum() works on Views is
>> that a Views is a Ranges and thus an IntegerList (where each range encodes a
>> sequence of integers). The weird thing is that the elements of a Views are
>> not the sequence of integers covered but rather the values in the Rle. That
>> everything works as you expected is just a coincidence of dispatch.
>>
>> For usability we should probably have max(), min(), and sum() just use
>> viewMaxs, viewMins and viewSums. It's more intuitive to think of an RleViews
>> like an RleList rather than an IntegerList.
>>
>>
>> On Sun, May 8, 2011 at 1:44 PM, Arnaud Amzallag<arnaud.amzallag at gmail.com
>>> wrote:
>>
>>> Thank you Michael, the function viewSums was exactly what I needed !
>>>
>>> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 seconds for
>>> sum(Views(myrle, ir)) on chr22, one sample. I use this now instead of of
>>> runsum, no problem of memory, and probably even faster. for full the genome
>>> on many samples that will surely help. Maybe I should have read a bit more
>>> about the Views.
>>>
>>> About the result of runsum, I did see a lot of memory usage when I split
>>> the process with mclapply. The result is indeed a Rle. After looking closer,
>>> the resuting Rle has much more runs that the original one. That makes sense,
>>> because runsum is a kind of smoothing function, and the resulting signal has
>>> much more levels than the original one.
>>>
>>> Kind regards,
>>>
>>> Arnaud
>>>
>>> On May 6, 2011, at 10:42 PM, Michael Lawrence wrote:
>>>
>>>
>>>
>>> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag<
>>> arnaud.amzallag at gmail.com>  wrote:
>>>
>>>> Dear IRanges developers,
>>>>
>>>> runsum is a very fast and convenient function to compute on Rle
>>>> coverages, for instance. However when it is run on several chromosomes and
>>>> several samples, it can get very memory intensive. For instance on human
>>>> chromosome 1, it outputs a vector of length 250 millions, so for several
>>>> full genomes it is quickly billions of numbers in memory.
>>>>
>>>>
>>> I would have expected the result to be an Rle, which would be fairly
>>> memory efficient.
>>>
>>>
>>>> However, often you don't need a single base resolution. I wanted to
>>>> suggest, if it is possible, to add a parameter by which one could have the
>>>> sliding window to slide by a user defined step, rather than always "step=1",
>>>> as it is now. Such that runsum(myRle, k=1e4, step = 1000) would return the
>>>> equivalent of a wig file, for each 10 kilobases of the genome, without
>>>> saturating the memory of the server.
>>>>
>>>> I tried with sum(Views(myRle, ir)), it is less memory intensive but it is
>>>> much slower. So that amelioration would give the best of both worlds, fast
>>>> and memory efficient.
>>>>
>>>>
>>> Have you tried viewSums(Views(myRle, ir))?
>>>
>>>
>>>> kind regards,
>>>>
>>>> Arnaud Amzallag
>>>> Research Fellow
>>>> Mass general Cancer Center / Harvard Medical school
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list