[BioC] matrix like object with Rle columns

Wed Jun 27 21:30:57 CEST 2012

Hi guys,

Note that some of the things in the "matrix API" seem to work on
standard data frames:

 > df <- data.frame(aa=1:5, bb=100)
 > rowSums(df)
[1] 101 102 103 104 105
 > colSums(df)
  aa  bb
  15 500
 > max(df)
[1] 100
 > min(df)
[1] 1
 > range(df)
[1]   1 100
 > df + df
   aa  bb
1  2 200
2  4 200
3  6 200
4  8 200
5 10 200
 > df <= 3
         aa    bb
[1,]  TRUE FALSE
[2,]  TRUE FALSE
[3,]  TRUE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE

etc...

But none of them work on DataFrame. Maybe if they did we wouldn't need
RleMatrix? Using DataFrame instead of RleMatrix would be nice because it
reuses what we already have. It would also avoid the pitfall of the
length of an RleMatrix not being representable as a 32-bit int
when, let's say, the number of rows is 800M and there are only a few
columns (like in Kasper's use case). No need to wait for Luke's
"big vector" hack.
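
To make the list-of-columns idea concrete, here is a minimal sketch. It
uses base R's rle() rather than the IRanges Rle class so that it runs
without Bioconductor. The helper names rleColSums() and rleRowSums() and
the 3-column figure are made up for illustration; this is not how
DataFrame or IRanges implement anything.

```r
## Sketch only: rleColSums() and rleRowSums() are hypothetical helpers,
## not functions from IRanges or base R.

## Length limit: assuming 800M rows and 3 columns (3 is a made-up figure),
## one long vector overflows a 32-bit int but each column does not.
nrows <- 800e6
ncols <- 3
single_vector_fits <- nrows * ncols <= .Machine$integer.max  # whole matrix
each_column_fits   <- nrows <= .Machine$integer.max          # one column

## Toy "matrix": a list of run-length encoded columns, 10 rows each.
cols <- list(
  aa = rle(c(0, 0, 0, 5, 5, 5, 5, 0, 0, 0)),
  bb = rle(c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3))
)

## Column sums only need the runs: sum(value * run length).
rleColSums <- function(cols) {
  vapply(cols, function(x) sum(x$values * x$lengths), numeric(1))
}

## Row sums: cut at the union of all run ends, then add up the column
## values inside each merged run. The result is itself an "rle" object
## (adjacent runs with equal sums are not merged).
rleRowSums <- function(cols) {
  ends   <- sort(unique(unlist(lapply(cols, function(x) cumsum(x$lengths)))))
  starts <- c(1L, head(ends, -1L) + 1L)
  vals <- vapply(cols, function(x) {
    ## index of the run of x that contains each start position
    x$values[findInterval(starts - 1L, cumsum(x$lengths)) + 1L]
  }, numeric(length(starts)))
  total <- rowSums(matrix(vals, nrow = length(starts)))
  structure(list(lengths = diff(c(0L, ends)), values = total), class = "rle")
}

col_totals <- rleColSums(cols)
row_totals <- rleRowSums(cols)
```

On a small example, inverse.rle(row_totals) expands the result back to a
dense vector, which can be compared against rowSums() on the equivalent
ordinary matrix.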

Cheers,
H.

On 06/27/2012 10:46 AM, Jeff Leek wrote:
> I would love/use all the time this feature if it existed.
>
> Jeff
>
> On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>
>>> One comment:  since matrix is a vector with a dim attribute I see that
>>> the natural parallel is doing the same for Rle.
>>
>>
>> Right, in the original plan, the Array class would bring the dim attribute,
>> and RleMatrix would contain both Matrix and Rle.
>>
>>
>>>   Nevertheless, that
>>> would put an upper limit on the number of runLengths in the entire
>>> matrix.  My impression (which could be wrong) is that we would need to
>>> implement essentially all matrix-like numeric operations from scratch
>>> anyway, so it may be worthwhile to consider using a list of Rle's
>>> where each Rle is a column, instead of a single Rle to represent all
>>> columns.  Clearly that depends on implementation details, but if we
>>> really need to do everything from scratch, a list of columns might be
>>> more flexible (and perhaps even easier to code).
>>>
>>>
>> This would make it harder to treat RleMatrix as an Rle (which is a nice
>> feature of base R matrices). If the problem is the vector length limit,
>> then I'd rather wait for Luke's fix, which apparently is coming along.
>>
>>> Kasper
>>>
>>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>> Seems like it could be a nice thing to have. Presumably one would create an
>>>> Array subclass of Vector that would add a "dim" attribute. Then Matrix could
>>>> extend that to constrain dim to length two (unfortunately colliding with the
>>>> Matrix class in the Matrix package). Then RleMatrix extends Matrix to
>>>> implement the actual data storage and many of the accelerated methods. As
>>>> you said, row-oriented methods would be tough.
>>>>
>>>> Any takers?
>>>>
>>>> Michael
>>>>
>>>> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen
>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>> On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence
>>>>>> <lawrence.michael at gene.com> wrote:
>>>>>>> Patrick and I had talked about this a long time ago (essentially putting a
>>>>>>> "dim" attribute on an Rle), but the closest thing today is a DataFrame with
>>>>>>> Rle columns.
>>>>>>>
>>>>>>> Use case?
>>>>>>
>>>>>> Say I have whole-genome data (for example coverage)  on multiple
>>>>>> samples.  Usually, this is far easier to think of as a matrix (in my
>>>>>> opinion) with ~3B rows and I often want to do rowSums(), colSums() etc
>>>>>> (in fact, probably the whole API from matrixStats).  This is
>>>>>> especially nice when you have multiple coverage-like tracks on each
>>>>>> sample, so you could have
>>>>>>   trackA : genome by samples
>>>>>>   trackB : genome by samples
>>>>>>   ...
>>>>>>
>>>>>> You could think of this as a SummarizedExperiment, but with
>>>>>> _extremely_ big matrices in the assay slot.
>>>>>>
>>>>>> I want to take advantage of the Rle structure to store the data more
>>>>>> efficiently and also to do potentially faster computations.
>>>>>>
>>>>>> This is actually closer to my use case where I currently use matrices
>>>>>> with ~30M rows (which works fine), but I would like to expand to ~800M
>>>>>> rows (which would suck a bit).
>>>>>>
>>>>>> You could also think of a matrix-like object with Rle columns as an
>>>>>> alternative sparse matrix structure.  In a typical sparse matrix you
>>>>>> only store the non-zero entities, here we only store the
>>>>>> change-points.  Depending on the structure of the matrix this could be
>>>>>> an efficient storage of an otherwise dense matrix.
>>>>>>
>>>>>> So essentially, what I want, is to have mathematical operations on
>>>>>> this object, where I would utilize that I know that all entities are
>>>>>> numbers so the typical matrix operations make sense.
>>>>>>
>>>>>> [ side question which could be relevant in this discussion: for a
>>>>>> numeric Rle is there some notion of precision - say I have truly
>>>>>> numeric values with tons of digits, and I want to consider two numbers
>>>>>> part of the same run if |x1 -x2|<epsilon? ]
>>>>>
>>>>> You can see that Pete has had similar thoughts in
>>>>> genoset/R/DataFrame-methods.R, although he only has colMeans (which is
>>>>> the easy one).
>>>>>
>>>>> Kasper
>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>>>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Do we have a matrix-like object, but where the columns are Rle's?
>>>>>>>>
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319