[BioC] matrix like object with Rle columns

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Wed Jun 27 21:54:04 CEST 2012


On Wed, Jun 27, 2012 at 3:30 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi guys,
>
> Note that some of the things in the "matrix API" seem to work on
> standard data frames:
>
>> df <- data.frame(aa=1:5, bb=100)
>> rowSums(df)
> [1] 101 102 103 104 105
>> colSums(df)
>  aa  bb
>  15 500
>> max(df)
> [1] 100
>> min(df)
> [1] 1
>> range(df)
> [1]   1 100
>> df + df
>   aa  bb
> 1  2 200
> 2  4 200
> 3  6 200
> 4  8 200
> 5 10 200
>> df <= 3
>        aa    bb
> [1,]  TRUE FALSE
> [2,]  TRUE FALSE
> [3,]  TRUE FALSE
> [4,] FALSE FALSE
> [5,] FALSE FALSE
>
> etc...
>
> But none of them work on DataFrame. Maybe if they did, we wouldn't need
> RleMatrix? Using DataFrame instead of RleMatrix would be nice because it
> reuses what we already have. It would also avoid the pitfall of the
> length of an RleMatrix not being representable as a 32-bit int
> when, say, the number of rows is 800M and there are a few columns
> (like in Kasper's use case). No need to wait for Luke's "big vector"
> hack.

This is totally fine with me, as long as coercion from Rle to a normal
vector is avoided.

But it might make sense to have a derivative class ensuring that all
columns are numeric in nature.
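
As a minimal sketch of what avoiding coercion could look like: the
example below uses base R's rle() as a stand-in for Rle columns (an
assumption, so that it runs without any Bioconductor package), and
rle_col_sums is a hypothetical helper name, not an existing function.
Each column sum is computed from the runs alone.

```r
# Columns stored as run-length encodings; base R rle() stands in for Rle here.
cols <- list(
  aa = rle(c(0, 0, 0, 5, 5, 2)),
  bb = rle(rep(100, 6))
)

# Hypothetical helper: column sums computed from the runs only,
# without expanding any column back to a full-length vector.
rle_col_sums <- function(cols) {
  vapply(cols, function(x) sum(x$values * x$lengths), numeric(1))
}

rle_col_sums(cols)
```

Row-oriented operations such as rowSums() are harder, since the run
boundaries differ between columns.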

Kasper

>
> Cheers,
> H.
>
>
> On 06/27/2012 10:46 AM, Jeff Leek wrote:
>>
>> I would love this feature, and use it all the time, if it existed.
>>
>> Jeff
>>
>> On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>>
>>> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> One comment:  since matrix is a vector with a dim attribute I see that
>>>> the natural parallel is doing the same for Rle.
>>>
>>>
>>>
>>> Right, in the original plan, the Array class would bring the dim
>>> attribute, and RleMatrix would contain both Matrix and Rle.
>>>
>>>
>>>>  Nevertheless, that
>>>> would put an upper limit on the number of runLengths in the entire
>>>> matrix.  My impression (which could be wrong) is that we would need to
>>>> implement essentially all matrix-like numeric operations from scratch
>>>> anyway, so it may be worthwhile to consider using a list of Rle's
>>>> where each Rle is a column, instead of a single Rle to represent all
>>>> columns.  Clearly that depends on implementation details, but if we
>>>> really need to do everything from scratch, a list of columns might be
>>>> more flexible (and perhaps even easier to code).
>>>>
>>>>
>>> This would make it harder to treat RleMatrix as an Rle (which is a nice
>>> feature of base R matrices). If the problem is the vector length limit,
>>> then I'd rather wait for Luke's fix, which apparently is coming along.
>>>
>>>> Kasper
>>>>
>>>>
>>>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>>
>>>>> Seems like it could be a nice thing to have. Presumably one would
>>>>> create an Array subclass of Vector that would add a "dim" attribute.
>>>>> Then Matrix could extend that to constrain dim to length two
>>>>> (unfortunately colliding with the Matrix class in the Matrix package).
>>>>> Then RleMatrix extends Matrix to implement the actual data storage and
>>>>> many of the accelerated methods. As you said, row-oriented methods
>>>>> would be tough.
>>>>>
>>>>> Any takers?
>>>>>
>>>>> Michael
>>>>>
>>>>> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen
>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen
>>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>>
>>>>>>> On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence
>>>>>>> <lawrence.michael at gene.com> wrote:
>>>>>>>>
>>>>>>>> Patrick and I had talked about this a long time ago (essentially
>>>>>>>> putting a "dim" attribute on an Rle), but the closest thing today is
>>>>>>>> a DataFrame with Rle columns.
>>>>>>>>
>>>>>>>> Use case?
>>>>>>>
>>>>>>>
>>>>>>> Say I have whole-genome data (for example coverage) on multiple
>>>>>>> samples.  Usually, this is far easier to think of as a matrix (in my
>>>>>>> opinion) with ~3B rows and I often want to do rowSums(), colSums(),
>>>>>>> etc. (in fact, probably the whole API from matrixStats).  This is
>>>>>>> especially nice when you have multiple coverage-like tracks on each
>>>>>>> sample, so you could have
>>>>>>>  trackA : genome by samples
>>>>>>>  trackB : genome by samples
>>>>>>>  ...
>>>>>>>
>>>>>>> You could think of this as a SummarizedExperiment, but with
>>>>>>> _extremely_ big matrices in the assay slot.
>>>>>>>
>>>>>>> I want to take advantage of the Rle structure to store the data more
>>>>>>> efficiently and also to do potentially faster computations.
>>>>>>>
>>>>>>> This is actually closer to my use case where I currently use matrices
>>>>>>> with ~30M rows (which works fine), but I would like to expand to
>>>>>>> ~800M rows (which would suck a bit).
>>>>>>>
>>>>>>> You could also think of a matrix-like object with Rle columns as an
>>>>>>> alternative sparse matrix structure.  In a typical sparse matrix you
>>>>>>> only store the non-zero entries; here we only store the
>>>>>>> change-points.  Depending on the structure of the matrix this could
>>>>>>> be an efficient storage of an otherwise dense matrix.
>>>>>>>
>>>>>>> So essentially, what I want is to have mathematical operations on
>>>>>>> this object, where I would exploit the fact that all entries are
>>>>>>> numbers, so the typical matrix operations make sense.
>>>>>>>
>>>>>>> [ side question which could be relevant in this discussion: for a
>>>>>>> numeric Rle is there some notion of precision - say I have truly
>>>>>>> numeric values with tons of digits, and I want to consider two
>>>>>>> numbers part of the same run if |x1 - x2| < epsilon? ]
>>>>>>
>>>>>>
>>>>>> You can see that Pete has had similar thoughts in
>>>>>> genoset/R/DataFrame-methods.R, although he only has colMeans (which is
>>>>>> the easy one).
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>> Kasper
>>>>>>>
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>>>>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do we have a matrix-like object, but where the columns are Rle's?
>>>>>>>>>
>>>>>>>>> Kasper
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>


