[BioC] matrix like object with Rle columns

Hervé Pagès hpages at fhcrc.org
Thu Jun 28 19:58:39 CEST 2012


Hi Michael,

On 06/27/2012 01:58 PM, Michael Lawrence wrote:
>
>
> On Wed, Jun 27, 2012 at 1:37 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Kasper,
>
>     On 06/25/2012 08:56 PM, Kasper Daniel Hansen wrote:
>     [...]
>
>         [ side question which could be relevant in this discussion: for a
>         numeric Rle is there some notion of precision - say I have truly
>         numeric values with tons of digits, and I want to consider two
>         numbers part of the same run if |x1 - x2| < epsilon? ]
>
>
>     The comparison of 2 doubles is done at the C level with ==, which
>     AFAIK is the same as doing == in R (as long as we deal with non-NA
>     and non-NaN values). See the _fill_Rle_slots_with_double_vals() helper
>     function in IRanges/src/Rle_class.c for the details.
>
>     Therefore:
>
>       > all.equal(sqrt(3)^2, 3)
>       [1] TRUE
>       > sqrt(3)^2 == 3
>       [1] FALSE
>       > Rle(c(sqrt(3)^2, 3))
>       'numeric' Rle of length 2 with 2 runs
>         Lengths: 1 1
>         Values : 3 3
>
>     Note that base::rle() does the same:
>
>       > rle(c(sqrt(3)^2, 3))
>       Run Length Encoding
>         lengths: int [1:2] 1 1
>         values : num [1:2] 3 3
>
>     I can see that using a "|x1 - x2| < epsilon" criterion would in general
>     give better compression (fewer runs) but then the compression would no
>     longer be lossless, as it is right now:
>
>       > x <- c(sqrt(3)^2, 3)
>       > identical(as.vector(Rle(x)), x)
>       [1] TRUE
>       > identical(inverse.rle(rle(x)), x)
>       [1] TRUE
>
>     Also the "|x1 - x2| < epsilon" approach would introduce some subtle
>     complications due to the fact that the criterion is no longer
>     transitive, i.e. you can have |x1 - x2| < epsilon and
>     |x2 - x3| < epsilon without having |x1 - x3| < epsilon. Because of
>     that, finding the runs becomes some kind of clustering problem with
>     several possible strategies, some of them very simple but not
>     necessarily with the "good properties".
>
>
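
To make the non-transitivity point above concrete, here is a minimal
sketch in base R. The tolerance 'eps' and the two helpers chain_runs()
and anchor_runs() are made up for illustration only (they are not part
of IRanges); they just show that two equally simple strategies can
disagree on where the runs are.

```r
# Hypothetical tolerance and data, chosen only for illustration
eps <- 1e-3
x <- c(0, 0.0006, 0.0012)

# Neighbours are within eps of each other, but the end points are not
abs(x[1] - x[2]) < eps   # TRUE
abs(x[2] - x[3]) < eps   # TRUE
abs(x[1] - x[3]) < eps   # FALSE

# Strategy A ("chain"): start a new run when a value differs from its
# immediate predecessor by eps or more. Returns a run id per element.
chain_runs <- function(x, eps) {
    if (length(x) == 0L)
        return(integer(0))
    cumsum(c(TRUE, abs(diff(x)) >= eps))
}

# Strategy B ("anchor"): start a new run when a value differs from the
# first value of the current run by eps or more.
anchor_runs <- function(x, eps) {
    id <- integer(length(x))
    run <- 0L
    anchor <- NA_real_
    for (i in seq_along(x)) {
        if (i == 1L || abs(x[i] - anchor) >= eps) {
            run <- run + 1L
            anchor <- x[i]
        }
        id[i] <- run
    }
    id
}

chain_runs(x, eps)    # 1 1 1  (one run)
anchor_runs(x, eps)   # 1 1 2  (two runs)
```

With the chain strategy a slowly drifting signal can end up in a single
run even though its first and last values are far apart; the anchor
strategy avoids that but makes the result depend on where each run
happens to start.
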
> One simple "clustering" would be to round to some fixed level of
> precision. One could multiply by some power of 10 and coerce to integer
> to avoid any floating point issues.

For example, Rle(round(x, digits=4)). If people feel that this would be
useful, we could add a 'digits' arg to the Rle() constructor so that the
rounding is taken care of by the constructor itself. It would default to
NA, meaning no rounding at all (like now), so that the good properties
are preserved by default, e.g. lossless compression and the fact that
unique, duplicated, is.unsorted, sort, order, rank etc. (anything
involving comparison between doubles) behave exactly the same way on x
and Rle(x) (there is code around that relies on such behavior).
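
A rough sketch of what the 'digits' idea would amount to, written with
base::rle() so it runs without IRanges. rle_rounded() is a hypothetical
helper, not an existing function, and the 'digits' arg is only the
proposal discussed here, not something Rle() is known to have.

```r
# rle_rounded() is a hypothetical helper, for illustration only.
# digits = NA means "no rounding", i.e. the exact comparison used now.
rle_rounded <- function(x, digits = NA) {
    if (!is.na(digits))
        x <- round(x, digits = digits)
    rle(x)
}

x <- c(sqrt(3)^2, 3, 3.00004, 3.00006)

r0 <- rle_rounded(x)              # exact comparison
r4 <- rle_rounded(x, digits = 4)  # round to 4 decimal places first

length(r0$lengths)                # 4 runs
length(r4$lengths)                # 2 runs

# Without rounding the encoding is lossless; with rounding it is not
identical(inverse.rle(r0), x)     # TRUE
identical(inverse.rle(r4), x)     # FALSE

# Michael's variant: scale by a power of 10 and coerce to integer,
# so that the runs are found by comparing integers
ri <- rle(as.integer(round(x * 10^4)))
length(ri$lengths)                # 2 runs
```

Note that rounding puts 3.00004 and 3.00006 in different runs even
though they differ by only 2e-5, so it is a binning of the values and
not the same thing as an "|x1 - x2| < epsilon" criterion.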

We could also consider using signif() instead of round(), i.e. keeping
a fixed number of significant digits rather than a fixed number of
decimal places.
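
The difference between the two matters when the values span several
orders of magnitude. A small base R illustration (the data is made up
and base::rle() stands in for Rle() so that it runs without IRanges):

```r
# Made-up values: two large ones and two tiny ones
x <- c(123456.7, 123456.9, 0.0001231, 0.0001234)

# round() keeps a fixed number of decimal places: the large values stay
# distinct and the tiny ones both collapse to 0
rx <- round(x, digits = 3)

# signif() keeps a fixed number of significant digits: each pair
# collapses to a common value and the tiny values stay non-zero
sx <- signif(x, digits = 3)

length(rle(rx)$lengths)   # 3 runs
length(rle(sx)$lengths)   # 2 runs
```

So round() controls the absolute error and signif() the relative error,
and which one is preferable depends on the kind of data being encoded.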

Cheers,
H.

>
>     H.
>
>
>
>         Kasper
>
>
>             Michael
>
>             On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>             <kasperdanielhansen at gmail.com> wrote:
>
>
>                 Do we have a matrix-like object, but where the columns
>                 are Rle's?
>
>                 Kasper
>
>                 _______________________________________________
>                 Bioconductor mailing list
>                 Bioconductor at r-project.org
>                 https://stat.ethz.ch/mailman/listinfo/bioconductor
>                 Search the archives:
>                 http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
>
>
>
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


