[BioC] IRanges::Rle and missing values

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Sat Aug 21 21:47:59 CEST 2010


Thanks a lot for the fix.

Some background.  I have data associated with (very small) genomic
locations, irregularly spaced, and I wanted to use the runmean
functionality.

Now, for the standard example of Rles: coverage across the genome,
"missing" data is equal to a coverage of zero.  But in my case, zero
is a perfectly fine data value and is quite different from NA which
indicates no data.  So while I would like to calculate running means
with a fixed window size (and hence a different number of data points
in each window, since they are irregularly spaced), I could not use
the runmean function with missing values filled in as zero.

I found a solution to my specific problem, which relies on the fact
that my problem with the running mean is really about using the right
denominator.  I just create 2 Rle's, one with the data values and
zeroes where data is missing, and one with 0 and 1 (1 indicating that
there is data); the "right" running mean is then the ratio between
the two running sums.
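
A minimal sketch of that ratio-of-running-sums idea, in base R only so
that it runs without IRanges.  The helper window_sum() is hypothetical
(it is not an IRanges function) and stands in for calling runsum() on
each of the two Rle's; the data vector and k = 3 are taken from the
example quoted below.

```r
# Hypothetical helper (not part of IRanges): sum over every length-k
# window of x, returning length(x) - k + 1 values.
window_sum <- function(x, k) {
  cs <- c(0, cumsum(x))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)
k <- 3

has_data <- as.numeric(!is.na(x))   # 1 where there is data, 0 where missing
filled   <- ifelse(is.na(x), 0, x)  # data values, with NA replaced by 0

num <- window_sum(filled, k)        # running sum of the observed values
den <- window_sum(has_data, k)      # number of observed values per window

# Windows with no data at all have den == 0; report NA instead of 0/0.
na_aware_mean <- ifelse(den > 0, num / den, NA_real_)

# For comparison: the mean obtained by treating missing values as zero.
zero_filled_mean <- num / k
```

For the window covering positions 4 to 6 (values 2, 3, NA) the ratio
gives 5 / 2, while the zero-filled mean gives 5 / 3, which is the
denominator problem described above.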

Since NA's are allowed I think it makes a lot of sense to support them
in the run* suite of functions, but it is not something that is
extremely urgent (to me) (since I found a workaround).

Thanks for the help,
Kasper

On Fri, Aug 20, 2010 at 8:03 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>  Kasper,
> I have addressed these two issues, which were caused by inappropriate
> comparisons using NA_REAL at the C-level for 'numeric' Rle objects. As with
> the runmed function in the stats package, I don't currently support missing
> values in the run* methods for Rle objects. Below is the current behavior in
> IRanges 1.6.15 (BioC 2.6, R-2.11) and IRanges 1.7.21 (BioC 2.7, R-devel). I
> can add support for missing values. Just so I prioritize this, when do you
> encounter missing values in your Rle vectors?
>
>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>
>> tmp
> 'numeric' Rle of length 16 with 7 runs
>  Lengths:  1  3  1  4  1  5  1
>  Values :  1  2  3 NA  2  3  2
>
>> runsum(tmp, 3)
> Error in runsum(tmp, 3) : some values are NA, NaN, +/-Inf
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] IRanges_1.7.21
>
>
>
> Patrick
>
>
> On 8/20/10 9:43 AM, Patrick Aboyoun wrote:
>>
>>  Kasper,
>> I'll take a look into this. The Rle constructor issue seems to be isolated
>> to 'numeric' and 'complex' Rles. I'll have an update out soon.
>>
>>
>> Patrick
>>
>>
>> On 8/20/10 8:53 AM, Kasper Daniel Hansen wrote:
>>>
>>> Would it make sense to allow missing values in Rle objects and also to
>>> incorporate removal of missing values in running summaries (and
>>> possibly other functions)?
>>>
>>> Example:
>>>
>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>> tmp
>>>
>>> 'numeric' Rle of length 16 with 10 runs
>>>   Lengths:  1  3  1  1  1  1  1  1  5  1
>>>   Values :  1  2  3 NA NA NA NA  2  3  2
>>>
>>> Seems like the run of 4 NA's is treated differently
>>>
>>>> runsum(tmp, k = 2)
>>>
>>> 'numeric' Rle of length 15 with 11 runs
>>>   Lengths:  1  2  1  1  1  1  1  1  1  4  1
>>>   Values :  3  4  5 NA NA NA NA NA NA NA NA
>>>
>>> And there is no way to do runsum(..., na.rm = TRUE) like in sum (as
>>> far as I can see).
>>>
>>> Kasper
>>>
>>>> sessionInfo()
>>>
>>> R version 2.12.0 Under development (unstable) (2010-08-20 r52790)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>>  [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
>>>  [7] LC_PAPER=en_US.iso885915       LC_NAME=C
>>>  [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] grid      stats     graphics  grDevices datasets  utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] multicore_0.1-3   IRanges_1.7.19    matrixStats_0.2.1 R.methodsS3_1.2.0
>>> [5] ggplot2_0.8.8     proto_0.3-8       reshape_0.8.3     plyr_1.1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.12.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>


