[R] Best practice: to factor or not to factor for float variables

David Winsemius dwinsemius at comcast.net
Fri Jul 4 19:18:48 CEST 2014


Keep the variable numeric and group with cut(), Hmisc::cut2(), or findInterval(). The beauty of the functional language design is that you can compute the grouping where it is needed, so you do not need to create a new factor variable.
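
A minimal sketch of that idea, using base R only. The heights, values, and bin edges below are made up for illustration and are not from the original question; the edges are assumed to sit safely between the nominal heights, so floating-point noise cannot move a value into a different bin.

```r
# Hypothetical data: heights 0.1 + 0.2 and 0.3 are "the same" height
# but differ in their floating-point representation.
height <- c(2, 0.1 + 0.2, 0.3, 10, 2, 10)
value  <- c(1, 2, 4, 5, 3, 7)

# Assumed bin edges, placed between the nominal heights 0.3, 2 and 10.
breaks <- c(0, 1, 5, 20)

# findInterval() returns the index of the bin each height falls into.
bin <- findInterval(height, breaks)

# Group on the fly: the grouping is computed inside the call,
# and no factor column is stored alongside the data.
mean_by_bin <- tapply(value, bin, mean)

# cut() does the same job but labels each group with its interval.
mean_by_cut <- tapply(value, cut(height, breaks), mean)

# height is still numeric, so comparisons and max() keep working.
highest <- max(height)
```

Both 0.1 + 0.2 and 0.3 land in the first bin, although they do not compare equal with ==.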

-- 
David

Sent from my iPhone

> On Jul 4, 2014, at 8:33 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
> 
> Why not just round the floating point numbers to ensure they're equal
> with zapsmall, round or signif?
> 
> Hadley
> 
> On Fri, Jul 4, 2014 at 4:04 AM, Sebastian Schubert
> <schubert.seb at gmail.com> wrote:
>> Hi,
>> 
>> I would like to ask for best practice advice on the design of data
>> structure and the connected analysis techniques.
>> 
>> In my particular case, I have measurements of several variables at
>> several, sometimes equal, heights. Following the tidy data approach of
>> Hadley Wickham, I want to put all data in one data frame. In principle,
>> the height variable is something like a category. For example, I want to
>> average over time for every height. Using dplyr this works very well
>> when my height variable is a factor. However, if it is not a factor, the
>> grouping sometimes does not work, probably due to numerical issues:
>> 
>> http://stackoverflow.com/questions/24555010/dplyr-and-group-by-factor-vs-no-factor
>> https://github.com/hadley/dplyr/issues/482
>> 
>> Even if the behaviour described in the links above is a bug, one can
>> easily create other numerical issues in R:
>>> (0.1+0.2) == 0.3
>> [1] FALSE
>> 
>> Thus, it seems one should avoid grouping by float values and, in my
>> case, use factors. However, from time to time, I need the numerical
>> character of the heights: compare heights, find the maximum height, etc.
>> Here, the ordered factor approach might help. However, I have to combine
>> (via rbind or merge) different data sets quite often so keeping the
>> order of the different ordered factor heights also seems to be difficult.
>> 
>> Is there any general approach which reduces the work or do I have to
>> switch between approaches as needed?
>> 
>> Thanks a lot for any input,
>> Sebastian
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
> -- 
> http://had.co.nz/
> 


