[R] Best practice: to factor or not to factor for float variables

Hadley Wickham h.wickham at gmail.com
Fri Jul 4 17:33:34 CEST 2014

Why not just round the floating point numbers to ensure they're equal
with zapsmall, round or signif?
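
A minimal sketch of the rounding idea, using base R only so it runs without extra packages. The data frame, its column names (height, value) and the choice of 6 digits are made up for illustration. Pick a precision finer than the resolution of your measurements.

```r
# Hypothetical data: pairs of heights that should be equal but were
# computed in different ways, so they differ by floating-point error.
df <- data.frame(
  height = c(0.1 + 0.2, 0.3, 0.7 + 0.1, 0.8),
  value  = c(1, 3, 10, 20)
)

# Number of distinct heights before rounding (exact comparison of doubles)
n_before <- length(unique(df$height))

# Round once, right after reading or combining the data.
# digits = 6 is an assumption, not a recommendation for every data set.
df$height <- round(df$height, digits = 6)

# Number of distinct heights after rounding
n_after <- length(unique(df$height))

# Average value per height; height stays numeric, so max(), comparisons
# and sorting still work without converting back from a factor.
means <- aggregate(value ~ height, data = df, FUN = mean)
print(means)
print(max(means$height))
```

Because the column stays numeric, rbind or merge on several data sets needs no factor-level bookkeeping, as long as every data set is rounded to the same number of digits first.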


On Fri, Jul 4, 2014 at 4:04 AM, Sebastian Schubert
<schubert.seb at gmail.com> wrote:
> Hi,
> I would like to ask for best practice advice on the design of data
> structure and the connected analysis techniques.
> In my particular case, I have measurements of several variables at
> several, sometimes equal, heights. Following the tidy data approach of
> Hadley Wickham, I want to put all data in one data frame. In principle,
> the height variable is something like a category. For example, I want to
> average over time for every height. Using dplyr this works very well
> when my height variable is a factor. However, if it is not a factor, the
> grouping sometimes does not work, probably due to numerical issues:
> http://stackoverflow.com/questions/24555010/dplyr-and-group-by-factor-vs-no-factor
> https://github.com/hadley/dplyr/issues/482
> Even if the behaviour described in the links above is a bug, one can
> easily create other numerical issues in R:
>> (0.1+0.2) == 0.3
> [1] FALSE
> Thus, it seems one should avoid grouping by float values and, in my
> case, use factors. However, from time to time, I need the numerical
> character of the heights: compare heights, find the maximum height, etc.
> Here, the ordered factor approach might help. However, I have to combine
> (via rbind or merge) different data sets quite often so keeping the
> order of the different ordered factor heights also seems to be difficult.
> Is there any general approach which reduces the work or do I have to
> switch between approaches as needed?
> Thanks a lot for any input,
> Sebastian
