[R] Best practice: to factor or not to factor for float variables

Sebastian Schubert schubert.seb at gmail.com
Fri Jul 4 21:38:59 CEST 2014

Hi Hadley,

Actually, I started with floating point numbers and made sure that the
respective numbers are equal in R, but I still got strange behaviour with
dplyr's group_by:


If I had to guess, I would suspect that the source of this error lies
somewhere in the C++ part of dplyr. This happened on only one of the
machines I have available. Whether this is a bug in dplyr, or in the
older machine's libraries, or not a bug at all, I cannot say.
Nonetheless, it confirmed my feeling that floating point numbers should
be avoided in this context and led me to ask for advice here...
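
For reference, here is a minimal sketch of the rounding approach suggested
in the quoted reply below, as I understand it. The data frame, the column
names and the choice of 6 decimal places are made up for illustration, and
I use base R's aggregate so that the sketch does not depend on dplyr; the
rounded key could presumably be passed to group_by in the same way.

```r
# Hypothetical data: the first two heights should be equal but differ
# by floating point rounding error.
df <- data.frame(
  height = c(0.3, 0.1 + 0.2, 1.5, 1.5),
  temp   = c(10, 12, 20, 22)
)

# Exact comparison treats 0.3 and 0.1 + 0.2 as different values,
# while all.equal compares with a tolerance.
n_exact  <- length(unique(df$height))
is_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# Round to a tolerance well below the measurement resolution.
# Six decimal places is an assumption; choose what suits the data.
df$height_key <- round(df$height, 6)
n_rounded <- length(unique(df$height_key))

# Average per height using the rounded key.
means <- aggregate(temp ~ height_key, data = df, FUN = mean)
print(means)
```

With the rounded key, the two nearly equal heights should end up in the
same group.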


On 04.07.2014 17:33, Hadley Wickham wrote:
> Why not just round the floating point numbers to ensure they're equal
> with zapsmall, round or signif?
> Hadley
> On Fri, Jul 4, 2014 at 4:04 AM, Sebastian Schubert
> <schubert.seb at gmail.com> wrote:
>> Hi,
>> I would like to ask for best practice advice on the design of data
>> structure and the connected analysis techniques.
>> In my particular case, I have measurements of several variables at
>> several, sometimes equal, heights. Following the tidy data approach of
>> Hadley Wickham, I want to put all data in one data frame. In principle,
>> the height variable is something like a category. For example, I want to
>> average over time for every height. Using dplyr this works very well
>> when my height variable is a factor. However, if it is not a factor, the
>> grouping sometimes does not work, probably due to numerical issues:
>> http://stackoverflow.com/questions/24555010/dplyr-and-group-by-factor-vs-no-factor
>> https://github.com/hadley/dplyr/issues/482
>> Even if the behaviour described in the links above is a bug, one can
>> easily create other numerical issues in R:
>> > (0.1+0.2) == 0.3
>> [1] FALSE
>> Thus, it seems one should avoid grouping by float values and, in my
>> case, use factors. However, from time to time, I need the numerical
>> character of the heights: compare heights, find the maximum height, etc.
>> Here, the ordered factor approach might help. However, I have to combine
>> (via rbind or merge) different data sets quite often so keeping the
>> order of the different ordered factor heights also seems to be difficult.
>> Is there any general approach which reduces the work or do I have to
>> switch between approaches as needed?
>> Thanks a lot for any input,
>> Sebastian
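
One possible way to reduce the switching asked about above, sketched under
assumptions: keep the height as a plain numeric column while combining
data sets, and derive the ordered factor only afterwards from the sorted
unique values. The data frames a and b and their columns are hypothetical.
This does not by itself remove the floating point issue, so the heights
may need rounding first.

```r
# Hypothetical data sets with partly overlapping heights (in metres).
a <- data.frame(height = c(2, 10, 2),   temp = c(15.1, 14.2, 15.3))
b <- data.frame(height = c(10, 50, 50), temp = c(14.0, 12.9, 13.1))

# Combine while height is still numeric, so no factor levels can clash.
all_data <- rbind(a, b)

# Derive the ordered factor afterwards. Sorting the unique numeric values
# makes the level order match the numeric order of the heights.
lv <- sort(unique(all_data$height))
all_data$height_f <- factor(all_data$height, levels = lv, ordered = TRUE)

# Grouping uses the factor; numeric questions use the original column.
mean_by_height <- tapply(all_data$temp, all_data$height_f, mean)
max_height <- max(all_data$height)
print(mean_by_height)
```

The factor is then a derived view that can be rebuilt after every rbind or
merge, while comparisons and maxima use the numeric column.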