[R] Best practice: to factor or not to factor for float variables

Fri Jul 4 13:04:29 CEST 2014

Hi,

I would like to ask for best practice advice on the design of data
structures and the related analysis techniques.

In my particular case, I have measurements of several variables at
several, sometimes equal, heights. Following the tidy data approach of
Hadley Wickham, I want to put all data in one data frame. In principle,
the height variable is something like a category. For example, I want to
average over time for every height. Using dplyr this works very well
when my height variable is a factor. However, if it is not a factor, the
grouping sometimes does not work, probably due to numerical issues:

http://stackoverflow.com/questions/24555010/dplyr-and-group-by-factor-vs-no-factor
https://github.com/hadley/dplyr/issues/482

Even if the behaviour described in the links above is a bug, one can
easily create other numerical issues in R:
> (0.1+0.2) == 0.3
[1] FALSE
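
To make the numerical issue concrete, here is a minimal sketch with
made-up heights (not real measurements). It contrasts exact comparison
with a tolerance-based one via all.equal(), and shows one possible way
to get a stable grouping key by converting heights to integers. The
1 cm resolution and the name key_cm are assumptions chosen only for
illustration.

```r
# Exact comparison of doubles fails; a tolerance-based comparison does not.
x <- 0.1 + 0.2
exact    <- x == 0.3                    # exact bitwise comparison
tolerant <- isTRUE(all.equal(x, 0.3))   # comparison within a small tolerance

# Made-up heights in metres; the first two are "equal" only up to rounding error.
heights <- c(0.1 + 0.2, 0.3, 1.5, 1.5)

# Grouping on the raw doubles sees the first two values as different groups.
n_groups_raw <- length(unique(heights))

# Assumption: heights are only meaningful to 1 cm, so an integer
# centimetre key can serve as the grouping variable instead.
key_cm <- as.integer(round(heights * 100))
n_groups_key <- length(unique(key_cm))
```

An integer key keeps its numerical character (max(), comparisons) and
can be divided by 100 again when the height in metres is needed.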

Thus, it seems one should avoid grouping by float values and, in my
case, use factors. However, from time to time, I need the numerical
character of the heights: compare heights, find the maximum height, etc.
Here, the ordered factor approach might help. However, I have to combine
(via rbind or merge) different data sets quite often, so keeping the
order of the different ordered factor heights also seems to be difficult.
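
A minimal sketch of this difficulty in base R, again with made-up
heights. The data frames a and b and their columns are hypothetical.
It shows that rbind() merges the factor levels in order of appearance
rather than numeric order, that as.numeric() on a factor returns the
level codes, and one possible way to rebuild an ordered factor after
combining.

```r
# Two hypothetical data sets with partly different heights stored as factors.
a <- data.frame(height = factor(c(2, 10)), value = c(1.0, 2.0))
b <- data.frame(height = factor(c(5, 10)), value = c(3.0, 4.0))

ab <- rbind(a, b)

# Levels are merged in order of appearance, so they are no longer sorted numerically.
levels_after_rbind <- levels(ab$height)

# as.numeric() on a factor yields the internal codes, not the heights.
codes <- as.numeric(ab$height)

# Going through as.character() recovers the actual numeric heights.
heights_num <- as.numeric(as.character(ab$height))
max_height  <- max(heights_num)

# One option: rebuild an ordered factor with numerically sorted levels
# after every rbind/merge.
ab$height <- factor(heights_num,
                    levels  = sort(unique(heights_num)),
                    ordered = TRUE)
levels_rebuilt <- levels(ab$height)
```

Rebuilding the factor after each combination restores the order, but
it is exactly the kind of repeated bookkeeping I would like to avoid.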

Is there any general approach which reduces the work or do I have to
switch between approaches as needed?

Thanks a lot for any input,
Sebastian