[R] Handling of factors

Thomas Lumley tlumley at u.washington.edu
Wed Jan 21 09:40:26 CET 2009


As a follow-up, I don't see any reason why rle() shouldn't work on factors. There's no ambiguity about what the result should be, and the current implementation in rle() would work on factors if they could get past the pre-test.
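
A minimal sketch of that claim, assuming the usual run-detection logic (compare each element with its successor, then take differences of the run-end positions). The helper name rle_factor is hypothetical and not part of base R, and the body is written from that general approach, so it may differ from the rle() source in any particular R version. The only deliberate difference is that it has no vector pre-test:

```r
# Hypothetical helper, not part of base R: run-length encoding without
# the vector pre-test, so factors are accepted.
rle_factor <- function(x) {
  n <- length(x)
  if (n == 0L) {
    return(structure(list(lengths = integer(), values = x), class = "rle"))
  }
  # changed[k] is TRUE when element k + 1 differs from element k
  changed <- x[-1L] != x[-n]
  # Positions where a run ends; an NA comparison also ends a run
  ends <- c(which(changed | is.na(changed)), n)
  structure(list(lengths = diff(c(0L, ends)), values = x[ends]), class = "rle")
}

# Illustrative data only
f <- factor(c("a", "a", "b", "b", "b", "a"))
r <- rle_factor(f)
r$lengths               # 2 3 1
as.character(r$values)  # "a" "b" "a"
```

The values component stays a factor with the original levels, which is the unambiguous result referred to above.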

        -thomas

On Wed, 21 Jan 2009, Thomas Lumley wrote:

> On Tue, 20 Jan 2009, Stavros Macrakis wrote:
>
>> I'm rather confused by the semantics of factors.
>> 
> <snip actual confusion>
>> 
>> It is all very confusing.  Of course, most of this behavior is
>> documented and is easily determined by experimentation, but it would
>> be easier to learn and teach the language if there were some clear
>> principle underlying all this.  What am I missing?
>> 
>
> No, it really is confusing. The problem is that there are two conflicting clear 
> principles. Factors could be
>
> - integer variables with labels (similar to value labels in Stata/SPSS or C 
> enums)
> - variables that take on values from a pre-specified set, implemented using 
> integer codes (like Pascal enumerated types).
>
> [In fact, there was historically even a third way to view factors, as a way to 
> reduce the memory use of string variables. That's obsolete now.]
>
> That is, the fact that they are small integers can be seen as part of the 
> interface or just as part of the implementation.  It's obvious which one is 
> right, but unfortunately it is differently obvious to different people.
>
> AFAIK there has never been a unified policy on this, dating back before R, so 
> different functions behave differently.  There have been changes in R over the 
> years, mostly in the direction of making factors more like Pascal enumerations.
>
>     -thomas
>
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
>
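
The two views quoted above can both be seen in a short session. This is only a sketch with made-up data (the level names are illustrative), using base R calls:

```r
# Illustrative factor with one unused level, "mid"
f <- factor(c("low", "high", "low"), levels = c("low", "mid", "high"))

# View 1: integer codes with labels attached
as.integer(f)   # 1 3 1
levels(f)       # "low" "mid" "high"

# View 2: values restricted to a pre-specified set. Assigning a value
# outside the set gives NA (with a warning, suppressed here).
g <- f
suppressWarnings(g[2] <- "huge")
is.na(g[2])     # TRUE

# Arithmetic is refused rather than applied to the codes: the result is
# NA (with a warning, suppressed here).
all(is.na(suppressWarnings(f + 1L)))   # TRUE
```

The first part exposes the integer codes as interface; the second and third parts treat them as an implementation detail hidden behind the level set.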

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



