[R] Why do we have to turn factors into characters for various functions?

Petr PIKAL petr.pikal at precheza.cz
Mon Dec 13 08:50:56 CET 2010


Hi

r-help-bounces at r-project.org wrote on 12.12.2010 21:00:37:

> At 12.12.2010 00:48 +0200, Tal Galili wrote:
> >Hello dear R-help mailing list,
> >
> >My question is *not* about how factors are implemented in R (which is,
> >if I understand correctly, that factors keep numbers and assign levels
> >to them).
> >My question *is* about why so many functions that work on factors don't
> >treat them as characters by default.
> >
> >Here are two simple examples:
> >Example one turning the characters inside a factor into numeric:
> >
> >x <- factor(4:6)
> >as.numeric(x) # output: 1 2 3
> >as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?
> >
> >
> >Example two, using strsplit on a factor:
> >
> >x <- factor(paste(letters[4:6], 4:6, sep="A"))
> >strsplit(x, "A") # will result in an error:
> ># Error in strsplit(x, "A") : non-character argument
> >strsplit(as.character(x), "A") # will work and split
> >
> >
> >So what is the reason this is the case?
> >Is it that making a switch from factors to characters the default in
> >some of the basic functions would cause old code to break?
> >Is it a better design in some other way?
> >
> >I am curious to know the reason for this.
> 
> In my view the answer can be found implicitly in the language definition.
> 
> "Factors are currently implemented using an integer array to specify 
> the actual levels and a second array of names that are mapped to the 
> integers. Rather unfortunately users often make use of the 
> implementation in order to make some calculations easier."
> 
> It is the "unfortunate" use of factors that seems generally accepted, 
> even if the language definition continues:
> 
> "This, however, is an implementation issue and is not guaranteed to 
> hold in all implementations of R."
> 
> Personally, like some others, I avoid factors, except in cases where 
> they represent a statistical concept.

On the contrary, I find factors quite useful. Consider the possibility of
changing their levels:

> set.seed(111)
> x <- factor(sample(1:4, 20, replace=T),
+             labels=c("one", "two", "three", "four"))
> x
 [1] three three two   three two   two   one   three two   one   three three
[13] one   one   one   two   one   four  two   three
Levels: one two three four
> levels(x)[3:4] <- "more"
> x
 [1] more more two  more two  two  one  more two  one  more more one  one  one
[16] two  one  more two  more
Levels: one two more
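
For comparison, here is a minimal sketch of the same merge done on a character vector and on a factor. The values are typed in from the output above rather than drawn with sample(), so the result does not depend on the random number generator. The names idx, labs, x_chr and x_fac are mine, for illustration only.

```r
# Values copied from the printed factor above (1 = one, ..., 4 = four).
idx  <- c(3, 3, 2, 3, 2, 2, 1, 3, 2, 1, 3, 3, 1, 1, 1, 2, 1, 4, 2, 3)
labs <- c("one", "two", "three", "four")

x_chr <- labs[idx]                              # plain character vector
x_fac <- factor(idx, levels = 1:4, labels = labs)

# Character vector: locate the elements to recode and overwrite each one.
x_chr[x_chr %in% c("three", "four")] <- "more"

# Factor: rename two levels; levels with the same name are merged.
levels(x_fac)[3:4] <- "more"

same_labels <- identical(x_chr, as.character(x_fac))
print(same_labels)
print(levels(x_fac))
```

Both routes end with the same labels. The factor version only touches the level names, not the 20 elements.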

I believe the same can be done when x is a character vector, but the factor
way seems more convenient to me. I also quite often distinguish points in
plots with pch=as.numeric(some.factor).
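
A small sketch of the pch idiom, assuming made-up data: grp stands in for some.factor, and xs and ys are hypothetical coordinates. It relies on as.numeric() returning the integer codes of the factor, which follow the order of its levels.

```r
# Hypothetical data; 'grp' plays the role of some.factor.
grp <- factor(c("a", "b", "c", "a", "b", "c"))
xs  <- 1:6
ys  <- c(2.1, 3.4, 1.8, 2.9, 3.1, 2.2)

# Integer codes of the factor, one per observation, in level order.
sym <- as.numeric(grp)

# Each group gets its own plotting symbol.
plot(xs, ys, pch = sym)
legend("topleft", legend = levels(grp), pch = seq_along(levels(grp)))
```

Because the codes depend on the level order, reordering the levels changes which symbol each group gets.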

Anyway, it is perhaps more a matter of personal habit than of bad factor
"features".
 
Regards
Petr

> 
> Certainly I would agree with you that, if only reading the "R 
> Language Definition" and not the documentation of the function 
> factor, one would rather expect functions like as.numeric or strsplit 
> to operate on the levels of a factor and not on the underlying, 
> implementation specific, integer array.
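
A short sketch of the two views discussed here, using the examples from the original question. The as.numeric(levels(x))[x] form is the one given in ?factor; the variable names are only for illustration.

```r
x <- factor(4:6)

codes  <- as.integer(x)               # the underlying integer codes
labs   <- levels(x)                   # the labels, stored as character
via_ch <- as.numeric(as.character(x)) # labels converted element by element
via_lv <- as.numeric(levels(x))[x]    # convert the levels once, then index

y <- factor(paste(letters[4:6], 4:6, sep = "A"))
# strsplit() needs a character vector, so convert the factor explicitly.
parts <- strsplit(as.character(y), "A")

print(codes)
print(via_ch)
print(parts)
```
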
> 
> Heinz
> 
> 
> 
> >Thank you for your reading,
> >Tal
> >
> >----------------Contact Details:-------------------------------------------------------
> >Contact me: Tal.Galili at gmail.com |  972-52-7275845
> >Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> >www.r-statistics.com (English)
> >----------------------------------------------------------------------------------------------
> >
> >______________________________________________
> >R-help at r-project.org mailing list
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
> 


