[R] factor : how does it work ?

Duncan Murdoch murdoch at stats.uwo.ca
Thu Oct 6 15:36:28 CEST 2005


On 10/6/2005 9:14 AM, Florence Combes wrote:
> Dear all,
> 
> I have been trying for a long time to understand exactly what the factor
> type is and especially how it works, but it seems too difficult for me....
> I have read paragraphs about it, and I understand quite well what it is (I think)
> but I still can't figure out how to deal with it.
> Especially these 2 mysteries (for me):
> 
> 1st, when I make a data frame (with the as.data.frame() or the data.frame()
> commands) from vectors, it seems that some "columns" of the data frame (which
> were vectors) are factors and some are not, but I didn't find an explanation
> of which become factors and which don't.
> (I know I can use I() to avoid the factor transformation, but I think it is
> not an optimal solution to avoid the factor type just because I don't know
> how to deal with it.)

This is described in the ?data.frame man page:  "Character variables 
passed to 'data.frame' are converted to factor columns unless protected 
by 'I'."
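
A minimal sketch of that rule. It passes stringsAsFactors = TRUE explicitly, 
because the default quoted above applies to older versions of R; from R 4.0.0 
the default is FALSE and character columns stay character unless you ask 
for conversion.  The data frame name d is only illustrative.

```r
# Character columns become factors when stringsAsFactors = TRUE
# (the default before R 4.0.0; requested explicitly here).
d <- data.frame(x = 1:3,
                y = c("a", "b", "a"),
                z = I(c("a", "b", "a")),   # protected by I(): stays character
                stringsAsFactors = TRUE)

# Which columns ended up as factors?
sapply(d, is.factor)
```

Only y should be reported as a factor: x is numeric, and z was protected by I().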

> 2nd, I can't manage to deal with factors, so when I have some, I transform
> them into vectors (with levels()), but I think I am missing the power and
> utility of the factor type?

levels() is not the conversion you want.  That lists all the levels, but 
it doesn't tell you how they correspond to individual observations.  For 
example,

 > df <- data.frame(x=1:3, y=c('a','b','a'))
 > df
   x y
1 1 a
2 2 b
3 3 a
 > levels(df$y)
[1] "a" "b"

If you need to convert back to character values, use as.character():

 > as.character(df$y)
[1] "a" "b" "a"
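
As a rough picture of why the two calls differ, a factor can be thought of 
as a vector of integer codes (one per observation) plus a lookup table of 
levels.  The sketch below uses a standalone factor f (an illustrative name, 
not part of the example above) and rebuilds the character values by indexing 
the levels with the codes.

```r
f <- factor(c("a", "b", "a"))

levels(f)        # the lookup table: one entry per distinct value
as.integer(f)    # the codes: one entry per observation

# Indexing the levels by the codes gives back the original values,
# which is what as.character() returns.
rebuilt <- levels(f)[as.integer(f)]
identical(rebuilt, as.character(f))
```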

For many purposes, you can ignore the fact that your data is stored as a 
factor instead of a character vector.  There are a few differences:

  1. You can't compare factor values with <, >, <= or >= unless you declared 
the factor to be ordered (== and != still work):

 > df$y[1] > df$y[2]
[1] NA
Warning message:
 > not meaningful for factors in: Ops.factor(df$y[1], df$y[2])

but

 > df$y <- ordered(df$y)
 > df$y[1] > df$y[2]
[1] FALSE

However, you need to watch out here: the comparison is done by the order 
of the levels, not by an alphabetic comparison of their names:

 > levels(df$y) <- c("before", "after")
 > df
   x      y
1 1 before
2 2  after
3 3 before
 > df$y[1] > df$y[2]
[1] FALSE
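
If the level order matters, it is probably safer to state it when the 
factor is created rather than renaming levels afterwards.  A small sketch, 
using an illustrative vector y rather than the data frame above:

```r
# The levels argument fixes the order; ordered = TRUE enables < and >.
y <- factor(c("before", "after", "before"),
            levels = c("before", "after"),
            ordered = TRUE)

levels(y)        # "before" is level 1, "after" is level 2
y[1] < y[2]      # compares level positions, so "before" < "after" is TRUE

# A plain character comparison goes by the strings themselves instead.
"before" < "after"
```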


  2. as.integer() works differently on factors:  it gets the position in 
the levels vector.  For example,

 > as.integer(df$y)
[1] 1 2 1
 > as.integer(as.character(df$y))
[1] NA NA NA
Warning message:
NAs introduced by coercion
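
The same difference is easy to trip over when the levels look like numbers, 
because as.integer() still returns level positions and not the values.  A 
short sketch with an illustrative factor f:

```r
f <- factor(c("10", "20", "10"))

codes  <- as.integer(f)                 # level positions: 1 2 1
values <- as.integer(as.character(f))   # the numbers themselves: 10 20 10

# An equivalent idiom: convert the levels once, then index by the factor.
values2 <- as.numeric(levels(f))[f]
```

Going through as.character() first is the step that recovers the values.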

There are other differences, but these are the two main ones that are 
likely to cause you trouble.

Duncan Murdoch



