[R] Why do we have to turn factors into characters for various functions?

Petr Savicky savicky at cs.cas.cz
Sun Dec 12 20:12:55 CET 2010


On Sun, Dec 12, 2010 at 12:48:30AM +0200, Tal Galili wrote:
> Hello dear R-help mailing list,
> 
> My question is *not* about how factors are implemented in R (which is, if I
> understand correctly, that a factor keeps numbers and assigns levels to them).
> My question *is* about why so many functions that work on factors don't
> treat them as characters by default?

Personally, I try to use factors only when there is a specific reason
for them, and character vectors otherwise. Factors are natural for data
used to build a classification model, for categorical attributes, and
for preparing input to table() and related functions.
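
As a small illustration of the table() case (the data here are made up
for the example): a factor remembers its full set of levels, so table()
reports a zero count for a level that does not occur, while a character
vector carries no such information.

```r
# Hypothetical answers; "maybe" is a valid category that nobody chose
answers <- c("yes", "no", "yes", "yes")
f <- factor(answers, levels = c("no", "maybe", "yes"))

tab_chr <- table(answers)  # only the values that actually occur
tab_fac <- table(f)        # one cell per level, including the empty one

print(tab_chr)
print(tab_fac)
```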

> Here are two simple examples:
> Example one turning the characters inside a factor into numeric:
> 
> x <- factor(4:6)
> as.numeric(x) # output: 1 2 3
> as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?

If you are concerned with computing time, then applying as.numeric()
only to the levels, and indexing the result by the integer codes, is
probably better:

  x <- factor(rep(4:6, times=1000000))
  cpu1 <- system.time( out1 <- as.numeric(as.character(x)) )
  cpu2 <- system.time( out2 <- as.numeric(levels(x))[as.integer(x)] )
  rbind(cpu1, cpu2)

       user.self sys.self elapsed user.child sys.child
  cpu1     0.570    0.031   0.601          0         0
  cpu2     0.042    0.027   0.070          0         0
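
The second form works because a factor stores one integer code per
element plus a short vector of levels, so as.numeric() has to parse
only the distinct level strings rather than every element. Timings will
differ between machines. A smaller sketch that checks that the two
conversions agree (the values and vector length are chosen arbitrarily
for illustration):

```r
x <- factor(rep(c(4, 5.5, 10), times = 1000))

# Convert every element: factor -> character -> numeric
out1 <- as.numeric(as.character(x))

# Convert only the distinct levels, then expand by the integer codes
out2 <- as.numeric(levels(x))[as.integer(x)]

# Both routes should give exactly the same numeric vector
stopifnot(identical(out1, out2))
print(head(out2))
```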

> Is it that implementing a switch of factors to characters as the default in
> some of the basic function will cause old code to break?

I think that this is an important part of the reason.
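
As one illustration of code that relies on the current behaviour (a
made-up example, not taken from any particular package): using a factor
as an index selects by its integer codes, not by its labels.

```r
group <- factor(c("b", "a", "b", "c"))
cols  <- c("red", "green", "blue")   # hypothetical colours, one per level

# Indexing by a factor uses its integer codes (here 2 1 2 3)
by_code <- cols[group]

# Indexing by the labels instead needs a named vector
names(cols) <- levels(group)
by_label <- cols[as.character(group)]

print(by_code)
print(unname(by_label))
```

If a factor were silently treated as a character vector here, the first
lookup on the unnamed vector would no longer find anything.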

Petr Savicky.


