[R] Why do we have to turn factors into characters for various functions?

Joshua Wiley jwiley.psych at gmail.com
Sun Dec 12 01:13:33 CET 2010


Hi Tal,

I always think of factors as a way of imposing an order (however
arbitrary) on some variable.  To that extent, the key aspect is first,
second, third, etc., represented numerically in factors as 1, 2, 3,
etc.  The labels are for convenience and interpretation.  Consider:

x <- factor(c(5, 4, 6))
y <- factor(c(6, 5, 7))
as.numeric(x)
as.numeric(y)

Is the numeric or character value of 5 more important?  Or is its
relative position?
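
To make that concrete, here is a small sketch of what the two factors
above hold internally.  The output comments are what I would expect
from the sorted-levels default of factor(); nothing here goes beyond
base R.

```r
x <- factor(c(5, 4, 6))
y <- factor(c(6, 5, 7))

# The labels differ between the two factors ...
levels(x)      # "4" "5" "6"
levels(y)      # "5" "6" "7"

# ... but the underlying integer codes (the relative positions) agree.
as.integer(x)  # 2 1 3
as.integer(y)  # 2 1 3
identical(as.integer(x), as.integer(y))  # TRUE
```

So the value 5 is "second" in x but "first" in y; the code records the
position, and the label is only attached to it.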

If you have character data that you might want to split and
manipulate, store it as a character (string) variable; you can set
options(stringsAsFactors = FALSE) so that read.table() does not
convert strings to factors by default.  If your factor labels are
numeric, that suggests the data might have been better stored as
numeric in the first place.  Generally, when I find myself converting
factors to numeric or character class data, it means I've been using
factor() to recode data (which is not its intended purpose).
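
A minimal sketch of both points, assuming a recent version of R where
read.table() accepts a text argument.  The toy data and the names raw,
d and f are made up for illustration.

```r
# Hypothetical toy data, read from a string instead of a file.
raw <- "id code\n1 dA4\n2 eA5\n3 fA6"
d <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

is.character(d$code)            # TRUE: kept as strings, not a factor
parts <- strsplit(d$code, "A")  # works without any conversion

# If you already have a factor with numeric labels, go through the
# labels rather than the integer codes to recover the values.
f <- factor(c(5, 4, 6))
as.numeric(as.character(f))     # 5 4 6
as.numeric(levels(f))[f]        # 5 4 6
```

The second form converts each level only once and then indexes by the
factor, which can matter for long vectors with few levels.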

My 2 cents.

Cheers,

Josh

On Sat, Dec 11, 2010 at 2:48 PM, Tal Galili <tal.galili at gmail.com> wrote:
> Hello dear R-help mailing list,
>
> My question is *not* about how factors are implemented in R (which is, if I
> understand correctly, that factors keep numbers and assign levels to them).
> My question *is* about why so many functions that work on factors don't
> treat them as characters by default.
>
> Here are two simple examples:
> Example one, turning the characters inside a factor into numeric:
>
> x <- factor(4:6)
> as.numeric(x) # output: 1 2 3
> as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?
>
>
> Example two, using strsplit on a factor:
>
> x <- factor(paste(letters[4:6], 4:6, sep="A"))
> strsplit(x, "A") # will result in an error:
> # Error in strsplit(x, "A") : non-character argument
> strsplit(as.character(x), "A") # will work and split
>
>
> So what is the reason this is the case?
> Is it that making a switch from factors to characters the default in
> some of the basic functions would cause old code to break?
> Is it a better design in some other way?
>
> I am curious to know the reason for this.
>
> Thank you for your reading,
> Tal



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


