[R] Why do we have to turn factors into characters for various functions?

Sun Dec 12 16:55:58 CET 2010

Well ... because....

These are language design issues, and you therefore need to understand
something about computer languages to understand the context. Here are
brief answers that offer my take; others may be able to fill in or
correct.

1. The factor type/class is R's version of C's enum declaration. So
you might want to read about that. It can save a lot of storage space
(perhaps not as relevant now as 30 years ago), provide associative
arrays, and so forth. This is quite useful. But, as you have observed,
there are some gotchas due to confusion between the internal
representation of factors (as integer codes) and the external view (as
vectors of character strings given by the levels attribute). Some
quite wise folks (Terry Therneau is one, I believe) have found factors
sufficiently annoying (especially within data frames) that they
recommend their avoidance.
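
A minimal sketch of the two views, using a made-up factor f whose labels
happen to look like numbers (the variable names are only for
illustration):

```r
# A small factor whose labels happen to look like numbers.
f <- factor(c("10", "20", "10", "30"))

typeof(f)        # storage type: "integer"
levels(f)        # the labels: "10" "20" "30"
unclass(f)       # the integer codes, with the levels attribute attached
codes <- as.integer(f)                    # codes only: 1 2 1 3

# Two ways to recover the labels as numbers rather than the codes.
by_label <- as.numeric(as.character(f))   # 10 20 10 30
by_level <- as.numeric(levels(f))[f]      # same values; converts each level once
```

The second form indexes the converted levels by the factor's codes, so
each distinct level is converted only once.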

2. The business with strsplit() reflects R's object-oriented structure
and has essentially nothing to do with factors, per se. strsplit() is
a function defined only for character data and is not a generic with a
method for factors. Period. Whence the error message. You could, of
course, easily make it generic with a factor method (via as.character,
presumably).
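
One way this might look, as a sketch only: rather than masking base
strsplit(), the example below defines a separate S3 generic. The name
split_strings is hypothetical and is not part of base R.

```r
# Hypothetical generic; the name split_strings is an assumption, not base R.
split_strings <- function(x, split, ...) UseMethod("split_strings")

# Character input goes straight to base strsplit().
split_strings.character <- function(x, split, ...) strsplit(x, split, ...)

# Factors are converted to their labels first, then dispatched again.
split_strings.factor <- function(x, split, ...) {
  split_strings(as.character(x), split, ...)
}

# The factor from the original question.
x <- factor(paste(letters[4:6], 4:6, sep = "A"))
parts <- split_strings(x, "A")
```

With this in place split_strings(x, "A") works on the factor directly,
while strsplit(x, "A") itself still fails with the non-character
argument error.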

HTH,

-- Bert

On Sat, Dec 11, 2010 at 4:13 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
> Hi Tal,
>
> I always think of factors as a way of imposing (however arbitrarily)
> order on some variable.  To that extent, the key aspect is first,
> second, third, etc., represented numerically in factors as 1, 2, 3,
> etc.  The labels are for convenience and interpretation.  Consider:
>
> x <- factor(c(5, 4, 6))
> y <- factor(c(6, 5, 7))
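> as.numeric(x)  # 2 1 3 (codes; the sorted levels are "4" "5" "6")
> as.numeric(y)  # 2 1 3 (same codes, although the levels are "5" "6" "7")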
>
> Is the numeric or character value of 5 more important?  Or is it its
> relative position?
>
> If you have character data that you might want to split and
> manipulate, store it as a character vector (you can set
> options(stringsAsFactors = FALSE) so that read.table() keeps strings
> as character by default).  If your factor
> labels are numeric, that suggests it might have been better stored as
> numeric in the first place.  Generally, when I find myself converting
> factors to numeric or character class data, it means I've been using
> factor() to recode data (which is not its intended purpose).
>
> My 2 cents.
>
> Cheers,
>
> Josh
>
> On Sat, Dec 11, 2010 at 2:48 PM, Tal Galili <tal.galili at gmail.com> wrote:
>> Hello dear R-help mailing list,
>>
>> My question is *not* about how factors are implemented in R (which is, if I
>> understand correctly, that factors keep integer codes and assign levels to them).
>> My question *is* about why so many functions that work on factors don't
>> treat them as characters by default.
>>
>> Here are two simple examples:
>> Example one, turning the characters inside a factor into numeric:
>>
>> x <- factor(4:6)
>> as.numeric(x) # output: 1 2 3
>> as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?
>>
>>
>> Example two, using strsplit on a factor:
>>
>> x <- factor(paste(letters[4:6], 4:6, sep="A"))
>> strsplit(x, "A") # will result in an error:
>> # Error in strsplit(x, "A") : non-character argument
>> strsplit(as.character(x), "A") # will work and split
>>
>>
>> So what is the reason this is the case?
>> Is it that implementing a switch of factors to characters as the default in
>> some of the basic functions would cause old code to break?
>> Is it a better design in some other way?
>>
>> I am curious to know the reason for this.
>>
>> Thank you for reading,
>> Tal
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> University of California, Los Angeles
> http://www.joshuawiley.com/
>

-- 
Bert Gunter
Genentech Nonclinical Biostatistics
467-7374
http://devo.gene.com/groups/devo/depts/ncb/home.shtml