[R] problems with coercing a factor to be numeric

S Ellison S.Ellison at LGCGroup.com
Thu Jan 24 05:58:18 CET 2013

On 23 Jan 2013, at 21:36, "Francesco Sarracino" <f.sarracino at gmail.com> wrote:

> .... what I meant refers to the fact  that  I've read on "an R and
> S-plus companion to applied regression" about methods to alter the encoding
> of factors when using contrasts in regressions. These are options (for
> contrasts) that can be easily set as "option('contrasts')". This command
> changes the way R creates the dummies out of a factor and various methods
> are available.
> I was expecting that R might have had something similar that applied to my
> case, thus changing the way R attaches numeric values to my dummy variable.
> I am just surprised that such option doesn't exist. I was having wrong
> expectations.

Such options do exist, but at modelling time, not factor creation/conversion time.

When created, by calls to 'factor' or in functions like 'read.table', factors are stored internally as integers with a list of labels (what you see as factor levels) that go with each integer. Those internal integers start at 1 and go up. You can set the ordering of those labels (by specifying the "levels" argument in factor()) so that, for example, 'yes' and 'no' can be associated with (numeric) factor levels 1 and 2 respectively, instead of the default ordering, which would put 'no' alphabetically before 'yes'. (I find this choice particularly useful for orderings like "high", "medium", "low", for which the alphabetic ordering is not exactly intuitive; similarly, alphabetic ordering puts '1', '2', '10' in the order '1', '10', '2' and so on, so that often needs specifying manually. It's also useful to specify levels if you want things like boxplots to come out in a particular order, as boxplots by default use the order of the factor levels.)
The internal integer values are returned by 'as.numeric'. If your factor level labels - which are always character - are also interpretable as numbers, you need 'as.character' to return the character strings and then 'as.numeric' to convert those, i.e. as.numeric(as.character(f)).
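
As a minimal sketch of the points above (the vectors 'answers' and 'doses' are made up purely for illustration, not taken from your data), something like this shows the difference between the internal codes and the labels:

```r
# Hypothetical example data, for illustration only
answers <- c("yes", "no", "yes", "yes")

# Default: levels are sorted alphabetically, so "no" gets internal code 1
f_default <- factor(answers)
levels(f_default)         # "no" "yes"
as.integer(f_default)     # 2 1 2 2

# Explicit level order: "yes" is now internal code 1
f_ordered <- factor(answers, levels = c("yes", "no"))
as.integer(f_ordered)     # 1 2 1 1

# Labels that look like numbers are still character strings
doses <- factor(c("10", "2", "1", "10"))
levels(doses)                         # "1" "10" "2" (sorted as strings)
codes  <- as.numeric(doses)                 # internal codes: 2 3 1 2
values <- as.numeric(as.character(doses))   # the numbers you meant: 10 2 1 10
```

Note that as.numeric() on the factor silently returns the codes rather than raising an error, which is why this mistake is easy to miss.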

Now, up to this point you just have more or less arbitrary integers associated with the original factor levels (the degree of arbitrariness depends on whether you specified the level order or let R use its default). These integers are not the contrasts used in model fitting. Contrasts are set at model matrix building time; they are not a fixed attribute of the factor. The internal numbering of levels affects contrasts only to the extent that the numerical values used in setting contrasts are usually in the same order as the factor levels. You can inspect the functions used to associate contrasts with factor levels by using options("contrasts"). You can inspect the numerical values that would currently be used for a given factor with a call to contrasts(). You can change the contrast assignments globally using options() or explicitly in some model calls (lm, for example, has a contrasts argument), and if you like you can write your own contrast functions to set any values you like. The most common are probably treatment contrasts, which take the first factor level as the baseline (it ends up in the intercept) and code the rest as (unit) differences from that, and sum-to-zero contrasts, which do what they say, setting contrasts that sum to zero by choosing a set like (-1, 0, 1).
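
A rough sketch of inspecting and changing contrasts, assuming a small made-up data frame 'd' with a three-level factor (the names 'f', 'd', 'fit_trt' and 'fit_sum' are mine, chosen only for illustration):

```r
# Hypothetical three-level factor with an explicit level order
f <- factor(c("low", "medium", "high"),
            levels = c("low", "medium", "high"))

options("contrasts")   # names of the contrast functions currently in use
contrasts(f)           # coding that would be used for f under those settings

# The coding matrices themselves, for a three-level factor
trt <- contr.treatment(3)   # first level is the baseline (row of zeros)
smz <- contr.sum(3)         # each column sums to zero

# Made-up response values; contrasts chosen per model via lm()'s
# 'contrasts' argument rather than by changing the global option
d <- data.frame(y = c(1.0, 2.0, 4.0, 1.5, 2.5, 3.5), f = rep(f, 2))
fit_trt <- lm(y ~ f, data = d, contrasts = list(f = "contr.treatment"))
fit_sum <- lm(y ~ f, data = d, contrasts = list(f = "contr.sum"))

coef(fit_trt)   # intercept is the mean of the baseline level "low"
coef(fit_sum)   # intercept is the mean of the level means
```

Both fits give the same fitted values; only the meaning of the coefficients changes. A global change would be something like options(contrasts = c("contr.sum", "contr.poly")), which affects every subsequent model fit in the session.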

So you actually have a great deal of control over both the order in which labels are associated with factor levels and the (separate) values of contrasts associated with those factor levels at modelling time. 

The cost of that control is some complexity, and the time needed to learn what's going on to use it all properly. 

Hope that helps ...

S Ellison
