[R] "Centered" dummy variables; non zero/one coding

Wed Oct 13 12:48:50 CEST 2004

This can be done by setting a contrast function or matrix on a variable.
Look in e.g. chapter 6 of MASS (the only comprehensive tutorial on coding 
factors in R, it seems).
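
As a rough sketch of the contrast-matrix route (the simulated data, the names
race and y, and the binomial family are made up for illustration, not taken
from the question): start from treatment contrasts and subtract each level's
observed frequency, so that every dummy column averages zero over the data.
Note that this keeps the intercept in the model.

```r
set.seed(1)
race <- factor(sample(c("B", "H", "OTHER", "W"), 200, replace = TRUE,
                      prob = c(0.2, 0.3, 0.1, 0.4)))
y <- rbinom(200, 1, 0.4)                      # hypothetical response

p <- as.vector(prop.table(table(race)))       # observed level frequencies
k <- nlevels(race)

# 0/1 treatment contrasts with each column shifted down by that level's
# frequency, so each column of the model matrix has mean zero
cc <- contr.treatment(k) - matrix(p[-1], nrow = k, ncol = k - 1, byrow = TRUE)
dimnames(cc) <- list(levels(race), levels(race)[-1])
contrasts(race) <- cc

fit <- glm(y ~ race, family = binomial)
X <- model.matrix(fit)
colMeans(X[, -1])                             # numerically zero
```

With this coding the intercept is the average of the linear predictor over
the dataset, since the race columns contribute zero on average.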

On Tue, 12 Oct 2004, Peter Holck wrote:

> I'm uncertain if this is perhaps a stupid question:
> 
> I want to create "centered" dummy variables to use in a call to glm(), and
> am wondering if there's some slick method in R to do so.  That is, rather than
> have a factor, which results in a glm() fit returning coefficients
> specifying either absence or presence of the factor, I'd like to fit a glm()
> without intercept such that the estimated coefficients (standard errors)
> represent the "average" value in my data set for that variable.  

Is that really what you want?  An `average' person having linear predictor 
0, or more precisely, the linear predictor having an average of zero over 
the dataset?  What family of glm is this?

> An example: a data set has Race specified with 4 levels.  I can manually
> specify 4 dummy variables for a no-intercept model with each variable rather
> than having a value of zero or one, has a centered value based on its
> frequency of occurrence in the data set.  Thus if 30% of the records in the
> data set have Race of Hispanic, I can define a variable HISP that has a
> value of either -.3 or .7, resulting in my coefficient estimate for HISP
> representing the effect of an "average" person in the database (and a
> corresponding valid standard error).   

Nope.  A person can only have one race, so the coefficient estimates can 
only represent jointly the effect of picking one of the possible races.

I think what you are striving for is that the average of the term `race' 
be zero over the whole dataset.  That's easy -- just compute the average 
and subtract it via an offset term.
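
One possible reading of this, sketched with simulated data (the names and
the Poisson family are assumptions for illustration): fit with the default
coding, compute the average contribution of the race term, and move that
constant from the term into the intercept.  The sketch does the subtraction
after fitting rather than through offset(); predict(type = "terms") applies
the same centering.

```r
set.seed(2)
race <- factor(sample(c("B", "H", "OTHER", "W"), 200, replace = TRUE))
y <- rpois(200, lambda = 3)                   # hypothetical response
fit <- glm(y ~ race, family = poisson)

# Contribution of the race term under default treatment coding
b <- c(0, coef(fit)[-1])                      # baseline level has effect 0
race_term <- b[as.integer(race)]
avg <- mean(race_term)
centered <- race_term - avg                   # averages zero over the data
shifted_intercept <- coef(fit)[1] + avg       # absorbs the subtracted average

# predict() centers each term the same way and reports the constant
tt <- predict(fit, type = "terms")
all.equal(as.vector(tt[, "race"]), as.vector(centered))
all.equal(attr(tt, "constant"), unname(shifted_intercept))
```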

Once you have two or more factor predictors, aliasing will get in your 
way.
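
A small illustration of the aliasing, using simulated data and a second,
hypothetical factor sex: with a full set of indicator columns for each
factor and no intercept, each block of columns sums to one, so the design
matrix loses a rank and glm() reports one coefficient as NA.

```r
set.seed(3)
n <- 100
race <- factor(sample(c("B", "H", "OTHER", "W"), n, replace = TRUE))
sex  <- factor(sample(c("F", "M"), n, replace = TRUE))
y <- rnorm(n)                                 # hypothetical response

# Full 0/1 indicator columns for both factors, no intercept
X <- cbind(model.matrix(~ 0 + race), model.matrix(~ 0 + sex))
ncol(X)                                       # number of columns supplied
qr(X)$rank                                    # one less than ncol(X)

fit <- glm(y ~ 0 + X)
coef(fit)                                     # one coefficient is NA (aliased)
```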

> One way to create these "centered dummy variables" from the original factor
> is:
> 		"B"=scale(RACE=="B",scale=FALSE),
> 		"W"=scale(RACE=="W",scale=FALSE),
> 		"H"=scale(RACE=="H",scale=FALSE),
> 		"OTHRACE"=scale(RACE=="OTHER",scale=FALSE)
> 
> However I wonder if there is some method in R to avoid having to manually
> define a large number of these dummy variables for a more complicated
> dataset.
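
If the centered columns themselves are wanted, one way to avoid typing them
out is to let model.matrix() build the indicators and scale() center them.
This is only a sketch on a made-up data frame dat.  Note that the centered
columns of a single factor sum to zero in every row, so they cannot all be
estimated together.

```r
set.seed(4)
dat <- data.frame(RACE = factor(sample(c("B", "H", "OTHER", "W"), 50,
                                       replace = TRUE)))

ind <- model.matrix(~ 0 + RACE, data = dat)      # one 0/1 column per level
cen <- scale(ind, center = TRUE, scale = FALSE)  # subtract each column's mean

colnames(cen)
round(colMeans(cen), 12)                         # all zero
range(rowSums(cen))                              # rows sum to zero as well
```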

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595