[R] Basic Dummy Variable Creation

Fri Sep 5 17:50:59 CEST 2003

Dear Francisco,

At 08:31 AM 9/5/2003 -0500, Francisco J. Bido wrote:
>Hi There,
>
>While looking through the mailing list archive, I did not come across a 
>simple minded example regarding the creation of dummy variables.  The 
>Gauss language provides the command "y = dummydn(x,v,p)" for creating 
>dummy variables.
>Here:
>
>x = Nx1 vector of data to be broken up into dummy variables.
>v = (K-1)x1 vector specifying the K-1 breakpoints
>p = positive integer in the range [1,K], specifying which column should be 
>dropped in the matrix of dummy variables.
>y = Nx(K-1) matrix containing the K-1 dummy variables.
>
>My recent mailing list archive inquiry has led me to examine R's 
>"model.matrix" but it has so many options that I'm not seeing the forest 
>because of the trees.  Is that really the easiest way? or is there 
>something similar to the dummydn command described above?
>
>To provide a concrete scenario, please consider the following.  Using the 
>above notation, say, I had:
>
>x <- 1:10         # data to be broken up into dummy variables
>v <- c(3, 5, 7)   # breakpoints
>p <- 1            # drop this column to avoid the dummy variable trap
>
>How can I get a matrix "y" that has the associated dummy variables for 
>columns?
>Thank You,
>-Francisco

My initial question would be: why do you want to do this? Statistical-model 
formulas in R implicitly generate dummy variables (and other contrasts) 
directly from factors, so if model fitting is the context that you had in 
mind, there's no need to generate the dummy variables explicitly.
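
To illustrate, here is a minimal sketch with simulated data. The names g 
and resp are made up for the example, and the response is just random 
noise; the point is only that lm() builds the dummy regressors from the 
factor by itself.

```r
set.seed(1)                                  # simulated data, for illustration only
x    <- 1:10
g    <- cut(x, breaks = c(-Inf, 3, 5, 7, Inf))  # four-level factor built from x
resp <- rnorm(10)                            # hypothetical response variable

fit <- lm(resp ~ g)   # dummy regressors for g are generated internally
coef(fit)             # intercept plus one coefficient per non-baseline level
```

With the default treatment contrasts, the intercept is the mean response 
in the baseline (first) level of g, and each remaining coefficient is a 
difference from that baseline.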

If you really do want the matrix of dummy regressors, say for a factor 
named "factor", then you can use model.matrix() to get them. Because the 
default contrast type for unordered factors is "contr.treatment", which 
corresponds to 0/1 dummy regressors, you can get the dummy variables as 
model.matrix(~ factor)[, -1]. Here I've removed the initial column of ones 
returned by model.matrix(). Alternatively, model.matrix(~ factor - 1) gives 
you a complete set of dummy regressors, one per level; you could then drop 
whichever column you wanted to.
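
Applied to your concrete scenario, this might look something like the 
sketch below. It is not a drop-in replacement for dummydn: I am assuming 
that the categories are open on the left and closed on the right, with the 
lowest and highest categories unbounded, and I use cut() to turn x into a 
factor first. The name f is arbitrary.

```r
x <- 1:10          # data to be broken up into dummy variables
v <- c(3, 5, 7)    # breakpoints
p <- 1             # column to drop

# Assumed intervals: (-Inf,3], (3,5], (5,7], (7,Inf]
f <- cut(x, breaks = c(-Inf, v, Inf), right = TRUE)

# Complete set of dummy regressors (one per category), then drop column p
full <- model.matrix(~ f - 1)
y <- full[, -p, drop = FALSE]
y
```

With p equal to 1 the values in y match those of model.matrix(~ f)[, -1]; 
for any other p you need the "- 1" form so that the baseline column is 
available to keep.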

More generally, if you haven't already done so, you might look at how 
linear-model formulas are implemented in R. All of the introductions to R 
cover this topic. I think that this is one of the strengths of the S 
language, by the way.

I hope that this helps,
  John
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox