[R] Lasso with Categorical Variables

Andrew Robinson A.Robinson at ms.unimelb.edu.au
Tue May 3 03:27:38 CEST 2011


On Mon, May 02, 2011 at 05:22:57PM -0400, Clemontina Alexander wrote:
> Thanks for your response, but I guess I didn't make my question clear.
> I am already familiar with the concept of dummy variables and
> regression in R. My question is, can the "lars" package (or some other
> lasso algorithm) handle factors? I did use dummy variables in my
> original data, but lars (lasso) only shrank the coefficients of some
> of the levels of one factor to 0. Is this the correct thing to do?

That happens because, so far as the linear model is concerned, a factor
is only a convenience to help us handle its dummy variables, and the
lasso penalizes each dummy column separately. So, yes, it's the
correct thing to do.  It sounds to me as though you are after a
shrinkage device that will treat the factor as a whole.
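
One device of that kind is the group lasso, which puts a single penalty
on all the dummy columns belonging to one factor so that they enter or
leave the model together. The following is only a rough sketch: it
assumes the grpreg package (not mentioned elsewhere in this thread) is
installed, and it reuses the toy data from the original post. The names
mm, xmat and group are mine, and the simulated values may differ from
those shown in the post depending on the R version.

```r
# Rebuild the example data from the original post.
set.seed(1)
Y  <- rnorm(10, 0, 1)
X1 <- factor(sample(x = LETTERS[1:4], size = 10, replace = TRUE))
X2 <- factor(sample(x = LETTERS[5:10], size = 10, replace = TRUE))
X3 <- sample(x = 30:55, size = 10, replace = TRUE)  # think age
X4 <- rchisq(10, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand the factors into dummy columns, as lm() does internally.
mm <- model.matrix(~ X1 + X2 + X3 + X4, data = X)

# The "assign" attribute maps each column back to its term:
# 0 is the intercept, 1 is X1, 2 is X2, and so on.
group <- attr(mm, "assign")[-1]   # drop the intercept entry
xmat  <- mm[, -1, drop = FALSE]   # drop the intercept column

# Group lasso (assumption: grpreg is available). Columns that share a
# group number are shrunk to zero together or not at all.
if (requireNamespace("grpreg", quietly = TRUE)) {
  fit <- grpreg::grpreg(xmat, Y, group = group, penalty = "grLasso")
  # Coefficients at one point along the penalty path.
  print(coef(fit, which = min(5, length(fit$lambda))))
}
```

With ten observations this is purely illustrative; the point is that
the group vector, not the individual dummy columns, decides what gets
dropped.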

> Because intuitively it seems like I would want to shrink the whole
> factor coefficient to 0. If this is correct, what is the
> interpretation? For example, for X1, if lasso drops the coefficient
> for levels A and B, but not C and D, does this mean that X1 should be
> included in the model?

It means that X1 should be recoded to have three levels: C, D, and a
single combined level for the rest.
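
A minimal sketch of that recoding in base R, assuming the lasso really
did zero out A and B. The values of X1 are copied from the str() output
in the original post; the level name "other" and the object name X1r
are arbitrary choices of mine.

```r
# X1 as shown in the str() output of the original post.
X1 <- factor(c("D", "A", "C", "A", "B", "B", "A", "B", "D", "B"))

# Merge the levels whose coefficients were shrunk to zero (A and B)
# into one level. Assigning the same name to several levels makes R
# collapse them into a single level.
X1r <- X1
levels(X1r)[levels(X1r) %in% c("A", "B")] <- "other"

# Cross-tabulate old against new coding to check the merge.
print(table(X1, X1r))
print(levels(X1r))
```

The recoded factor can then replace X1 in the model, with "other" as
the baseline level under the default treatment contrasts.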

Cheers

Andrew

> Thanks.
> 
> 
> 
> On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> >
> > On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
> >
> >> Hi,
> >>
> >> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2 at ncsu.edu>
> >> wrote:
> >>>
> >>> Hi! This is my first time posting. I've read the general rules and
> >>> guidelines, but please bear with me if I make some fatal error in
> >>> posting. Anyway, I have a continuous response and 29 predictors made
> >>> up of continuous variables and nominal and ordinal categorical
> >>> variables. I'd like to do lasso on these, but I get an error. The way
> >>> I am using "lars" doesn't allow for the factors. Is there a special
> >>> option or some other method in order to do lasso with cat. variables?
> >>>
> >>> Here is an example (considering ordinal variables as just nominal):
> >>>
> >>> set.seed(1)
> >>> Y <- rnorm(10,0,1)
> >>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
> >>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
> >>> X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
> >>> X4 <- rchisq(10, df=4, ncp=0)
> >>> X <- data.frame(X1,X2,X3,X4)
> >>>
> >>>> str(X)
> >>>
> >>> 'data.frame':   10 obs. of  4 variables:
> >>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
> >>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
> >>>  $ X3: int  51 46 50 44 43 50 30 42 49 48
> >>>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
> >>>
> >>>
> >>> I'd like to do:
> >>> obj <- lars(x=X, y=Y, type = "lasso")
> >>>
> >>> Instead, what I have been doing is converting all data to continuous
> >>> but I think this is really bad!
> >>
> >> Yeah, it is.
> >>
> >> Check out the "Categorical Predictor Variables" section here for a way
> >> to handle such predictor vars:
> >> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
> >
> > Steve's citation is somewhat helpful, but not sufficient to take the next
> > steps. You can find details regarding the mechanics of typical linear
> > regression in R on the ?lm page where you find that the factor variables are
> > typically handled by model.matrix. See below:
> >
> >> model.matrix(~X1 + X2 + X3 + X4, X)
> >   (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
> > 1            1   0   0   1   0   1   0   0 51 2.8640884
> > 2            1   0   0   0   0   0   1   0 46 1.5462243
> > 3            1   0   1   0   0   1   0   0 50 1.9430901
> > 4            1   0   0   0   1   0   0   0 44 2.4504180
> > 5            1   1   0   0   0   0   0   1 43 2.7535052
> > 6            1   1   0   0   0   0   0   1 50 1.6200326
> > 7            1   0   0   0   0   0   0   1 30 0.5750533
> > 8            1   1   0   0   0   0   0   0 42 5.9224777
> > 9            1   0   0   1   0   0   0   1 49 2.0401528
> > 10           1   1   0   0   0   1   0   0 48 6.2995288
> > attr(,"assign")
> >  [1] 0 1 1 1 2 2 2 2 3 4
> > attr(,"contrasts")
> > attr(,"contrasts")$X1
> > [1] "contr.treatment"
> >
> > attr(,"contrasts")$X2
> > [1] "contr.treatment"
> >
> > The numeric variables are passed through, while the dummy variables for
> > factor columns are constructed (as treatment contrasts) and the whole thing
> > is returned in a neat package.
> >
> > --
> > David.
> >>
> >> HTH,
> >> -steve
> >>
> > --
> > David Winsemius, MD
> > Heritage Laboratories
> > West Hartford, CT
> >
> >
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Andrew Robinson  
Program Manager, ACERA 
Department of Mathematics and Statistics            Tel: +61-3-8344-6410
University of Melbourne, VIC 3010 Australia               (prefer email)
http://www.ms.unimelb.edu.au/~andrewpr              Fax: +61-3-8344-4599
http://www.acera.unimelb.edu.au/

Forest Analytics with R (Springer, 2011) 
http://www.ms.unimelb.edu.au/FAwR/
Introduction to Scientific Programming and Simulation using R (CRC, 2009): 
http://www.ms.unimelb.edu.au/spuRs/
