[R] Lasso with Categorical Variables

Steve Lianoglou mailinglist.honeypot at gmail.com
Mon May 2 19:51:00 CEST 2011


Hi,

On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2 at ncsu.edu> wrote:
> Hi! This is my first time posting. I've read the general rules and
> guidelines, but please bear with me if I make some fatal error in
> posting. Anyway, I have a continuous response and 29 predictors made
> up of continuous variables and nominal and ordinal categorical
> variables. I'd like to do lasso on these, but I get an error. The way
> I am using "lars" doesn't allow for the factors. Is there a special
> option or some other method in order to do lasso with cat. variables?
>
> Here is an example (considering ordinal variables as just nominal):
>
> set.seed(1)
> Y <- rnorm(10,0,1)
> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
> X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
> X4 <- rchisq(10, df=4, ncp=0)
> X <- data.frame(X1,X2,X3,X4)
>
>> str(X)
> 'data.frame':   10 obs. of  4 variables:
>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>  $ X3: int  51 46 50 44 43 50 30 42 49 48
>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
>
>
> I'd like to do:
> obj <- lars(x=X, y=Y, type = "lasso")
>
> Instead, what I have been doing is converting all data to continuous
> but I think this is really bad!

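Yeah, it is: coding a nominal factor as numbers imposes an ordering
and a spacing on its levels that the data don't actually have.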

Check out the "Categorical Predictor Variables" section here for a way
to handle such predictor vars:
http://www.psychstat.missouristate.edu/multibook/mlt08m.html
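
In R terms, the usual way to do that kind of coding is to expand each
factor into 0/1 dummy columns with model.matrix() and pass the resulting
numeric matrix to lars(). Below is a minimal sketch of the idea, not a
tested recipe for your real data. It assumes the lars package is
installed, it rebuilds your toy data with n = 50 instead of 10 (my
change, so the fit isn't degenerate), and the name Xmat is just
something I made up:

```r
library(lars)  # assumption: the lars package is installed

set.seed(1)
n  <- 50  # assumption: larger than the original 10 rows
Y  <- rnorm(n, 0, 1)
X1 <- factor(sample(x = LETTERS[1:4],  size = n, replace = TRUE))
X2 <- factor(sample(x = LETTERS[5:10], size = n, replace = TRUE))
X3 <- sample(x = 30:55, size = n, replace = TRUE)  # think age
X4 <- rchisq(n, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand the factors into treatment-coded 0/1 dummy columns.
# The first column of the result is the intercept; drop it,
# since lars() fits its own intercept by default.
Xmat <- model.matrix(~ ., data = X)[, -1]

# Xmat is a plain numeric matrix, so lars() accepts it.
obj <- lars(x = Xmat, y = Y, type = "lasso")
```

One caveat: the lasso then treats every dummy column as a separate
predictor, so it can keep some levels of a factor and drop others
rather than selecting the factor as a whole.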

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


