[R] Creating dummy vars with contrasts - why does the returned identity matrix contain all levels (and not n-1 levels) ?

Fri Sep 20 17:11:39 CEST 2013

On Sep 13, 2013, at 11:21 PM, E Joffe wrote:

> Hi David,
>
> First I ordered the levels of each factor in descending order of
> frequency.
> Then I used the following code to generate a matrix of dummy variables
> from the data frame and subsequently ran glmnet (coxnet):
>
> ## transform categorical variables into binary dummy variables for trainSet
> predict_matrix <- model.matrix(~ ., data = trainSet,
>                                contrasts.arg = lapply(
>                                  trainSet[, sapply(trainSet, is.factor)],
>                                  contrasts))
>
> ## remove the status/time variables from the predictor matrix (x) for glmnet
> predict_matrix <- subset(predict_matrix, select = c(-time, -status))
>
> ## create a glmnet cox object using lasso regularization and cross-validation
> glmnet.cv <- cv.glmnet(predict_matrix, surv_obj, family = "cox")
>
>
> I hope I did not do anything wrong .....
>
> Can't thank you enough for your advice and interest.

Thank you for outlining the process that you used. It looks "from the
outside" as though it respects the stricter input requirements that
cv.glmnet imposes on its first two arguments. I didn't realize that
subset could accept a `-` sign as an operator inside a c() expression,
but if you are getting success then I guess it must.
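
For what it's worth, a small check along these lines suggests why it
works: the matrix method for subset evaluates `select` against the
column positions, so negated column names become negative indices. This
is only a sketch on a made-up matrix (the columns `age` and `trt300`
are hypothetical stand-ins for your predictors), not your actual
trainSet.

```r
# Hypothetical toy matrix standing in for the model.matrix() output
m <- cbind(time   = c(5, 8, 3),
           status = c(1, 0, 1),
           age    = c(60, 71, 55),
           trt300 = c(0, 1, 0))

# Inside subset(), column names stand for column positions, so
# c(-time, -status) is evaluated as c(-1, -2)
x <- subset(m, select = c(-time, -status))

colnames(x)   # the time and status columns are dropped
is.matrix(x)  # still a matrix, which is what cv.glmnet needs
```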

-- 
David.

> Erel
>
>
>
> -----Original Message-----
> From: David Winsemius [mailto:dwinsemius at comcast.net]
> Sent: Friday, September 13, 2013 8:51 PM
> To: E Joffe
> Cc: r-help at r-project.org
> Subject: Re: [R] Creating dummy vars with contrasts - why does the
> returned identity matrix contain all levels (and not n-1 levels) ?
>
>
> On Sep 13, 2013, at 9:33 AM, E Joffe wrote:
>
>> Thank you so much for your answer  !
>> As far as I understand, glmnet doesn't accept categorical variables
>> only binary factors - so I had to create dummy variables for all
>> categorical variables.
>
> I was rather puzzled by your question. The conventions used by glmnet
> should prevent contrasts from being pre-specified. Only matrices are
> accepted as data objects and one cannot assign contrast attributes to
> matrix columns.
>
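
As a rough illustration of the point about matrices (a toy data frame
with hypothetical columns `age` and `trt`, and R's default treatment
contrasts assumed): model.matrix() expands a 4-level factor into 3
dummy columns, and what comes back is an ordinary numeric matrix of the
kind glmnet expects.

```r
# Hypothetical toy data; 'age' and 'trt' are illustrative names only
df <- data.frame(
  age = c(60, 71, 55, 48),
  trt = factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG"))
)

# With default treatment contrasts the first level of 'trt' is the
# reference and gets no column of its own
x <- model.matrix(~ ., data = df)

colnames(x)
trt_cols <- grep("^trt", colnames(x), value = TRUE)
length(trt_cols)   # n - 1 dummy columns for an n-level factor
```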
>> It worked perfectly.
>> Erel
>>
>>
>> Erel Joffe MD MSc
>> School of Biomedical Informatics
>> University of Texas - Health Science Center in Houston
>> 832.287.0829 (c)
>>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Friday, September 13, 2013 3:05 PM
>> To: E Joffe
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Creating dummy vars with contrasts - why does the
>> returned identity matrix contain all levels (and not n-1 levels) ?
>>
>>
>> On Sep 13, 2013, at 4:15 AM, E Joffe wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I have a problem with creating an identity matrix for glmnet by using
>>> the contrasts function.
>>
>> Why do you want to do this?
>>
>>> I have a factor with 4 levels.
>>>
>>> When I create dummy variables I think there should be n-1 variables
>>> (in this case 3) - so that the contrasts would be against the
>>> baseline level.
>>>
>>> This is also what is written in the help file for 'contrasts'.
>>>
>>> The problem is that the function creates a matrix with n variables
>>> (i.e. the same as the number of levels) and not n-1 (where I would
>>> have a baseline level for comparison).
>>
>> Only if you specify contrasts=FALSE does it do so and this is
>> documented in that help file.
>>>
>>>
>>>
>>> My questions are:
>>>
>>> 1.       How can I create a matrix with n-1 dummy vars ?
>>
>> See below.
>>
>>> was I supposed to
>>> define explicitly that I want contr.treatment (contrasts) ?
>>
>> No need to do so.
>>
>>>
>>> 2.       If it is not possible, how should I interpret the hazard
>>> ratios in the Cox regression I am generating (I use glmnet for
>>> variable selection and then generate a Cox regression)? That is, if
>>> I get an HR of 3 for the variable 300mg, what does it mean? The
>>> hazard is 3 times higher than what?
>>>
>>
>> Relative hazards are generally referenced to the "baseline hazard",
>> i.e. the hazard for a group with the omitted level for treatment
>> contrasts and the mean value for any numeric predictors.
>>
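
A minimal sketch of what "omitted level" means in practice, assuming
default treatment contrasts and a toy factor: the first level is the
reference, and relevel() can make PLACEBO the reference instead, so
that a hazard ratio for "300 MG" would read as relative to placebo.

```r
# Toy factor with the same four levels as in the example
trt <- factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG",
                "300 MG", "PLACEBO"))

# The first level is the one omitted by treatment contrasts
ref_default <- levels(trt)[1]

# Make PLACEBO the reference level instead
trt_ref <- relevel(trt, ref = "PLACEBO")
ref_new <- levels(trt_ref)[1]

# The remaining levels are the ones that get dummy columns
dummy_levels <- colnames(contrasts(trt_ref))
```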
>>> Here is some code to reproduce the issue:
>>>
>>> # Create a 4 level example factor
>>> trt <- factor( sample( c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
>>>                  100, replace=TRUE ) )
>>
>> # If your intent is to use contrasts different from the defaults used
>> # by regression functions, these factor contrasts need to be assigned,
>> # either within the construction of the factor or after the fact.
>>
>>> contrasts(trt)
>>         300 MG 600 MG PLACEBO
>> 1200 MG      0      0       0
>> 300 MG       1      0       0
>> 600 MG       0      1       0
>> PLACEBO      0      0       1
>>
>> # the default value for the contrasts parameter is TRUE and the
>> # default type is treatment
>>
>> # That did not cause any change to the 'trt'-object:
>> trt
>>
>> #To make a change you need to use the `contrasts<-` function:
>>
>> contrasts (trt) <- contrasts(trt)
>> trt
>>
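
A small sketch of the difference, on a toy factor: calling contrasts()
only reports the coding, while `contrasts<-` stores it as a "contrasts"
attribute on the factor itself.

```r
# Toy factor with the same four levels as in the example
trt <- factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG"))

# Merely calling contrasts() does not modify the object
invisible(contrasts(trt))
before <- attr(trt, "contrasts")

# The replacement function attaches the contrast matrix to the factor
contrasts(trt) <- contrasts(trt)
after <- attr(trt, "contrasts")

is.null(before)  # no attribute until it is assigned
dim(after)       # rows = levels, columns = levels - 1
```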
>>>
>>> # Use contrasts to get the identity matrix of dummy variables to be
>>> # used in glmnet
>>>
>>> trt2 <- contrasts (trt,contrasts=FALSE)
>>>
>>> Results (as you can see all levels are represented in the identity
>>> matrix):
>>>
>>>> levels (trt)
>>> [1] "1200 MG" "300 MG"  "600 MG"  "PLACEBO"
>>>
>>>
>>>> print (trt2)
>>>         1200 MG 300 MG 600 MG PLACEBO
>>> 1200 MG       1      0      0       0
>>> 300 MG        0      1      0       0
>>> 600 MG        0      0      1       0
>>> PLACEBO       0      0      0       1
>>>
>>>
>>>
>>> 	[[alternative HTML version deleted]]
>>
>> Rhelp is a plain text mailing list.
>>
>> -- 
>> David Winsemius, MD
>> Alameda, CA, USA
>>

David Winsemius, MD
Alameda, CA, USA