[R] Collinearity? Cannot get logisticRidge{ridge} to work

peter dalgaard pdalgd at gmail.com
Thu May 28 11:26:52 CEST 2015


On 28 May 2015, at 00:06 , Kengo Inagaki <kengoing.gj at gmail.com> wrote:

> I did not understand complete separation quite well..
> Thank you very much for clarification.
> 
> Kengo
> 
> 2015-05-27 17:03 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>> 
>> On May 27, 2015, at 3:00 PM, Kengo Inagaki wrote:
>> 
>>> Here is the result-
>>> 
>>>> with(a,  table(Sex, Therapy1,  Outcome) )
>>> , , Outcome = Alive
>>> 
>>>       Therapy1
>>> Sex      no yes
>>> female  0   4
>>> male    4   5
>>> 
>>> , , Outcome = Death
>>> 
>>>       Therapy1
>>> Sex      no yes
>>> female  6   3
>>> male    3   0
>> 
>> So no survivors when females had no Therapy1, and no deaths when males had Therapy1. Complete separation.


Actually not quite complete separation, but just as bad.  If you look at the linear combination Sex + Therapy1 (coding female/no = 0, male/yes = 1), you get

0 (female, no therapy)
1 (female, therapy OR male, no therapy)
2 (male, therapy)


0: 6 dead, 0 survive
1: 6 dead, 8 survive
2: 0 dead, 5 survive

and any logistic curve through (1, log(6/8)) fits the middle point exactly, while the other two points are fitted better and better as the curve gets steeper, so the fit diverges.
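For concreteness, here is a sketch that reproduces the score table in R. The data frame is rebuilt from the counts in the quoted three-way table (so it is a reconstruction, not the poster's actual data); the variable names follow the thread:

```r
## Rebuild the 25 patients from the counts in the quoted three-way table
counts <- expand.grid(Sex = c("female", "male"),
                      Therapy1 = c("no", "yes"),
                      Outcome = c("Alive", "Death"))
counts$n <- c(0, 4, 4, 5,   # Alive: female/no, male/no, female/yes, male/yes
              6, 3, 3, 0)   # Death: female/no, male/no, female/yes, male/yes
a <- counts[rep(seq_len(nrow(counts)), counts$n), 1:3]

## The linear combination discussed above: 0, 1, or 2
score <- (a$Sex == "male") + (a$Therapy1 == "yes")
table(score, a$Outcome)
## scores 0 and 2 are pure (all dead / all alive); only score 1 mixes outcomes
```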

That's a general pattern: you can have complete separation except at one point and still get divergence. The same thing happens (and it really is the same phenomenon) in multiple regression with k parameters when there is a (k-1)-dimensional hyperplane in predictor space with all responses 0 on one side and all 1 on the other, but possibly both 0 and 1 _on_ the hyperplane. Google tells me that this is called quasicomplete separation.
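The divergence is easy to demonstrate with glm; a self-contained sketch, with the data frame again rebuilt from the counts quoted in the thread:

```r
## Quasicomplete separation in action
counts <- expand.grid(Sex = c("female", "male"),
                      Therapy1 = c("no", "yes"),
                      Outcome = c("Alive", "Death"))
counts$n <- c(0, 4, 4, 5, 6, 3, 3, 0)
a <- counts[rep(seq_len(nrow(counts)), counts$n), 1:3]

## With a tighter convergence criterion and more IRLS iterations the
## coefficient estimates (and their standard errors) keep growing while
## the deviance flattens out -- the hallmark of a diverging fit
fit_default <- glm(Outcome ~ Sex + Therapy1, data = a, family = binomial)
fit_long    <- glm(Outcome ~ Sex + Therapy1, data = a, family = binomial,
                   control = glm.control(epsilon = 1e-14, maxit = 100))
cbind(default = coef(fit_default), long = coef(fit_long))
```

`summary(fit_long)` shows the inflated standard errors and near-1 p values that the original poster observed.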

-pd

>> 
>> --
>> David.
>> 
>>> 
>>> 
>>> 2015-05-27 16:57 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>>>> 
>>>> On May 27, 2015, at 2:49 PM, Kengo Inagaki wrote:
>>>> 
>>>>> Thank you very much for your rapid response. I sincerely appreciate your input.
>>>>> I am sorry for sending the previous email in HTML format.
>>>>> 
>>>>> with(a,  table(Sex, Therapy1) )   shows the following.
>>>>>        Therapy1
>>>>> Sex      no yes
>>>>> female  6   7
>>>>> male    7   5
>>>>> 
>>>>> and with(a,  table(Sex, Outcome) ) and with(a,  table(Therapy1, Outcome) )
>>>>> elicit the following
>>>>> 
>>>>>      Outcome
>>>>> Sex      Alive Death
>>>>> female     4     9
>>>>> male       9     3
>>>>> 
>>>>>      Outcome
>>>>> Therapy1 Alive Death
>>>>>   no      4     9
>>>>>   yes     9     3
>>>> 
>>>> Then what about:
>>>> 
>>>> with(a,  table(Sex, Therapy1,  Outcome) )
>>>> 
>>>> --
>>>> David
>>>> 
>>>> 
>>>>> 
>>>>> As there are no zero cells, it does not seem to be complete separation.
>>>>> I really appreciate comments.
>>>>> 
>>>>> Kengo Inagaki
>>>>> Memphis, TN
>>>>> 
>>>>> 
>>>>> 2015-05-27 13:57 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>>>>>> 
>>>>>> On May 27, 2015, at 10:10 AM, Kengo Inagaki wrote:
>>>>>> 
>>>>>>> I am currently working on a health care related project using R. I am
>>>>>>> learning R while working on data analysis.
>>>>>>> 
>>>>>>> Below is the part of the data in which I am encountering a problem.
>>>>>>> 
>>>>>>> 
>>>>>>> Case#  Sex   Therapy1  Therapy2  Outcome
>>>>>>> 1      male  no        no        Alive
>>>>>>>
>>>>>> 
>>>>>> snipped mangled data sent in HTML
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> "Outcome" is the response variable and "Sex", "Therapy1", "Therapy2" are
>>>>>>> predictor variables.
>>>>>>> 
>>>>>>> All of the predictors are significantly associated with the outcome by
>>>>>>> univariate analysis.
>>>>>>> 
>>>>>>> Logistic regression runs fine with most of the predictors when "Sex" and
>>>>>>> "Therapy1" are not included at the same time. (This is part of a table
>>>>>>> that I cut out from a larger one for ease of presentation; there are
>>>>>>> more predictors that I tested.)
>>>>>> 
>>>>>> Please examine the data before reaching for ridge regression:
>>>>>> 
>>>>>> What does this show: ...
>>>>>> 
>>>>>>  with(a,  table(Sex, Therapy1) )
>>>>>> 
>>>>>> I predict you will see a zero cell entry. Then read about "complete separation" and the so-called "Hauck-Donner effect".
>>>>>> 
>>>>>> --
>>>>>> David.
>>>>>>> 
>>>>>>> However, when "Sex" and "Therapy1" are included in the logistic regression
>>>>>>> model at the same time, the standard errors inflate and the p values get close to 1.
>>>>>>> 
>>>>>>> The formula used is,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Model <- glm(Outcome ~ Sex + Therapy1, data = a, family = binomial) # "a" is the
>>>>>>> data frame holding the table above.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> After doing some reading, I suspect this might be collinearity, as the VIF
>>>>>>> values (using the "vif()" function in the car package) were sky-high
>>>>>>> (8,875,841 for both "Sex" and "Therapy1").
>>>>>>> 
>>>>>>> Learning that ridge regression may be a solution, I attempted
>>>>>>> logisticRidge {ridge} using the following formula, but I get the
>>>>>>> accompanying error message.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> logisticRidge(a$Outcome~a$Sex+a$Therapy1)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Error in ifelse(y, log(p), log(1 - p)) :
>>>>>>> 
>>>>>>> invalid to change the storage mode of a factor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> At this point I have no idea how to solve this and would like to
>>>>>>> seek help.
>>>>>>> 
>>>>>>> I really really appreciate your input!!!
>>>>>>> 
>>>>>>>    [[alternative HTML version deleted]]
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> David Winsemius
>>>>>> Alameda, CA, USA
>>>>>> 
>>>> 
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>> 
>> 
>> David Winsemius
>> Alameda, CA, USA
>> 
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


