[R] Questions about lrm, validate, pentrace (Re: BMA, logistic regression, odds ratio, model reduction etc)

Frank Harrell f.harrell at vanderbilt.edu
Mon Apr 25 17:21:33 CEST 2011


You've done a lot of good work on this.  Yes, I would say you have moderate
overfitting with the first model.  The only thing that saved you from severe
overfitting is that there seems to be a signal present [I am assuming this
model was truly pre-specified and was not developed at all by looking at
patterns of responses Y.]
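
As a rough cross-check, the heuristic (van Houwelingen-Le Cessie) shrinkage
estimate computed from the apparent fit statistics you posted points the
same way:

  (29.24 - 5)/29.24   # (LR chi2 - d.f.)/LR chi2, about 0.83; your bootstrap
                      # calibration slope of 0.76 is a bit more pessimistic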

The use of backwards stepdown demonstrated much worse overfitting.  This is
in line with what we know about the damage done by stepwise selection methods
that do not incorporate shrinkage.  I would throw away the stepwise
regression model.  You'll find that the model selected is entirely
arbitrary.  And you can't use the "selected" variables in any re-fit of the
model, i.e., you can't use lrm pretending that the two remaining variables
were pre-specified.  Stepwise regression methods only seem to help; when
assessed properly, we see that this is an illusion.

You are using penalization properly, but you did not print the full table of
penalties vs. effective AIC, so we can't have faith that you penalized enough.
I tend to run pentrace over a very wide range of possible penalties to make
sure I've found the global optimum.
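
For example, something like this with your fit object x6.lrm, widening the
grid until the optimum is clearly in the interior rather than at an endpoint:

  pen <- pentrace(x6.lrm, penalty=c(0, .5, 1, 2, 4, 8, 16, 32, 64, 128))
  pen          # look at the whole table of penalties tried, not just the best
  plot(pen)    # visual check that the chosen penalty is a global optimum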

Penalization somewhat solves the EPV problem but there is no substitute for
getting more data.

You can run validate specifying your final penalty as an argument.
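
For example, along these lines: re-fit with the penalty you settled on and
validate that fit (see ?validate.lrm for how the penalty is handled in the
resamples):

  x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=TRUE, y=TRUE)
  validate(x6.lrm.pen, B=200, bw=FALSE)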

Frank



細田弘吉 wrote:
> 
> Following your advice, I tried the rms package.
> Just to be clear, I have data on 104 patients (x6.df), consisting of 5 
> explanatory variables and one binary outcome (poor/good) (the previous 
> model 2 strategy). The outcome consists of 25 poor results and 79 good 
> results, so my events per variable (EPV) is only 5 (much less than the 
> rule of thumb of 10).
> 
> My questions are about validate and pentrace in the rms package.
> I present some code and results below.
> I appreciate anybody's help in advance.
> 
>  > x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore, 
> data=x6.df, x=T, y=T)
> 
>  > x6.lrm
> ...
>             Obs   104    LR chi2      29.24    R2      0.367    C       0.816
>        negative    79    d.f.             5    g       1.633    Dxy     0.632
>        positive    25    Pr(> chi2) <0.0001    gr      5.118    gamma   0.632
>   max |deriv| 1e-08                            gp      0.237    tau-a   0.233
>                                                                 Brier   0.127
> 
>                 Coef    S.E.   Wald Z Pr(>|Z|)
> Intercept      -5.5328 2.6287 -2.10  0.0353
> stenosis       -0.0150 0.0284 -0.53  0.5979
> x1              3.0425 0.9100  3.34  0.0008
> x2             -0.7534 0.4519 -1.67  0.0955
> procedure       1.2085 0.5717  2.11  0.0345
> ClinicalScore   0.3762 0.2287  1.65  0.0999
> 
> It seems not too bad. Next, validation by bootstrap ...
> 
>  > validate(x6.lrm, B=200, bw=F)
>            index.orig training    test optimism index.corrected   n
> Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
> R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
> Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
> Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
> Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
> D             0.2716   0.3368  0.2275   0.1093          0.1623 200
> U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
> Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
> B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
> g             1.6328   2.0740  1.4647   0.6093          1.0235 200
> gp            0.2367   0.2529  0.2189   0.0341          0.2026 200
> 
> The apparent Dxy is 0.63 and the bias-corrected Dxy is 0.52. The maximum 
> absolute calibration error (Emax) is estimated to be about 0.10. The 
> changes in slope and intercept are also substantial. In all, there is 
> evidence that I am somewhat overfitting the data, right?
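> 
> (If I understand correctly, the bias-corrected Dxy corresponds to a 
> bias-corrected c-index of about 0.76, versus the apparent C of 0.816, 
> since c = Dxy/2 + 0.5:)
> 
>   0.5233/2 + 0.5   # bias-corrected c-index from the table above, ~0.76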
> 
> Furthermore, using step-down variable selection ...
> 
>  > validate(x6.lrm, B=200, bw=T)
> 
> 		Backwards Step-down - Original Model
> 
>   Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
>   stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
>   ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
>   x2             2.86   1    0.0910 5.74     3    0.1252 -0.26
> 
> Approximate Estimates after Deleting Factors
> 
>               Coef   S.E. Wald Z         P
> Intercept  -5.865 1.4136 -4.149 3.336e-05
> x1          2.915 0.8685  3.357 7.889e-04
> procedure   1.072 0.5590  1.918 5.508e-02
> 
> Factors in Final Model
> 
> [1] x1         procedure
>            index.orig training    test optimism index.corrected   n
> Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
> R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
> Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
> Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
> Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
> D             0.2038   0.3130  0.1970   0.1160          0.0877 200
> U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
> Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
> B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
> g             1.2628   1.9524  1.3222   0.6302          0.6326 199
> gp            0.2041   0.2430  0.2043   0.0387          0.1654 199
> 
> If I select only two variables (x1 and procedure), bias-corrected Dxy 
> goes down to 0.45.
> 
> [Question 1]
> I have an EPV problem. Even so, should I keep the full (5-variable) model, 
> or can I use the 2-variable model (x1 and procedure) that validate() with 
> step-down provides?
> 
> [Question 2]
> If I use the 2-variable model, should I refit it with
> x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?
> Or should I keep the values shown above by the validate function?
> 
> Next, shrinkage ...
> 
>  > pentrace(x6.lrm, seq(0, 5.0, by=0.05))
> Best penalty:
> penalty         df
>     3.05   4.015378
> 
> The best penalty is 3.05. So, I update the model with this penalty to 
> obtain the corresponding penalized fit:
> 
>  > x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
>  > x6.lrm.pen
> .....
> Penalty factors
> 
>   simple nonlinear interaction nonlinear.interaction
>     3.05      3.05        3.05                  3.05
> Final penalty on -2 log L
>       [,1]
> [1,]  3.8
> 
>             Obs   104    LR chi2      28.18    R2      0.313    C       0.818
>        negative    79    d.f.         4.015    g       1.264    Dxy     0.635
>        positive    25    Pr(> chi2) <0.0001    gr      3.538    gamma   0.637
>   max |deriv| 3e-05                            gp      0.201    tau-a   0.234
>                                                                 Brier   0.129
> 
>                 Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
> Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
> stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
> x1              2.3605 0.7254  3.25  0.0011    0.6054
> x2             -0.5385 0.3653 -1.47  0.1404    1.2851
> procedure       0.9247 0.4844  1.91  0.0563    0.8576
> ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779
> 
> Arranging the coefficients of the two models side by side, together with 
> the absolute difference between them:
> 
>  > cbind(coef(x6.lrm), coef(x6.lrm.pen),
> abs(coef(x6.lrm)-coef(x6.lrm.pen)))
>                        [,1]        [,2]        [,3]
> Intercept      -5.53281808 -4.72464766 0.808170417
> stenosis       -0.01496757 -0.01050797 0.004459599
> x1              3.04248257  2.36051833 0.681964238
> x2             -0.75335619 -0.53854750 0.214808685
> procedure       1.20847252  0.92474708 0.283725441
> ClinicalScore   0.37623189  0.30457557 0.071656322
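> 
> (So the penalized coefficients are shrunk towards zero by roughly 15-30%; 
> a quick way to see this per coefficient:)
> 
>   coef(x6.lrm.pen)/coef(x6.lrm)   # ratio of penalized to unpenalized coefficients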
> 
> [Question 3]
> Is this penalized model the one I should present to my colleagues?
> I still have an EPV problem. Or is the EPV problem acceptable if I use 
> penalization?
> 
> I am still wondering what I can do to avoid the EPV problem. 
> Collecting new data would take a long time and a huge amount of work...
> 
> 
> (11/04/22 1:46), khosoda at med.kobe-u.ac.jp wrote:
>> Thank you for your comment.
>> I forgot to mention that varclus and pvclust showed similar results for
>> my data.
>>
>> BTW, I did not realize rms is a replacement for the Design package.
>> I appreciate your suggestion.
>> --
>> KH
>>
>> (11/04/21 8:00), Frank Harrell wrote:
>>> I think it's OK. You can also use the Hmisc package's varclus function.
>>> Frank
>>>
>>>
>>> 細田弘吉 wrote:
>>>>
>>>> Dear Prof. Harrell,
>>>>
>>>> Thank you very much for your quick advice.
>>>> I will try rms package.
>>>>
>>>> Regarding model reduction, is my model 2 method (clustering and
>>>> recoding that are blinded to the outcome) permissible?
>>>>
>>>> Sincerely,
>>>>
>>>> --
>>>> KH
>>>>
>>>> (11/04/20 22:01), Frank Harrell wrote:
>>>>> Deleting variables is a bad idea unless you make that a formal part of
>>>>> the BMA so that the attempt to delete variables is penalized for.
>>>>> Instead of BMA I recommend simple penalized maximum likelihood
>>>>> estimation (see the lrm function in the rms package) or pre-modeling
>>>>> data reduction that is blinded to the outcome variable.
>>>>> Frank
>>>>>
>>>>>
>>>>> 細田弘吉 wrote:
>>>>>>
>>>>>> Hi everybody,
>>>>>> I apologize in advance for the long mail.
>>>>>>
>>>>>> I have data on 104 patients, consisting of 15 explanatory variables
>>>>>> and one binary outcome (poor/good). The outcome consists of 25 poor
>>>>>> results and 79 good results. I tried to analyze the data with logistic
>>>>>> regression. However, 15 variables and 25 events mean that events per
>>>>>> variable (EPV) is much less than 10 (the rule of thumb). Therefore, I
>>>>>> used the R package "BMA" to perform logistic regression with BMA to
>>>>>> avoid this problem.
>>>>>>
>>>>>> model 1 (full model):
>>>>>> x1, x2, x3, x4 are continuous variables and others are binary data.
>>>>>>
>>>>>>> x16.bic.glm <- bic.glm(outcome ~ ., data=x16.df,
>>>>>> glm.family="binomial", OR=20, strict=FALSE)
>>>>>>> summary(x16.bic.glm)
>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>>
>>>>>> 62 models were selected
>>>>>> Best 5 models (cumulative posterior probability = 0.3606 ):
>>>>>>
>>>>>>                 p!=0    EV          SD        model 1   model 2
>>>>>> Intercept        100   -5.1348545  1.652424   -4.4688   -5.15
>>>>>>                                                          -5.1536
>>>>>> age              3.3    0.0001634  0.007258   .
>>>>>> sex              4.0
>>>>>>   .M                   -0.0243145  0.220314   .
>>>>>> side            10.8
>>>>>>   .R                    0.0811227  0.301233   .
>>>>>> procedure       46.9   -0.5356894  0.685148   .         -1.163
>>>>>> symptom          3.8   -0.0099438  0.129690   .         .
>>>>>> stenosis         3.4   -0.0003343  0.005254   .
>>>>>> x1               3.7   -0.0061451  0.144084   .
>>>>>> x2             100.0    3.1707661  0.892034    3.2221    3.11
>>>>>> x3              51.3   -0.4577885  0.551466   -0.9154   .
>>>>>> HT               4.6
>>>>>>   .positive             0.0199299  0.161769   .         .
>>>>>> DM               3.3
>>>>>>   .positive            -0.0019986  0.105910   .         .
>>>>>> IHD              3.5
>>>>>>   .positive             0.0077626  0.122593   .         .
>>>>>> smoking          9.1
>>>>>>   .positive             0.0611779  0.258402   .         .
>>>>>> hyperlipidemia  16.0
>>>>>>   .positive             0.1784293  0.512058   .         .
>>>>>> x4               8.2    0.0607398  0.267501   .         .
>>>>>>
>>>>>>
>>>>>> nVar             2          2          1          3          3
>>>>>> BIC             -376.9082  -376.5588  -376.3094  -375.8468  -374.5582
>>>>>> post prob        0.104      0.087      0.077      0.061      0.032
>>>>>>
>>>>>> [Question 1]
>>>>>> Is it O.K. to calculate the odds ratio and its 95% confidence interval
>>>>>> from "EV" (the posterior distribution mean) and "SD" (the posterior
>>>>>> distribution standard deviation)?
>>>>>> For example, the 95% CI for x2 can be calculated as:
>>>>>>> exp(3.1707661)
>>>>>> [1] 23.82573 -----> odds ratio
>>>>>>> exp(3.1707661+1.96*0.892034)
>>>>>> [1] 136.8866
>>>>>>> exp(3.1707661-1.96*0.892034)
>>>>>> [1] 4.146976
>>>>>> ------------------> 95%CI (4.1 to 136.9)
>>>>>> Is this O.K.?
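>>>>>>
>>>>>> (A compact version of the same calculation for every term at once, if
>>>>>> I understand the bic.glm output components postmean and postsd
>>>>>> correctly:)
>>>>>>
>>>>>>  pm <- x16.bic.glm$postmean
>>>>>>  ps <- x16.bic.glm$postsd
>>>>>>  round(exp(cbind(OR = pm, lower = pm - 1.96*ps, upper = pm + 1.96*ps)), 2)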
>>>>>>
>>>>>> [Question 2]
>>>>>> Is it permissible to delete variables with small values of "p!=0" and
>>>>>> "EV", such as age (3.3% and 0.0001634), to reduce the number of
>>>>>> explanatory variables and reconstruct a new model without those
>>>>>> variables for a new session of BMA?
>>>>>>
>>>>>> model 2 (reduced model):
>>>>>> I used the R package "pvclust" to reduce the model. The result
>>>>>> suggested that x1, x2 and x4 belonged to the same cluster, so I picked
>>>>>> only x2. Based on subject knowledge, I made a simple unweighted sum by
>>>>>> counting the number of clinical features. For 9 features (sex, side,
>>>>>> HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges
>>>>>> from 0 to 9. This score was defined as ClinicalScore. Consequently, I
>>>>>> made a new data set (x6.df), which consists of 5 variables (stenosis,
>>>>>> x2, x3, procedure, and ClinicalScore) and one binary outcome
>>>>>> (poor/good). Then, for an alternative BMA session...
>>>>>>
>>>>>>> BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,
>>>>>> glm.family="binomial", OR=20, strict=FALSE)
>>>>>>> summary(BMAx6.glm)
>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>> Call:
>>>>>> bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
>>>>>> "binomial", strict = FALSE, OR = 20)
>>>>>>
>>>>>>
>>>>>> 13 models were selected
>>>>>> Best 5 models (cumulative posterior probability = 0.7626 ):
>>>>>>
>>>>>>                p!=0    EV          SD       model 1   model 2
>>>>>> Intercept       100   -5.6918362  1.81220   -4.4688   -6.3166
>>>>>> stenosis        8.1   -0.0008417  0.00815   .         .
>>>>>> x2            100.0    3.0606165  0.87765    3.2221    3.1154
>>>>>> x3             46.5   -0.3998864  0.52688   -0.9154   .
>>>>>> procedure      49.3    0.5747013  0.70164   .          1.1631
>>>>>> ClinicalScore  27.1    0.0966633  0.19645   .         .
>>>>>>
>>>>>>
>>>>>> nVar            2          2          1          3          3
>>>>>> BIC            -376.9082  -376.5588  -376.3094  -375.8468  -375.5025
>>>>>> post prob       0.208      0.175      0.154      0.122      0.103
>>>>>>
>>>>>> [Question 3]
>>>>>> Am I doing this correctly?
>>>>>> I mean, is this kind of model reduction permissible for BMA?
>>>>>>
>>>>>> [Question 4]
>>>>>> I still have 5 variables, which violates the rule of thumb "EPV > 10".
>>>>>> Is it permissible to delete the "stenosis" variable because of its
>>>>>> small value of "EV"? Or is it O.K. because this is BMA?
>>>>>>
>>>>>> Sorry for the long post.
>>>>>>
>>>>>> I appreciate your help very much in advance.
>>>>>>
>>>>>> --
>>>>>> KH
>>>>>>
>>>>>
>>>>>
>>>>> -----
>>>>> Frank Harrell
>>>>> Department of Biostatistics, Vanderbilt University
> 


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University


