[R] subset selection for logistic regression

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Mar 3 00:23:49 CET 2005


Christian Hennig wrote:
> Perhaps I should not write it because I will discredit myself with this
> but...
> 
> Suppose I have a setup with 100 variables and some 1000 cases and I want to
> boil down the number of variables to a maximum of 10 for practical reasons
> even if I lose 10% prediction quality by this (for example because it is
> expensive to measure all variables on new cases).  
> 
> Is it really so wrong to use a stepwise method?

Yes.  Read about model uncertainty and bias in models developed using 
stepwise methods.  One exception: if there is a large number of 
variables with truly zero regression coefficients, and the rest are not 
very weak, stepwise can sort things out fairly well.  But you never know 
this in advance.
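
A minimal sketch of the bias in question, written in plain Python so that it needs no packages. It is not a stepwise procedure, only its simplest ingredient: picking the "best" of many candidate predictors by looking at the outcome. All predictors here are pure noise, and the names (`pearson`, `simulate`) and settings (50 cases, 20 candidates, 200 replications) are illustrative assumptions, not anything from the book.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_cases=50, n_vars=20, n_sims=200, seed=1):
    """Average |r| for a prespecified noise predictor versus the
    best-looking one chosen among n_vars noise predictors."""
    rng = random.Random(seed)
    fixed_total = 0.0
    selected_total = 0.0
    for _ in range(n_sims):
        # Outcome is unrelated to every predictor by construction.
        y = [rng.gauss(0, 1) for _ in range(n_cases)]
        abs_r = []
        for _ in range(n_vars):
            x = [rng.gauss(0, 1) for _ in range(n_cases)]
            abs_r.append(abs(pearson(x, y)))
        fixed_total += abs_r[0]        # chosen before seeing the data
        selected_total += max(abs_r)   # chosen because it looked best
    return fixed_total / n_sims, selected_total / n_sims


prespecified, best_of_many = simulate()
print("mean |r|, prespecified predictor:", round(prespecified, 3))
print("mean |r|, selected predictor:    ", round(best_of_many, 3))
```

The selected predictor looks far stronger than the prespecified one although neither has any true association with the outcome; that apparent strength is created by the selection step itself.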

> Let's say I divide the sample into three parts and do variable selection on
> the first part, estimation on the second and test on the third part (this
> solves almost all problems Frank is talking about on p. 56/57 in his
> excellent book). Is there always a tractable alternative? 

That's a good way to find out how bad the method is, not to fix the 
problems inherent in it.
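
For concreteness, the three-part scheme described in the question could be sketched as follows. This only shows the bookkeeping of the split (disjoint index sets for selection, estimation and testing); the helper name `three_way_split`, the seed and the use of 1000 cases are assumptions for illustration.

```python
import random


def three_way_split(n_cases, seed=0):
    """Shuffle case indices and cut them into three disjoint parts."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    cut1 = n_cases // 3
    cut2 = 2 * n_cases // 3
    return idx[:cut1], idx[cut1:cut2], idx[cut2:]


# Part 1: variable selection; part 2: estimation; part 3: testing.
select_idx, estimate_idx, test_idx = three_way_split(1000)
print(len(select_idx), len(estimate_idx), len(test_idx))  # prints 333 333 334
```

Note that each step then works with only about a third of the cases, and the test part merely reports how the selected model performs.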

> 
> Of course it is wrong to interpret the selected variables as "the true
> influences" and all others as "unrelated", but if I don't do that?
> 
> If it should really be a taboo to do stepwise variable selection, why are p.
> 58/59 of "Regression Modeling Strategies" devoted to "how to do it if you
> must"?

Stress on "if".  And note that if you ask what is the optimum alpha for 
variables to be kept in the model when doing backwards stepdown, it's 
alpha=1.0.  A good compromise is alpha=0.5.  See

@Article{ste01pro,
   author =  {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and
             Harrell, Frank E. and Habbema, J. Dik F.},
   title =   {Prognostic modeling with logistic regression analysis:
             {In} search of a sensible strategy in small data sets},
   journal = {Medical Decision Making},
   year =    2001,
   volume =  21,
   pages =   {45--56},
   annote =  {shrinkage; variable selection; dichotomization of
             continuous variables; sign of regression coefficient;
             calibration; validation}
}
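
To make the role of alpha concrete: in backwards stepdown a variable is kept when its p-value is below alpha. Assuming a 1 d.f. Wald test with a normal approximation, that is the same as requiring |z| to exceed a critical value. The sketch below uses a hypothetical helper `keep_threshold` to show how low that bar is for the alphas mentioned above.

```python
from statistics import NormalDist


def keep_threshold(alpha):
    """Two-sided critical |z|: a variable is kept if its |z| exceeds this."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for alpha in (0.05, 0.5, 1.0):
    print("alpha =", alpha, " keep if |z| >", round(keep_threshold(alpha), 3))
```

The conventional alpha=0.05 demands |z| of about 1.96, alpha=0.5 only about 0.67, and alpha=1.0 gives a threshold of zero, so nothing is ever deleted and the full model is retained.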

And on Bert's excellent question about why shrinkage is not used more 
often, here is our attempt at a remedy:

@Article{moo04pen,
   author =  {Moons, K. G. M. and Donders, A. Rogier T. and
             Steyerberg, E. W. and Harrell, F. E.},
   title =   {Penalized maximum likelihood estimation to directly
             adjust diagnostic and prognostic prediction models for
             overoptimism: a clinical example},
   journal = {J Clinical Epidemiology},
   year =    2004,
   volume =  57,
   pages =   {1262--1270},
   annote =  {prediction research; overoptimism; overfitting;
             penalization; bootstrapping; shrinkage}
}
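
As a rough illustration of what penalization does, here is a minimal sketch of a logistic model with one predictor in which the slope carries a quadratic (ridge-type) penalty, fitted by Newton-Raphson. It is not the procedure of the paper above; the function name `fit_logistic`, the simulated data and the penalty value 10 are assumptions chosen only to show the shrinkage.

```python
import math
import random


def fit_logistic(x, y, penalty=0.0, n_iter=25):
    """Maximize loglik - 0.5 * penalty * slope**2 by Newton-Raphson.
    The intercept is left unpenalized."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p             # score for the intercept
            g1 += (yi - p) * xi      # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        g1 -= penalty * b1           # penalty contribution to the score
        h11 += penalty               # and to the information
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated data with a true slope of 1 (illustrative only).
rng = random.Random(2005)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0 for xi in x]

_, slope_ml = fit_logistic(x, y, penalty=0.0)
_, slope_pen = fit_logistic(x, y, penalty=10.0)
print("ordinary ML slope:", round(slope_ml, 3))
print("penalized slope:  ", round(slope_pen, 3))
```

The penalized slope is pulled toward zero relative to the ordinary maximum likelihood estimate; in practice the amount of penalty would have to be chosen by some criterion rather than fixed in advance as it is here.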

Frank


> 
> Please forget my name;-)
> 
> Christian
> 
> On Wed, 2 Mar 2005, Berton Gunter wrote:
> 
> 
>>To clarify Frank's remark ...
>>
>>A prominent theme in statistical research over at least the last 25 years
>>(with roots that go back 50 or more, probably) has been the superiority of
>>"shrinkage" methods over variable selection. I also find it distressing that
>>these ideas have apparently not penetrated much (at all?) into the wider
>>scientific community (but I suppose I shouldn't be surprised -- most
>>scientists still do one factor at a time experiments 80 years after Fisher).
>>Specific incarnations can be found in anything Bayesian, mixed effects
>>models for repeated measures, ridge regression, and the R packages lars and
>>lasso, among others.
>>
>>I would speculate that aside from the usual statistics/science cultural
>>issues, part of the reason for this is that the estimators don't generally
>>come with neat, classical inference procedures: like it or not, many
>>scientists have been conditioned by their Stat 101 courses to expect P
>>values, so in some sense, we are hoisted by our own petard.
>>
>>Just my $.02 -- contrary (and more knowledgeable) opinions welcome.
>>
>>-- Bert Gunter
>> 
>>
>>
>>>-----Original Message-----
>>>From: r-help-bounces at stat.math.ethz.ch 
>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Frank 
>>>E Harrell Jr
>>>Sent: Wednesday, March 02, 2005 5:13 AM
>>>To: Wittner, Ben
>>>Cc: r-help at lists.R-project.org
>>>Subject: Re: [R] subset selection for logistic regression
>>>
>>>Wittner, Ben wrote:
>>>
>>>>R-packages leaps and subselect implement various methods of selecting
>>>>best or good subsets of predictor variables for linear regression
>>>>models, but they do not seem to be applicable to logistic regression
>>>>models.
>>>> 
>>>>Does anyone know of software for finding good subsets of predictor
>>>>variables for logistic regression models?
>>>> 
>>>>Thanks.
>>>> 
>>>>-Ben
>>>
>>>Why are these procedures still being used?  The performance is known to
>>>be bad in almost every sense (see r-help archives).
>>>
>>>Frank Harrell
>>>
>>>
>>>> 
>>>>p.s., The leaps package references "Subset Selection in Regression" by
>>>>Alan Miller. On page 2 of the 2nd edition of that text it states the
>>>>following:
>>>> 
>>>>  "All of the models which will be considered in this monograph will be
>>>>   linear; that is they will be linear in the regression coefficients.
>>>>   Though most of the ideas and problems carry over to the fitting of
>>>>   nonlinear models and generalized linear models (particularly the
>>>>   fitting of logistic relationships), the complexity is greatly
>>>>   increased."
>>>
>>>
>>>-- 
>>>Frank E Harrell Jr   Professor and Chair           School of Medicine
>>>                      Department of Biostatistics   Vanderbilt University
>>>
>>>______________________________________________
>>>R-help at stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide! 
>>>http://www.R-project.org/posting-guide.html
>>>
>>
> 
> 
> ***********************************************************************
> Christian Hennig
> Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
> hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
> From 1 April 2005: Department of Statistical Science, UCL, London
> #######################################################################
> I recommend www.boag-online.de
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
