[R] subset selection for logistic regression

Christian Hennig fm3a004 at math.uni-hamburg.de
Wed Mar 2 18:10:46 CET 2005


Perhaps I should not write it because I will discredit myself with this
but...

Suppose I have a setup with 100 variables and some 1000 cases and I want to
boil down the number of variables to a maximum of 10 for practical reasons
even if I lose 10% prediction quality by this (for example because it is
expensive to measure all variables on new cases).  

Is it really so wrong to use a stepwise method?
Let's say I divide the sample into three parts and do variable selection on
the first part, estimation on the second, and testing on the third (this
solves almost all the problems Frank discusses on pp. 56/57 of his
excellent book). Is there always a tractable alternative? 
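
To make the three-part scheme concrete, here is a minimal sketch in Python
(standard library only, rather than R, so that it runs as-is). Everything in
it is an assumption for illustration: the data are simulated with only
variables 0 and 1 carrying signal, the sizes are scaled down from the
1000/100/10 setup above, the helper names (fit, log_lik, predict_prob) are
made up, and the stepwise rule is a plain forward search that stops at a
fixed number of variables. It is not the procedure from Frank's book.

```python
import math
import random

random.seed(0)
N_CASES, N_VARS, MAX_VARS = 600, 12, 3  # scaled-down stand-ins for 1000 / 100 / 10

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Simulated data (assumption): only variables 0 and 1 influence the response.
X = [[random.gauss(0, 1) for _ in range(N_VARS)] for _ in range(N_CASES)]
y = [1 if random.random() < sigmoid(2 * r[0] + 2 * r[1]) else 0 for r in X]

def fit(X, y, cols, iters=100, lr=0.5):
    # Logistic regression on the chosen columns, fitted by plain gradient descent.
    w, b, n = [0.0] * len(cols), 0.0, len(y)
    for _ in range(iters):
        gw, gb = [0.0] * len(cols), 0.0
        for row, yi in zip(X, y):
            err = sigmoid(b + sum(wj * row[c] for wj, c in zip(w, cols))) - yi
            gb += err
            for j, c in enumerate(cols):
                gw[j] += err * row[c]
        b -= lr * gb / n
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
    return w, b

def predict_prob(model, cols, row):
    w, b = model
    return sigmoid(b + sum(wj * row[c] for wj, c in zip(w, cols)))

def log_lik(model, cols, X, y):
    # Binomial log-likelihood of the fitted model on (X, y).
    total = 0.0
    for row, yi in zip(X, y):
        p = min(max(predict_prob(model, cols, row), 1e-12), 1 - 1e-12)
        total += math.log(p) if yi else math.log(1 - p)
    return total

# Part 1: selection, part 2: estimation, part 3: test.
third = N_CASES // 3
X_sel, y_sel = X[:third], y[:third]
X_est, y_est = X[third:2 * third], y[third:2 * third]
X_test, y_test = X[2 * third:], y[2 * third:]

# Forward selection on part 1 only: add the variable that most improves
# the log-likelihood until MAX_VARS variables have been chosen.
selected = []
while len(selected) < MAX_VARS:
    candidates = [c for c in range(N_VARS) if c not in selected]
    best = max(candidates,
               key=lambda c: log_lik(fit(X_sel, y_sel, selected + [c]),
                                     selected + [c], X_sel, y_sel))
    selected.append(best)

# Re-estimate the coefficients on part 2, then score on the untouched part 3.
model = fit(X_est, y_est, selected)
hits = sum((predict_prob(model, selected, r) >= 0.5) == bool(yi)
           for r, yi in zip(X_test, y_test))
accuracy = hits / len(y_test)
print("selected:", selected, "test accuracy:", round(accuracy, 3))
```

The point of the split is that the coefficients and the accuracy estimate
never see the cases used to choose the variables, so neither inherits the
optimism of the search. The price is that each step uses only a third of
the data.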

Of course it is wrong to interpret the selected variables as "the true
influences" and all others as "unrelated", but if I don't do that?

If it should really be a taboo to do stepwise variable selection, why are pp.
58/59 of "Regression Modeling Strategies" devoted to "how to do it if you
must"?

Please forget my name;-)

Christian

On Wed, 2 Mar 2005, Berton Gunter wrote:

> To clarify Frank's remark ...
> 
> A prominent theme in statistical research over at least the last 25 years
> (with roots that go back 50 or more, probably) has been the superiority of
> "shrinkage" methods over variable selection. I also find it distressing that
> these ideas have apparently not penetrated much (at all?) into the wider
> scientific community (but I suppose I shouldn't be surprised -- most
> scientists still do one factor at a time experiments 80 years after Fisher).
> Specific incarnations can be found in anything Bayesian, mixed effects
> models for repeated measures, ridge regression, and the R packages lars and
> lasso, among others.
> 
> I would speculate that aside from the usual statistics/science cultural
> issues, part of the reason for this is that the estimators don't generally
> come with neat, classical inference procedures: like it or not, many
> scientists have been conditioned by their Stat 101 courses to expect P
> values, so in some sense, we are hoisted by our own petard.
> 
> Just my $.02 -- contrary (and more knowledgeable) opinions welcome.
> 
> -- Bert Gunter
>  
> 
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch 
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Frank 
> > E Harrell Jr
> > Sent: Wednesday, March 02, 2005 5:13 AM
> > To: Wittner, Ben
> > Cc: r-help at lists.R-project.org
> > Subject: Re: [R] subset selection for logistic regression
> > 
> > Wittner, Ben wrote:
> > > R-packages leaps and subselect implement various methods of selecting best or
> > > good subsets of predictor variables for linear regression models, but they do
> > > not seem to be applicable to logistic regression models.
> > >  
> > > Does anyone know of software for finding good subsets of predictor variables for
> > > logistic regression models?
> > >  
> > > Thanks.
> > >  
> > > -Ben
> > 
> > Why are these procedures still being used?  The performance is known to
> > be bad in almost every sense (see r-help archives).
> > 
> > Frank Harrell
> > 
> > >  
> > > p.s., The leaps package references "Subset Selection in Regression" by Alan
> > > Miller. On page 2 of the 2nd edition of that text it states the following:
> > >  
> > >   "All of the models which will be considered in this monograph will be linear;
> > >    that is they will be linear in the regression coefficients. Though most of
> > >    the ideas and problems carry over to the fitting of nonlinear models and
> > >    generalized linear models (particularly the fitting of logistic
> > >    relationships), the complexity is greatly increased."
> > 
> > 
> > -- 
> > Frank E Harrell Jr   Professor and Chair           School of Medicine
> >                       Department of Biostatistics   Vanderbilt University
> > 
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! 
> > http://www.R-project.org/posting-guide.html
> >
> 

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
From 1 April 2005: Department of Statistical Science, UCL, London
#######################################################################
I recommend www.boag-online.de