[R] variable selection in logistic

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Sep 3 19:11:14 CEST 2009


annie Zhang wrote:
> Thank you for all your replies.
> Actually, as Bert said, besides prediction, I also need variable selection
> (I need to know which variables are important). As for the sample
> size and number of variables, both are small, around 35. How can
> I get accurate prediction as well as good predictors?
> Annie

It is next to impossible to find a unique list of 'important' variables 
without having 50 times as many subjects as potential predictors, unless 
your signal:noise ratio is stunning.
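
One way to see this is a small simulation. The sketch below is an
illustration only, not a method proposed in this thread: it uses plain
Python (standard library only) rather than R, and it swaps in simple
univariate screening (rank predictors by absolute correlation with the
outcome, keep the top 5) as a stand-in for stepwise selection. The 35
subjects and 35 candidate predictors match the description above; the
three true predictors, their coefficient of 1.0, the 200 bootstrap
resamples, and all function names are assumptions made up for the example.

```python
import math
import random
from itertools import combinations


def simulate(n=35, p=35, n_signal=3, beta=1.0, seed=1):
    """Simulate n subjects with p standard-normal predictors and a binary
    outcome that depends only on the first n_signal predictors (assumed)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = []
    for row in X:
        lp = beta * sum(row[:n_signal])          # linear predictor
        prob = 1.0 / (1.0 + math.exp(-lp))       # logistic link
        y.append(1 if rng.random() < prob else 0)
    return X, y


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def select_top_k(X, y, k):
    """Univariate screening: keep the k predictors most correlated with y."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])


def bootstrap_selections(X, y, k=5, n_boot=200, seed=2):
    """Repeat the selection on bootstrap resamples of the subjects."""
    rng = random.Random(seed)
    n = len(y)
    picks = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        picks.append(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return picks


def mean_jaccard(sets):
    """Average Jaccard overlap over all pairs of selected sets."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


X, y = simulate()
picks = bootstrap_selections(X, y)
freq = [sum(j in s for s in picks) / len(picks) for j in range(len(X[0]))]
overlap = mean_jaccard(picks)

print("selection frequency of the 3 true predictors:",
      [round(f, 2) for f in freq[:3]])
print("noise variables selected at least once:", sum(f > 0 for f in freq[3:]))
print("mean Jaccard overlap between resamples:", round(overlap, 2))
```

If the list of 'important' variables were stable, the mean overlap between
resamples would be close to 1 and few noise variables would ever be
selected. The exact numbers depend on the seeds and on the assumed effect
sizes.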

Frank

> 
> On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter <gunter.berton at gene.com> wrote:
> 
>     But let's be clear here folks:
> 
>     Ben's comment is apropos: "'As many variables as samples' is
>     particularly scary."
> 
>     (Aside -- how much scarier then are -omics analyses in which the
>     number of variables is thousands of times the number of samples?)
> 
>     Sensible penalization (it's usually not too sensitive to the details)
>     is only another way of obtaining a parsimonious model with good (in
>     the sense of minimizing overall prediction error: bias + variance)
>     prediction properties. Alas, this is often not what scientists want:
>     they use variable selection to find the "right" covariates, the "most
>     important" variables affecting the response. But this is beyond the
>     power of empirical modeling here: "as many variables as samples"
>     almost guarantees that there will be many different and even
>     nonoverlapping subsets of variables that are, within statistical
>     noise, equally "optimal" predictors. That is, variable selection in
>     such circumstances is just a pretty sophisticated random number
>     generator -- ergo Frank's Draconian warnings. Penalization produces
>     better prediction engines with better properties, but it cannot
>     overcome the "as many variables as samples" problem either. Entropy
>     rules. If what is sought is a way to determine the "truly important"
>     variables, then the study must be designed to provide the information
>     to do so. You don't get something for nothing.
> 
>     Cheers,
> 
>     Bert Gunter
>     Genentech Nonclinical Biostatistics
> 
> 
>     -----Original Message-----
>     From: r-help-bounces at r-project.org
>     [mailto:r-help-bounces at r-project.org] On Behalf Of Frank E Harrell Jr
>     Sent: Wednesday, September 02, 2009 9:07 PM
>     To: annie Zhang
>     Cc: r-help at r-project.org
>     Subject: Re: [R] variable selection in logistic
> 
>     annie Zhang wrote:
>      > Hi, Frank,
>      >
>      > You mean the backward and forward stepwise selection is bad? You
>      > also suggest the penalized logistic regression is the best choice?
>      > Is there any function to do it as well as selecting the best
>      > penalty?
>      >
>      > Annie
> 
>     All variable selection is bad unless it's in the context of
>     penalization. You'll need penalized logistic regression, not
>     necessarily with variable selection: for example, a quadratic penalty
>     as in a case study in my book, or an L1 penalty (lasso) using other
>     packages.
> 
>     Frank
> 
>      >
>      > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>      > <f.harrell at vanderbilt.edu> wrote:
>      >
>      >     David Winsemius wrote:
>      >
>      >
>      >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>      >
>      >             Hi, R users,
>      >
>      >             What may be the best function in R to do variable
>      >             selection in logistic regression?
>      >
>      >
>      >         PhD theses, and books by famous statisticians have been
>      >         pursuing the answer to that question for decades.
>      >
>      >             I have the same number of variables as the number of
>      >             samples, and I want to select the best variables for
>      >             prediction. Is there any function doing forward
>      >             selection followed by backward elimination in stepwise
>      >             logistic regression?
>      >
>      >
>      >         You should probably be reading up on penalized regression
>      >         methods. The stepwise procedures reporting unadjusted
>      >         "significance" made available by SAS and SPSS to the unwary
>      >         neophyte user have very poor statistical properties.
>      >
>      >         --
>      >
>      >         David Winsemius, MD
>      >
>      >
>      >     Amen to that.
>      >
>      >     Annie, resist the temptation.  These methods bite.
>      >
>      >     Frank
>      >
>      >
>      >         Heritage Laboratories
>      >         West Hartford, CT
>      >
>      >         ______________________________________________
>      >         R-help at r-project.org mailing list
>      >         https://stat.ethz.ch/mailman/listinfo/r-help
>      >         PLEASE do read the posting guide
>      >         http://www.R-project.org/posting-guide.html
>      >         and provide commented, minimal, self-contained,
>      >         reproducible code.
>      >
>      >
>      >
>      >     --
>      >     Frank E Harrell Jr   Professor and Chair           School of Medicine
>      >                         Department of Biostatistics   Vanderbilt University
>      >
>      >
> 
> 
>     --
>     Frank E Harrell Jr   Professor and Chair           School of Medicine
>                          Department of Biostatistics   Vanderbilt University
> 
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



