[R] FW: logistic regression

Robert A LaBudde ral at lcfltd.com
Mon Sep 29 07:04:01 CEST 2008


At the risk of my also spending time in the "Inferno", I would 
suggest your problem resembles principal components analysis or 
factor analysis. In either, you would look for a small set of linear 
combinations of your variables that has lower dimensionality but 
spans nearly the same subspace as the original variables.
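
As a rough sketch of what that looks like in base R (the data here are simulated stand-ins, not your 44 variables; the names x1..x6 and the 90% cutoff are assumptions made purely for illustration):

```r
# Simulated stand-in data: six correlated predictors driven by two
# hidden factors. Nothing here comes from the real dataset.
set.seed(1)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(x1 = f1 + rnorm(n, sd = 0.2),
           x2 = f1 + rnorm(n, sd = 0.2),
           x3 = f1 + rnorm(n, sd = 0.2),
           x4 = f2 + rnorm(n, sd = 0.2),
           x5 = f2 + rnorm(n, sd = 0.2),
           x6 = f2 + rnorm(n, sd = 0.2))

# PCA on centered, unit-variance columns (i.e. on the correlation matrix).
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running sum.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the (arbitrary) 90% threshold.
k <- which(cum_var >= 0.90)[1]
print(round(cum_var, 3))
print(k)

# Loadings: the linear combinations of the original variables that are kept.
print(round(pca$rotation[, 1:k, drop = FALSE], 2))
```

The loadings are worth reading before going further: if a component mixes variables that make no sense together, that is a hint the reduced set will be hard to interpret.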

Before you embark on any of this, you should ask what you are 
interested in: 1) A physical model that can be interpreted, and may 
hold true in future experiments; or 2) A numerical representation of 
your data that interpolates it. For the former, there is no 
substitute for expert knowledge in formulating models, and then you 
can see if they are in discord with your data. For the latter, the 
PCA approach can condense your predictor set and avoid collinearity.
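
A minimal sketch of that second route, again on simulated data (the outcome y, the coefficients used to generate it, and the choice of two components are all assumptions for illustration, not a recommendation for your data):

```r
# Simulated stand-in data: six collinear predictors and a binary outcome.
set.seed(2)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(x1 = f1 + rnorm(n, sd = 0.2),
           x2 = f1 + rnorm(n, sd = 0.2),
           x3 = f1 + rnorm(n, sd = 0.2),
           x4 = f2 + rnorm(n, sd = 0.2),
           x5 = f2 + rnorm(n, sd = 0.2),
           x6 = f2 + rnorm(n, sd = 0.2))
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * f1 - 0.8 * f2))

# The raw predictors are strongly collinear.
raw_cor <- cor(X)
max_raw_cor <- max(abs(raw_cor[upper.tri(raw_cor)]))

# Condense the predictors to the first two component scores.
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])   # columns are named PC1, PC2

# Component scores are mutually uncorrelated by construction.
score_cor <- cor(scores)
max_score_cor <- max(abs(score_cor[upper.tri(score_cor)]))

# Logistic regression on the condensed predictor set.
dat <- cbind(y = y, scores)
fit <- glm(y ~ PC1 + PC2, family = binomial, data = dat)
print(summary(fit))
print(c(raw = max_raw_cor, scores = max_score_cor))
```

Note that the components are chosen without looking at y, so the reduction step does not bias the later P-values the way stepwise selection does; the price is that the coefficients refer to components, not to the original variables.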

At 10:49 PM 9/28/2008, Darin Brooks wrote:
>Wow.  I had no idea.  I was told to be wary ... But nothing this bold.
>
>I appreciate your straightforward advice.
>
>I will be exploring the R packages: rpart, earth, and gbm.  Dr Elith has
>generously provided me with literature and R support in the boosted
>regression tree arena.   I will leave stepwise logistic regression alone.
>
>Any parting advice regarding narrowing down the variables from the unruly 44
>to about 8 or 10?  (In addition to your advice regarding redundancy analysis
>and penalized maximum likelihood estimation).
>
>And I visited your website Dr. Harrell.  A LOT of help there.  I will also
>be purchasing your book this week.  Wish I would have stumbled on this forum
>a year ago.
>
>Thanks again.
>
>-----Original Message-----
>From: Frank E Harrell Jr [mailto:f.harrell at vanderbilt.edu]
>Sent: Sunday, September 28, 2008 8:23 PM
>To: Darin Brooks
>Cc: 'Bert Gunter'; r-help at r-project.org
>Subject: Re: [R] FW: logistic regression
>
>
>Darin Brooks wrote:
> > I certainly appreciate your comments, Bert.  It is abundantly clear
> > that I won't be invited to any of the cocktail parties hosted by the
> > "polite circles".  I am not a statistician.  I am merely a geographer
> > (in the field of ecology) trying to develop a predictor to assist in a
> > forestry-based decision making process.  My work in the natural world
> > has taught me that NOTHING is predictable ... and the very idea of a
> > bullet-proof ecological predictive model is doomed to fail.
> > That said, there ARE some basic predictors that assist foresters in
> > their salvage decisions.  They use these on a daily basis.  The
> > problem is that most of the evidence and modeling is anecdotal.  There
> > really are no models in the field that I am working in.  And for good
> > reason ... The natural world isn't interested in being modeled.  I
> > think we can all agree on this - guru or not.
> > But even the most basic predictive model (using only the GIS/mappable
> > data that is readily available to most users) is a starting point.
> > The resultant dataset(s) of this potential model will be followed up
> > and field-verified.
> > Providing this simple starting point (or catalyst, if you will) could
> > potentially save A LOT of time and money.
> > What I need to do is to isolate the best available variables into a
> > model and assign a confidence to it.  It doesn't have to change
> > everyone's world ... it just has to change the way of thinking in my
> > small little world.
> > These past few days have been an education for me in the subject of
> > stepwise regression.  I approach it with much more apprehension now.
> > So if nothing else good comes of this discussion/exercise/experience
> > ... I've learned something.
> >
> > Darin Brooks
>
>Darin,
>
>I think the point is that the confidence you can assign to the "best
>available variables" is zero.  That is the probability that stepwise
>variable selection will select the correct variables.
>
>It is probably better to build a model based on the knowledge in the field
>you alluded to, rather than to use P-values to decide.
>
>Frank Harrell
>
>
> >
> > -----Original Message-----
> > From: Bert Gunter [mailto:gunter.berton at gene.com]
> > Sent: Sunday, September 28, 2008 6:26 PM
> > To: 'David Winsemius'; 'Darin Brooks'
> > Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
> > Subject: RE: [R] FW: logistic regression
> >
> >
> > The Inferno awaits me -- but I cannot resist a comment (but DO look at
> > Frank's website).
> >
> > There is a deep and disconcerting dissonance here. Scientists are
> > (naturally) interested in getting at mechanisms, and so want to know
> > which of the variables "count" and which do not. But statistical
> > analysis --
> > **any** statistical analysis -- cannot tell you that. All statistical
> > analysis can do is build models that give good predictions (and only
> > over the range of the data). The models you get depend **both** on the
> > way Nature works **and** the peculiarities of your data (which is what
> > Frank referred to in his comment on data reduction). In fact, it is
> > highly likely that with your data there are many alternative
> > prediction equations built from different collections of covariates
> > that perform essentially equally well.
> > Sometimes it is otherwise, typically when prospective, carefully
> > designed studies are performed -- there is a reason that the FDA
> > insists on clinical trials, after all (and reasons why such studies
> > are difficult and expensive to do!).
> >
> > The belief that "data mining" (as it is known in the polite circles
> > that Frank obviously eschews) is an effective (and even automated!)
> > tool for discovering how Nature works is a misconception, but one that
> > for many reasons is enthusiastically promoted.  If you are looking
> > only to predict, it may do; but you are deceived if you hope for
> > Truth. Can you get hints? -- well maybe, maybe not. Chaos beckons.
> >
> > I think many -- maybe even most -- statisticians rue the day that
> > stepwise regression was invented and certainly that it has been
> > marketed as a tool for winnowing out the "important" few variables
> > from the blizzard of "irrelevant" background noise. Pogo was right:
> > "We have met the enemy and he is us."
> >
> > (As I said, the Inferno awaits...)
> >
> > Cheers to all,
> > Bert Gunter
> >
> > DEFINITELY MY OWN OPINIONS HERE!
> >
> >
> >
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> > Sent: Saturday, September 27, 2008 5:34 PM
> > To: Darin Brooks
> > Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
> > Subject: Re: [R] FW: logistic regression
> >
> > It's more a statement that it expresses a statistical perspective very
> > succinctly, somewhat like a Zen koan.  Frank's book, "Regression
> > Modeling Strategies", has entire chapters on reasoned approaches to
> > your question.
> > His website also has quite a bit of material free for the taking.
> >
> > --
> > David Winsemius
> > Heritage Laboratories
> >
> > On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote:
> >
> >> Glad you were amused.
> >>
> >> I assume that "booking this as a fortune" means that this was an
> >> idiotic way to model the data?
> >>
> >> MARS?  Boosted Regression Trees?  Any of these a better choice to
> >> extract significant predictors (from a list of about 44) for a
> >> measured dependent variable?
> >>
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Ted Harding
> >> Sent: Saturday, September 27, 2008 4:30 PM
> >> To: r-help at stat.math.ethz.ch
> >> Subject: Re: [R] FW: logistic regression
> >>
> >>
> >>
> >> On 27-Sep-08 21:45:23, Dieter Menne wrote:
> >>> Frank E Harrell Jr <f.harrell <at> vanderbilt.edu> writes:
> >>>
> >>>> Estimates from this model (and especially standard errors and
> >>>> P-values) will be invalid because they do not take into account
> >>>> the stepwise procedure above that was used to torture the data
> >>>> until they confessed.
> >>>>
> >>>> Frank
> >>> Please book this as a fortune.
> >>>
> >>> Dieter
> >> Seconded!
> >> Ted.
> >>
>
>--
>Frank E Harrell Jr   Professor and Chair           School of Medicine
>                       Department of Biostatistics   Vanderbilt University
>

================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"


