[R] Logistic Regression: variable selection based on p value?
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Dec 4 14:53:03 CET 2008
pufftissue pufftissue wrote:
> When I use logistic regression, each variable has a p value associated with
> it. Do I only include the variables that have a statistically significant p
> value (<0.05), or are there situations when I should include variables when
> their p values are high? I had heard that if a variable has a high p value
> but it's not the terminal variable, keep it; otherwise, take it out. Not
> sure if it's right or even why this is the case. What about if my p values
> are terrible but this combo of variables yields the highest AUC and
> calibration? What prevails in this case?
> Thank you!
It depends on your goals, but in general the problems caused by stepwise
regression arise from using P-value cutoffs that are too small rather
than cutoffs that are too large. There are many reasons not to remove
any variables if you want valid confidence intervals, P-values, and
discrimination indexes. Note that AUC is not a great objective
function; that's why we have the log likelihood.
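
A minimal sketch of the point above, using simulated data and base R
only. The variable names (x1, x2, x3), the simulated effect sizes, and
the auc() helper are all assumptions made for illustration and are not
from this thread. It fits a full prespecified model and a reduced one,
then compares them on log likelihood with a likelihood ratio test, and
prints the AUC of each fit alongside for comparison.

```r
# Simulated data; every name and effect size here is illustrative.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                       # pure noise: no true effect on y
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.3 * x2))
d  <- data.frame(y, x1, x2, x3)

# Full prespecified model: every candidate variable stays in.
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)

# Reduced model, of the kind a P-value cutoff might leave behind.
reduced <- glm(y ~ x1, family = binomial, data = d)

# Compare the nested fits on the log likelihood scale.
print(logLik(full))
print(logLik(reduced))
lrt <- anova(reduced, full, test = "Chisq")   # likelihood ratio test
print(lrt)

# Hypothetical helper: AUC (c-index) from the rank-sum identity.
auc <- function(p, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
print(auc(fitted(full), d$y))
print(auc(fitted(reduced), d$y))
```

Both AUCs are computed on the same data used for fitting, so they are
optimistic. The likelihood ratio test uses the full probability scale
of the predictions, whereas the AUC depends only on their rank order.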
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine, Vanderbilt University