[R] Logistic regression problem
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Oct 1 01:56:45 CEST 2008
Bernardo Rangel Tura wrote:
> On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
>> I have a huge data set with thousands of variables and one binary
>> variable. I know that most of the variables are correlated and are not
>> good predictors... but...
>> It is very hard to start modeling with such a huge dataset. What would
>> be your suggestion? How to make a first cut... how to eliminate most
>> of the variables but not ignore potential interactions... for
>> example, maybe variable A is not a good predictor and variable B is not
>> a good predictor either, but maybe A and B together are good.
>> Any suggestion is welcome.
> I think you could start with rpart("binary variable"~.)
> This shows you a set of variables to start a model with, and a starting
> set of cutoffs for continuous variables
I cannot imagine a worse way to formulate a regression model. Reasons:
1. Results of recursive partitioning are not trustworthy unless the
sample size exceeds 50,000 or the signal to noise ratio is extremely high.
2. The type I error of tests from the final regression model will be
inflated.
3. False interactions will appear in the model.
4. The cutoffs so chosen will not replicate and in effect assume that
covariate effects are discontinuous and piecewise flat. The use of
cutoffs results in a huge loss of information and power and makes the
analysis arbitrary and impossible to interpret (e.g., a high covariate
value:low covariate value odds ratio or mean difference is a complex
function of all the covariate values in the sample).
5. The model will not validate in new data.
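
Point 4 can be checked with a small simulation. The following is only a rough sketch under stated assumptions, not anything from the original thread: the data are simulated with a smooth (linear in the log odds) covariate effect, and best_cutoff() is a hypothetical helper that stands in for the first split of a tree by picking the cutoff with the lowest deviance. It re-estimates the "optimal" cutoff in bootstrap resamples, then compares the likelihood ratio chi-square of a model that keeps x continuous with one that dichotomizes x at its median.

```r
# Simulated data (an assumption for illustration): the true log odds equal x,
# so there is no threshold anywhere in the covariate effect.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

# Hypothetical helper: among candidate quantiles of x, return the cutoff whose
# two-group logistic model has the smallest deviance. This is a simplified
# stand-in for the first split of a classification tree.
best_cutoff <- function(x, y) {
  cand <- quantile(x, probs = seq(0.1, 0.9, by = 0.05))
  dev <- sapply(cand, function(cutpoint) {
    deviance(glm(y ~ I(x > cutpoint), family = binomial))
  })
  unname(cand[which.min(dev)])
}

# Re-estimate the "optimal" cutoff in 50 bootstrap resamples of the same data.
cutoffs <- replicate(50, {
  i <- sample(n, replace = TRUE)
  best_cutoff(x[i], y[i])
})
print(range(cutoffs))

# Likelihood ratio chi-square (null deviance minus residual deviance) for the
# continuous covariate versus the covariate dichotomized at its median.
fit_cont <- glm(y ~ x, family = binomial)
fit_cut  <- glm(y ~ I(x > median(x)), family = binomial)
lr_cont  <- fit_cont$null.deviance - deviance(fit_cont)
lr_cut   <- fit_cut$null.deviance - deviance(fit_cut)
print(c(continuous = lr_cont, dichotomized = lr_cut))
```

The spread of the bootstrap cutoffs gives a feel for how poorly a data-chosen cutoff replicates, and the gap between the two chi-square values gives a feel for the information given up by dichotomizing. Exact numbers depend on the seed and sample size, so treat the output as illustrative only.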
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University