[R] How to improve, at all, a simple GLM code
bbolker at gmail.com
Thu Mar 29 23:19:00 CEST 2012
Abigail Clifton <abigailclifton <at> me.com> writes:
> I am trying to fit a logit model to some data in a CSV file in R.
It would be helpful to link back to your previous question:
> Here is my code:
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> lrfit<-glm(C3~A1*B2*D4*E5,family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
This is still not a reproducible example, although
it's a little closer. Did you read the "recommended reading"
in my previous answer???
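For what it's worth, here is a minimal sketch of what a reproducible version of the quoted code might look like. The simulated data frame is purely an assumption standing in for Prepared_Data.csv (I have guessed that the four predictors are continuous and that C3 is a 0/1 response). Note also that, as posted, glm() is never told to look for the variables in Prepared_Data; the data= argument below takes care of that.

```r
# Simulated stand-in for Prepared_Data.csv (an assumption, not the real data):
# binary response C3 and four continuous predictors A1, B2, D4, E5
set.seed(101)
n <- 200
Prepared_Data <- data.frame(A1 = rnorm(n), B2 = rnorm(n),
                            D4 = rnorm(n), E5 = rnorm(n))

# the simulated "truth" has main effects of A1 and B2 only
eta <- -0.5 + 1.0 * Prepared_Data$A1 - 0.8 * Prepared_Data$B2
Prepared_Data$C3 <- rbinom(n, size = 1, prob = plogis(eta))

# data= tells glm() where to find C3, A1, B2, D4 and E5
lrfit <- glm(C3 ~ A1 * B2 * D4 * E5, family = binomial, data = Prepared_Data)

# sequential analysis-of-deviance table, one row per term plus a NULL row
aov_tab <- anova(lrfit, test = "Chisq")
print(aov_tab)
```

Anyone on the list can paste this into a fresh R session and get the same kind of output, which is the point of a reproducible example.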
> I am unsure as to how many methods there are of choosing a suitable model,
Lots, and it depends very much on why you are doing the analysis in
the first place. Are you (1) trying to find a good predictive model?
(2) looking for interesting patterns in the data? (3) trying to test
hypotheses about which predictors have a significant effect on the
outcome? (4) trying to partition the variance explained by different
predictors?
> however, I was hoping to fit the
> full/saturated model and choose the significant terms only as
> my final model.
In general this is a poor choice for goal #1 above, not necessarily
bad for #2, absolutely terrible for #3, irrelevant for #4. I'm
guessing you are interested in the best predictive model, since you
mentioned something in your previous message about working out the
probability of default on loan applications. I would say your best
bet is to use penalized approaches (see the glmnet package, and
> My first question therefore: is there a better way to fit a model to
> some data? Is there a function or way of getting R to print the
> optimum model?
> My CSV file, when opened in excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run 'anova()' on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. if I did C3~A1*B2*D4*E5*F6*G7), R stops
> responding/I have to force quit. Why is this? How can I get around
> it as I need to include all 27 columns?
For continuous predictors, the number of parameters of the
saturated model grows as 2^n; 2^27 is >134 million, so you
probably don't want to do that. It's potentially even worse
for categorical predictors (the count is the product of the
numbers of levels of the predictors, so e.g. 3^27 > 7*10^12
for 27 three-level predictors).
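The arithmetic can be checked directly in R. This is just a sketch; n_params is a made-up helper name, not a function from any package. Each continuous predictor counts as 2 (it is either in or out of a given interaction term), and each factor counts as its number of levels.

```r
# Number of coefficients (including the intercept) in a model with all
# possible interactions, given the number of levels of each predictor.
# A continuous predictor is entered as 2.
# (n_params is a hypothetical helper, defined here for illustration only.)
n_params <- function(levels_per_predictor) prod(levels_per_predictor)

n_cont27 <- n_params(rep(2, 27))  # 27 continuous predictors: 2^27
n_fac27  <- n_params(rep(3, 27))  # 27 three-level factors:   3^27

print(format(c(continuous = n_cont27, three_level = n_fac27),
             big.mark = ",", scientific = FALSE))
```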
It's still not sufficiently clear why you're having a problem
because you haven't given enough information: in the example I
gave in my previous answer, I used 7 continuous variables for
128 parameters without too much difficulty, but if you had (say)
5 levels for each of 7 predictors then you would be trying
to estimate 78125 parameters ...
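If you want to see these counts without fitting anything, you can ask model.matrix() how many columns the design matrix would have. The sketch below uses made-up data (7 continuous columns, and separately 3 five-level factors to keep things small); the column names are R's defaults or my own invention, not anything from your file.

```r
# 7 continuous predictors (columns V1..V7, simulated for illustration)
set.seed(1)
d_cont <- as.data.frame(matrix(rnorm(50 * 7), ncol = 7))

# ~ .^7 expands to all main effects and interactions up to 7-way
p_cont <- ncol(model.matrix(~ .^7, data = d_cont))
p_cont   # 2^7 = 128 columns

# 3 five-level factors, every combination of levels present once
d_fac <- expand.grid(f1 = factor(1:5), f2 = factor(1:5), f3 = factor(1:5))
p_fac <- ncol(model.matrix(~ f1 * f2 * f3, data = d_fac))
p_fac    # 5^3 = 125 columns
```

Extending the second case to 7 factors gives the 5^7 = 78125 columns mentioned above, so counting columns this way on a small subset of predictors is a cheap check before calling glm().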
Bottom line, it may simply not be reasonable to fit the
saturated model. Hard-core machine learning approaches (and
*maybe* the penalized regression approaches) might be able
to handle a few thousand predictors for n=3500, but a model
with tens of thousands of parameters (or more) feels somewhat crazy.
(Someone else is welcome to tell me how this could be done.)
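In case it helps, here is a rough sketch of the penalized approach using cv.glmnet() from the glmnet package (assumed to be installed). The data are simulated, the predictor names x1..x20 are invented, and I have used main effects only with a lasso penalty (alpha = 1); none of those choices comes from your data, so treat it as a starting point rather than a recipe.

```r
library(glmnet)  # assumed to be installed: install.packages("glmnet")

# simulated data (an assumption): 500 cases, 20 continuous predictors,
# of which only x1 and x2 actually affect the outcome
set.seed(101)
n <- 500
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, size = 1, prob = plogis(x[, "x1"] - 0.8 * x[, "x2"]))

# lasso-penalized logistic regression; the penalty strength lambda
# is chosen by (default 10-fold) cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# coefficients at the "one standard error" lambda; many are shrunk to zero
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
print(b[b[, 1] != 0, , drop = FALSE])

# predicted probabilities for the first five cases
p_hat <- predict(cvfit, newx = x[1:5, ], s = "lambda.1se", type = "response")
print(p_hat)
```

With real data you would build x from your data frame with model.matrix() (dropping the intercept column), which is also where you would decide which interactions, if any, to offer to the model.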