[R] How to improve, at all, a simple GLM code

Thu Mar 29 23:19:00 CEST 2012

Abigail Clifton <abigailclifton <at> me.com> writes:

> I am trying to fit a logit model to some data in a CSV file in R.

 It would be helpful to link back to your previous question:

http://thread.gmane.org/gmane.comp.lang.r.general/259353

> Here is my code:
> 
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> Prepared_Data
> attach(Prepared_Data)
> lrfit<-glm(C3~A1*B2*D4*E5,family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> shell.exec("CWModelA.csv")

  This is still not a reproducible example (we don't have your
CSV file, so we can't run the code), although it's a little closer.
Did you read the "recommended reading" in my previous answer?
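
  As a rough illustration of what a reproducible version could look
like, here is a minimal sketch that simulates data instead of reading
Prepared_Data.csv.  Only the column names come from the post; the
sample size, the effect sizes and the seed are assumptions made up for
the example.  It also passes data= to glm() rather than using attach().

```r
## Simulated stand-in for Prepared_Data.csv (values are made up)
set.seed(101)
n <- 200
dd <- data.frame(A1 = rnorm(n), B2 = rnorm(n),
                 D4 = rnorm(n), E5 = rnorm(n))
## Binary response; the true effects here are arbitrary assumptions
dd$C3 <- rbinom(n, size = 1, prob = plogis(0.5 * dd$A1 - 0.5 * dd$B2))

## Same model as in the question, but with data= instead of attach()
lrfit <- glm(C3 ~ A1 * B2 * D4 * E5, family = binomial, data = dd)
anova(lrfit, test = "Chisq")
```

  Anyone on the list can paste this into R and get the same fit,
which is the point of a reproducible example.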

> I am unsure as to how many methods there are of choosing a suitable model, 

 Lots, and it depends very much on why you are doing the analysis in
the first place.  Are you (1) trying to find a good predictive model?
(2) Looking for interesting patterns in the data?  (3) Trying to test
hypotheses about which predictors have a significant effect on the
outcome?  (4) Trying to partition the variance explained by different
predictors?

> however, I was hoping to fit the
> full/saturated model and choose the significant terms only as 
> my final model.

  In general this is a poor choice for goal #1 above, not necessarily
bad for #2, absolutely terrible for #3, irrelevant for #4.  I'm
guessing you are interested in the best predictive model, since you
mentioned something in your previous message about working out the
probability of default on loan applications.  I would say your best
bet is to use penalized approaches (see the glmnet package, and
library("sos"); findFn("lasso")).
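
  A minimal sketch of the penalized approach, assuming the glmnet
package is installed.  The data are simulated (six made-up continuous
predictors X1..X6 and a binary outcome), and restricting the model to
main effects plus two-way interactions is just a choice for the
example, not a recommendation for the real data.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(101)
  n <- 500
  dd <- data.frame(matrix(rnorm(n * 6), ncol = 6))  # columns X1..X6
  ## Binary outcome; the true effects are arbitrary assumptions
  y <- rbinom(n, size = 1, prob = plogis(dd$X1 - 0.5 * dd$X2))

  ## Main effects + all two-way interactions; drop the intercept
  ## column because glmnet adds its own
  x <- model.matrix(~ .^2, data = dd)[, -1]

  ## Lasso-penalized logistic regression (alpha = 1), with the
  ## penalty strength chosen by cross-validation
  cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)

  ## Coefficients at the "one standard error" penalty; terms that
  ## were shrunk to zero have been dropped from the model
  print(coef(cvfit, s = "lambda.1se"))
}
```

  The penalty does the term selection as part of the fit, rather
than fitting everything and then discarding non-significant terms.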

> My first question therefore: is there a better way to fit a model to
> some data? Is there a function or way of getting R to print the
> optimum model?

> My CSV file, when opened in excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run 'anova()' on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. if I did C3~A1*B2*D4*E5*F6*G7), R stops
> responding/I have to force quit. Why is this? How can I get around
> it as I need to include all 27 columns?

   For continuous predictors, the number of parameters in the
model with all possible interactions grows as 2^n for n predictors;
2^27 is >134 million, so you probably don't want to do that.  It's
potentially even worse for categorical predictors (the number of
parameters is the product of the numbers of levels of all the factors,
so e.g. 3^27 > 7*10^12 for 27 three-level predictors).
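
  The arithmetic is easy to check in R.  The variable names below
are made up for the illustration.

```r
n_pred <- 27            # number of predictor columns in the question

## Continuous predictors, all interactions: 2^n parameters
n_par_cont <- 2^n_pred  # about 1.34e8

## Three-level factors, all interactions: 3^n parameters
n_par_cat3 <- 3^n_pred  # about 7.6e12

n_par_cont
n_par_cat3
```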

  It's still not sufficiently clear why you're having a problem 
because you haven't given enough information: in the example I
gave in my previous answer, I used 7 continuous variables for
128 parameters without too much difficulty, but if you had (say)
5 levels for each of 7 predictors then you would be trying
to estimate 78125 parameters ...
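
  One way to count the parameters before trying to fit anything is
to count the columns of the model matrix.  This is a sketch on toy
data with made-up names; the second case uses three 5-level factors
rather than seven, just to keep the matrix small.

```r
## Seven continuous predictors, all interactions
set.seed(101)
d_cont <- as.data.frame(matrix(rnorm(20 * 7), ncol = 7))  # V1..V7
f_cont <- as.formula(paste("~", paste(names(d_cont), collapse = "*")))
n_cont <- ncol(model.matrix(f_cont, data = d_cont))       # 2^7

## Three 5-level factors, all interactions
d_fac <- expand.grid(f1 = factor(1:5), f2 = factor(1:5),
                     f3 = factor(1:5))
n_fac <- ncol(model.matrix(~ f1 * f2 * f3, data = d_fac)) # 5^3

c(continuous = n_cont, categorical = n_fac)
```

  The same column count on the real data (or on a subset of its
columns) would show how large the requested model is.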

  Bottom line, it may simply not be reasonable to fit the
saturated model.  Hard-core machine learning approaches (and
*maybe* the penalized regression approaches) might be able
to handle a few thousand predictors for n=3500, but a model
with tens of thousands of parameters (or more) feels somewhat crazy.
(Someone else is welcome to tell me how this could be done.)

  Ben Bolker