[R] Data handling/optimum glm method.

Ben Bolker bbolker at gmail.com
Thu Mar 29 14:41:21 CEST 2012


 <abigailclifton <at> me.com> writes:


> I am trying to fit a generalised linear model to some loan
> application and default data. The purpose of this is to eventually
> work out the probability an applicant will default.
 
> However, R seems to crash or die when I run "glm" on anything
>  greater than a 5-way saturated model for my data.

  What does "crash or die" mean?  Are you getting error messages?
What are they? Is the R application actually quitting?
 
> My first question: is the best way to fit a generalised linear model
> in R to fit the saturated model and extract the significant terms
> only, or to start at the null model and to work up to the optimum
> one?

  This is more of a statistical-practice question than an R question.
Opinions differ, but in general I would say that, if it is
computationally feasible, you should start (and maybe finish) with the
full model.
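
  For example, a minimal sketch with simulated data (the variable names
here are made up, not from your loan data): fit the full model first,
then use drop1() to test whether the highest-order terms can be dropped.

set.seed(101)
dd <- data.frame(def = rbinom(200, 1, 0.3),
                 x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
full <- glm(def ~ x1 * x2 * x3, data = dd, family = binomial)
drop1(full, test = "Chisq")  # likelihood-ratio test for dropping x1:x2:x3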
 
> I am importing a csv file with 3500 rows and 27 columns (3500x27 matrix).
 
> My second question: is there any way to increase the memory
> I have so that R can cope with more analysis?

   help("Memory-limits")
> 
> I can send my code if it would help to answer the question.

  Please read the posting guide (link at the bottom of every R-help
posting) and follow its advice.  We don't know enough about your
situation to help.  You could also try reading 
http://tinyurl.com/reproducible-000 ...

  This works for me:

## simulate data of the same shape: 3500 rows, 27 predictors
z <- matrix(rnorm(3500*27), ncol=27)
y <- sample(0:1, replace=TRUE, size=3500)
colnames(z) <- c(letters, "A")   # 27 column names: a-z plus A
d <- data.frame(y=y, z)
## additive model with all 27 predictors
gg <- glm(y~., data=d, family="binomial")
## 8-way model with all interactions among a-h
gg <- glm(y~a*b*c*d*e*f*g*h, data=d, family="binomial")
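
  If "crash" turns out to mean running out of memory, one thing worth
checking (still with the simulated data above, so the numbers are only
illustrative) is how large the model matrix for the saturated formula
gets before you hand it to glm():

X <- model.matrix(~ a*b*c*d*e*f*g*h, data = d)
dim(X)  # 3500 rows by 2^8 = 256 columns for eight numeric predictors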


