[R] Logistic regression goodness of fit tests

Thu Mar 10 22:36:09 CET 2005

On Thu, 10 Mar 2005, Trevor Wiens wrote:

> I was unsure of what suitable goodness-of-fit tests existed in R for
> logistic regression. After searching the R-help archive I found that
> lrm models from the Design package, together with resid, could be used
> to calculate this as follows:
> 
> d <- datadist(mydataframe)
> options(datadist = 'd')
> fit <- lrm(response ~ predictor1 + predictor2..., data = mydataframe, x = TRUE, y = TRUE)
> resid(fit, 'gof')
> 
> I set up a script that first uses glm to create models and then stepAIC to
> determine the optimal model. I used this instead of fastbw because I
> found the AIC values to be completely different and the final models
> didn't always match. Then my script takes the reduced model formula and
> recreates it using lrm as above. Now the problem is that for some models
> I run into an error to which I can find no reference whatsoever on the
> mailing list or on the web. It is as follows:
> 
> test.lrm <- lrm(cclo ~ elev + aspect + cti_var + planar + feat_div + loamy + sands + sandy + wet + slr_mean, data=datamatrix, x = T, y = T)
> singular information matrix in lrm.fit (rank= 10 ).  Offending variable(s):
> slr_mean 
> Error in j:(j + params[i] - 1) : NA/NaN argument
> 
> 
> Now if I set the singularity criterion (tol) to a value smaller than
> the default of 1E-7, such as 1E-9 or 1E-12 (the default in calibrate),
> it works. Why is that?
> 
> Not being a statistician but a biogeographer using regression as a tool,
> I don't really understand what is happening here.

From one geographer to another, and being prepared to bow to
better-founded explanations, you seem to have included a variable - the
offending variable slr_mean - that is very highly correlated with another.  
Making the tolerance smaller says that you are prepared to take the risk
of confounding your results. You've already "been fishing" for right hand
side variables anyway, so your results are somewhat prejudiced, aren't
they?
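
To see the kind of near-collinearity I mean, here is a minimal sketch on
simulated data. The variable names (x1, x2, x3) are made up and have nothing
to do with your data frame, and base R's qr() only stands in for the
singularity check; I am not claiming that this is what lrm.fit does
internally. It shows how a predictor that is almost a copy of another is
dropped under a strict tolerance and kept under a very small one.

```r
# Simulated predictors; all names are hypothetical, not from the original data.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)   # nearly an exact copy of x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

# Pairwise correlations: values very close to +1 or -1 flag near-duplicates.
r <- cor(X)
print(round(r, 6))

# Condition number of the standardised matrix: a very large value
# indicates that the columns are close to linearly dependent.
k <- kappa(scale(X), exact = TRUE)
print(k)

# Numerical rank as judged by QR under a strict and a very small tolerance.
rank_strict <- qr(X, tol = 1e-3)$rank   # x2 is treated as redundant
rank_loose  <- qr(X, tol = 1e-12)$rank  # x2 is kept as a separate column
print(c(strict = rank_strict, loose = rank_loose))
```

Running cor() on the right-hand-side columns of your own data frame would be
the obvious first check for slr_mean.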

I think you may also like to review which of the right hand side variables
should be treated as factors rather than numeric (looking at the summary
suggests that many are factors), and perhaps the dependent variable too,
although lrm() seems to take care of this if you haven't.
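
As a rough illustration of the difference this makes, the sketch below fits
the same model twice with glm(), once with a coded category left as numeric
and once converted with factor(). The data and the names (soil, elev, y) are
invented for the example and are not your variables.

```r
# Hypothetical data: 'soil' is a category stored as integer codes 1, 2, 3.
set.seed(1)
dat <- data.frame(
  y    = rbinom(120, 1, 0.5),
  soil = sample(1:3, 120, replace = TRUE),
  elev = rnorm(120, mean = 800, sd = 50)
)

# Left as numeric, 'soil' gets a single slope, as if the codes were a scale.
fit_num <- glm(y ~ soil + elev, family = binomial, data = dat)

# As a factor, 'soil' gets one coefficient per non-reference level.
dat$soil_f <- factor(dat$soil)
fit_fac <- glm(y ~ soil_f + elev, family = binomial, data = dat)

n_coef_num <- length(coef(fit_num))  # intercept + soil + elev
n_coef_fac <- length(coef(fit_fac))  # intercept + 2 soil levels + elev
print(c(numeric = n_coef_num, factor = n_coef_fac))
```

Calling str() on the data frame shows which columns are currently stored as
numeric and which as factors.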

> 
> Does changing the tol argument change how I should interpret
> goodness-of-fit results or other evaluations of the models created?
> 
> I've included a summary of the data below (in case it might be helpful)
> with all variables in the data frame as it was easier than selecting out
> the ones used in the model.
> 
> Thanks in advance.
> 
> T
> 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand at nhh.no