[R] regression methods for rare events?

Marc Schwartz marc_schwartz at me.com
Mon Jun 4 23:27:53 CEST 2012


On Jun 4, 2012, at 3:47 PM, David Studer wrote:

> Hi everybody!
> 
> I have a sample with n=2.000. This sample contains rare events (10, 20, 30
> individuals with a specific illness).
> Now I'd like to do a logistic regression in order to identify risk factors.
> I have several independent variables on an interval
> scale.
> 
> Does anyone know whether the number of these rare events is sufficient in
> order to calculate a multivariate
> logistic regression? Or are there any alternative models I should use?
> (which are available in R)
> 
> Thank you very much for any advice!
> David



The quick answer is yes, you can, but you will be very limited in how many covariates you can include in each of the respective models.

You are looking at event rates of 0.5%, 1.0% and 1.5%, which in my experience are not truly "rare", per se. We had a recent post with an event rate on the order of 0.006%, albeit with millions of records. That is rare... :-)

Typical "rules of thumb" for avoiding over-fitting in logistic regression (LR) models suggest that you should have between 10 and 20 "events" per covariate degree of freedom (df). A continuous covariate entered as a single linear term would be 1 df; an N-level factor would be N-1 df.

With your sample and the number of events, you would be limited to perhaps no more than 2 or 3 covariate df (30 events at 10 events per df gives 3 df; the 10 and 20 event outcomes allow fewer), and even then you should give consideration to using penalization to avoid over-fitting.
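
To make the arithmetic concrete, here is a rough sketch in base R of the df budget implied by the 10 to 20 events per df rule of thumb. The helper covariate_df() and the example covariates (age, stage) are hypothetical and only for illustration; nothing here is a hard limit.

```r
# Rough events-per-df budget for n = 2000 and 10, 20 or 30 events.
n      <- 2000
events <- c(10, 20, 30)

# df used by one covariate: N - 1 for an N-level factor, otherwise 1
# (a continuous covariate entered as a single linear term).
# Hypothetical helper, not part of any package.
covariate_df <- function(x) if (is.factor(x)) nlevels(x) - 1 else 1

budget <- data.frame(
  events     = events,
  event_rate = 100 * events / n,    # percent
  max_df_20  = floor(events / 20),  # conservative: 20 events per df
  max_df_10  = floor(events / 10)   # liberal: 10 events per df
)
print(budget)

# Hypothetical candidates: one continuous covariate and one 3-level factor.
age   <- c(34, 51, 47, 62)
stage <- factor(c("I", "II", "III", "II"))
covariate_df(age) + covariate_df(stage)  # 1 + 2 = 3 df
```

Under the liberal end of the rule, that pair of covariates would already use up the whole budget for the 30 event outcome.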

Two references that would be helpful to you are:

Frank Harrell's "Regression Modeling Strategies" book:
http://www.amazon.com/exec/obidos/ASIN/0387952322/

There is a helpful and updated PDF download here:

  http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf

and I would focus, in your case, on the use of the lrm() function in Frank's rms CRAN package, along with related tools for penalization and validation.
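
As a starting point, a minimal sketch of that workflow might look like the following. It assumes the rms package is installed and uses simulated data in place of the real sample; the variable names (ill, x1, x2), the simulated effect sizes and the penalty grid are all made up for illustration, so treat it as an outline rather than a recommended analysis.

```r
# Sketch only: simulated data stand in for the real sample.
library(rms)

set.seed(1)
n   <- 2000
x1  <- rnorm(n)
x2  <- rnorm(n)
ill <- rbinom(n, 1, plogis(-4.5 + 0.8 * x1))  # event rate of roughly 1.5%
d   <- data.frame(ill, x1, x2)

# x = TRUE, y = TRUE keep the design matrix and response in the fit,
# which pentrace() and validate() need.
fit <- lrm(ill ~ x1 + x2, data = d, x = TRUE, y = TRUE)

# Compare a (hypothetical) grid of penalties; pt$penalty holds the
# penalty judged best by the default criterion.
pt <- pentrace(fit, penalty = c(0, 0.5, 1, 2, 4, 8, 16))

# Refit with the chosen penalty, then bootstrap-validate the fit.
fit_pen <- lrm(ill ~ x1 + x2, data = d, x = TRUE, y = TRUE,
               penalty = pt$penalty)
validate(fit_pen, B = 200)
```

The optimism-corrected indexes from validate() give a sense of how much of the apparent fit is likely to be over-fitting.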

Also, Steyerberg's "Clinical Prediction Models" book:
http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X

which is an excellent reference and has relevant examples using R.

The rumor is that Frank is working on a new edition of his book, with a greater focus on the use of R, and that it is due RSN. Perhaps there will be copies at useR in Nashville next week? One could hope... :-)

Regards,

Marc Schwartz


