[R] Logistic regression X^2 test with large sample size (fwd)

David Winsemius dwinsemius at comcast.net
Tue Jul 31 21:22:51 CEST 2012

On Jul 31, 2012, at 10:25 AM, M Pomati wrote:

> Marc, thank you very much for your help.
> I've posted it on
> <http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models>
> and added details.

I think you might have gotten a more statistically knowledgeable  
audience at:


(And I suggested to the moderators at math-SE that it be migrated.)


> Many thanks
> Marco
> --On 31 July 2012 11:50 -0500 Marc Schwartz <marc_schwartz at me.com> wrote:
>> On Jul 31, 2012, at 10:35 AM, M Pomati <Marco.Pomati at bristol.ac.uk> wrote:
>>> Does anyone know of any X^2 tests to compare the fit of logistic models
>>> which factor out the sample size? I'm dealing with a very large sample
>>> and I fear the significant X^2 test I get when adding a variable to the
>>> model is simply a result of the sample size (>200,000 cases).
>>> I'd rather use the whole dataset instead of taking (small) random
>>> samples as it is highly skewed. I've seen things like Phi and Cramer's V
>>> for crosstabs but I'm not sure whether they have been used before on
>>> logistic regression, if there are better ones and if there are any
>>> packages.
>>> Many thanks
>>> Marco
>> Sounds like you are bordering on some type of stepwise approach to
>> including or not including covariates in the model. You can search the
>> list archives for a myriad of discussions as to why that is a poor
>> approach.
>>
>> You have the luxury of a large sample. You also have the challenge of
>> interpreting covariates that appear to be statistically significant, but
>> may have a rather small *effect size* in context. That is where subject
>> matter experts need to provide input as to interpretation of the
>> contextual significance of the variable, as opposed to the statistical
>> significance of that same variable.
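A minimal sketch of the statistical-versus-contextual significance point
above, written in Python with only the standard library. None of it comes
from the thread: the cell counts and function names are hypothetical. It
assumes a single binary covariate, in which case the likelihood-ratio X^2
for adding that covariate to an intercept-only logistic model equals the
G^2 statistic of the 2x2 table. Multiplying every cell by 100 leaves the
odds ratio (the effect size) unchanged while X^2 grows 100-fold.

```python
import math


def lr_chisq_2x2(a, b, c, d):
    """G^2 for the table [[a, b], [c, d]].

    Rows are covariate = 0 / 1, columns are event yes / no.
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n.
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the same 2x2 table."""
    return (a * d) / (b * c)


def p_value_1df(x):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(x / 2))


# Hypothetical counts: event rates of 10% vs 11% in two groups of 1,000.
small = (100, 900, 110, 890)
# Same rates, 100 times the sample size (200,000 cases in total).
large = tuple(100 * v for v in small)

for label, table in (("n = 2,000", small), ("n = 200,000", large)):
    x2 = lr_chisq_2x2(*table)
    print(label, "OR =", round(odds_ratio(*table), 3),
          "X^2 =", round(x2, 2), "p =", p_value_1df(x2))
```

The odds ratio is the quantity to take to subject matter experts. The
p-value mostly reports how many cases were available.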
>>
>> A general approach is to simply pre-specify your model based upon rather
>> simple considerations. Also, you need to determine if your goal for the
>> model is prediction or explanation.
>>
>> What is the incidence of your 'event' in the sample? If it is, say, 10%,
>> then you should have around 20,000 events. The rule of thumb for logistic
>> regression is to have around 20 events per covariate degree of freedom
>> (df) to minimize the risk of over-fitting the model to your dataset. A
>> continuous covariate is 1 df, a k-level factor is k-1 df. So with 20,000
>> events, your model could feasibly have 1,000 covariate df. I am guessing
>> that you don't have that much independent data to begin with.
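The arithmetic in the rule of thumb above can be sketched as follows, in
Python. The function names and the example covariate mix are hypothetical.
The 10% incidence, 200,000 cases and 20 events per df are the figures
quoted in the paragraph.

```python
def covariate_df(n_continuous, factor_levels):
    """Degrees of freedom used by a set of covariates.

    Each continuous covariate costs 1 df; a k-level factor costs k - 1 df.
    """
    return n_continuous + sum(k - 1 for k in factor_levels)


def max_covariate_df(n_obs, incidence, events_per_df=20):
    """Largest covariate df the rule of thumb allows for this sample."""
    events = round(n_obs * incidence)
    return events // events_per_df


# Figures from the paragraph above: 200,000 cases, 10% incidence.
budget = max_covariate_df(200_000, 0.10)

# Hypothetical model: 3 continuous covariates, one 2-level factor and
# one 4-level factor.
used = covariate_df(3, [2, 4])

print("df budget:", budget, "df used:", used)
```

With these inputs the budget comes to 20,000 / 20 = 1,000 df, far more
than a typical pre-specified model would use.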
>>
>> So, pre-specify your model on the full dataset and stick with it.
>> Interact with subject matter experts on the interpretation of the model.
>>
>> BTW, this question is really about statistical modeling generally, not
>> really R specific. Such queries are best posed to general statistical
>> lists/forums such as Stack Exchange. I would also point you to Frank
>> Harrell's book, Regression Modeling Strategies.
>> Regards,
>> Marc Schwartz
> ----------------------
> M Pomati
> University of Bristol

David Winsemius, MD
Alameda, CA, USA
