[R] Logistic regression X^2 test with large sample size (fwd)
David Winsemius
dwinsemius at comcast.net
Tue Jul 31 21:22:51 CEST 2012
On Jul 31, 2012, at 10:25 AM, M Pomati wrote:
> Marc, thank you very much for your help.
> I've posted it on
>
> <http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models>
>
> and added details.
I think you might have gotten a more statistically knowledgeable
audience at:
http://stats.stackexchange.com/
(And I suggested to the moderators at math-SE that it be migrated.)
--
David.
>
> Many thanks
>
> Marco
>
> --On 31 July 2012 11:50 -0500 Marc Schwartz <marc_schwartz at me.com>
> wrote:
>
>> On Jul 31, 2012, at 10:35 AM, M Pomati <Marco.Pomati at bristol.ac.uk>
>> wrote:
>>
>>> Does anyone know of any X^2 tests to compare the fit of logistic
>>> models
>>> which factor out the sample size? I'm dealing with a very large
>>> sample and
>>> I fear the significant X^2 test I get when adding a variable to
>>> the model
>>> is simply a result of the sample size (>200,000 cases).
>>>
>>> I'd rather use the whole dataset instead of taking (small) random
>>> samples, as it is highly skewed. I've seen measures like Phi and
>>> Cramer's V for crosstabs, but I'm not sure whether they have been
>>> used with logistic regression, whether better measures exist, and
>>> whether any R packages implement them.
>>>
>>>
>>> Many thanks
>>>
>>> Marco
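[For reference, the nested-model X^2 comparison the question describes is
usually done in R as a likelihood-ratio test with anova() on two glm()
fits. A minimal sketch on simulated data; the object names (dat, m0, m1)
and coefficient values are illustrative only:]

```r
## Sketch: likelihood-ratio (X^2) test for adding one covariate.
## Simulated data; with n this large, even the tiny x2 effect below
## can produce a highly "significant" X^2 -- which is the poster's concern.
set.seed(1)
n   <- 200000
x1  <- rnorm(n)
x2  <- rnorm(n)
p   <- plogis(-2 + 0.5 * x1 + 0.02 * x2)   # x2 has a very small effect
y   <- rbinom(n, 1, p)
dat <- data.frame(y, x1, x2)

m0 <- glm(y ~ x1,      family = binomial, data = dat)  # reduced model
m1 <- glm(y ~ x1 + x2, family = binomial, data = dat)  # adds x2

## X^2 test on the change in deviance (1 df here); compare the p-value
## with the effect size, e.g. the odds ratio exp(coef(m1)["x2"]).
anova(m0, m1, test = "Chisq")
```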
>>
>>
>> Sounds like you are bordering on some type of stepwise approach to
>> including or not including covariates in the model. You can search the
>> list archives for a myriad of discussions as to why that is a poor
>> approach.
>>
>> You have the luxury of a large sample. You also have the challenge of
>> interpreting covariates that appear to be statistically significant,
>> but may have a rather small *effect size* in context. That is where
>> subject matter experts need to provide input as to the interpretation
>> of the contextual significance of a variable, as opposed to the
>> statistical significance of that same variable.
>>
>> A general approach is to simply pre-specify your model based upon
>> rather simple considerations. Also, you need to determine whether your
>> goal for the model is prediction or explanation.
>>
>> What is the incidence of your 'event' in the sample? If it is, say,
>> 10%, then you should have around 20,000 events. The rule of thumb for
>> logistic regression is to have around 20 events per covariate degree
>> of freedom (df) to minimize the risk of over-fitting the model to your
>> dataset. A continuous covariate is 1 df; a k-level factor is k - 1 df.
>> So with 20,000 events, your model could feasibly have 1,000 covariate
>> df. I am guessing that you don't have that much independent data to
>> begin with.
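[The rule of thumb is easy to compute directly. A sketch using the
numbers from this thread, assuming the hypothetical 10% incidence:]

```r
## Events-per-df rule of thumb (numbers from the thread; 10% incidence
## is Marc's hypothetical, not a fact about the poster's data).
n_cases   <- 200000
incidence <- 0.10
events    <- n_cases * incidence   # 20,000 events
max_df    <- events / 20           # ~20 events per df => 1,000 covariate df
## A continuous covariate costs 1 df; a k-level factor costs k - 1 df.
c(events = events, max.covariate.df = max_df)
```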
>>
>> So, pre-specify your model on the full dataset and stick with it.
>> Interact with subject matter experts on the interpretation of the
>> model.
>>
>> BTW, this question is really about statistical modeling generally, not
>> really R specific. Such queries are best posed to general statistical
>> lists/forums such as Stack Exchange. I would also point you to Frank
>> Harrell's book, Regression Modeling Strategies.
>>
>> Regards,
>>
>> Marc Schwartz
>>
> ----------------------
> M Pomati
> University of Bristol
>
David Winsemius, MD
Alameda, CA, USA