[R] Logistic regression X^2 test with large sample size (fwd)
dwinsemius at comcast.net
Tue Jul 31 21:22:51 CEST 2012
On Jul 31, 2012, at 10:25 AM, M Pomati wrote:
> Marc, thank you very much for your help.
> I've posted it on
> and added details.
I think you might have gotten a more statistically knowledgeable
(And I suggested to the moderators at math-SE that it be migrated.)
> Many thanks
> --On 31 July 2012 11:50 -0500 Marc Schwartz <marc_schwartz at me.com> wrote:
>> On Jul 31, 2012, at 10:35 AM, M Pomati <Marco.Pomati at bristol.ac.uk> wrote:
>>> Does anyone know of any X^2 tests to compare the fit of logistic
>>> regression models which factor out the sample size? I'm dealing with a
>>> very large sample and I fear the significant X^2 test I get when adding
>>> a variable to the model is simply a result of the sample size
>>> (>200,000 cases).
>>> I'd rather use the whole dataset instead of taking (small) random
>>> samples, as it is highly skewed. I've seen things like Phi and Cramer's
>>> V for crosstabs but I'm not sure whether they have been used before on
>>> logistic regression, if there are better ones and if there are any
>>> packages.
>>> Many thanks
>> Sounds like you are bordering on some type of stepwise approach to
>> including or not including covariates in the model. You can search
>> the list archives for a myriad of discussions as to why that is a poor
>> approach.
>> You have the luxury of a large sample. You also have the challenge of
>> interpreting covariates that appear to be statistically significant, but
>> may have a rather small *effect size* in context. That is where subject
>> matter experts need to provide input as to interpretation of the
>> significance of the variable, as opposed to the statistical
>> significance of that same variable.
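
To make the distinction between statistical significance and effect size concrete, here is a minimal simulated sketch in R. Everything in it is an assumption for illustration and not taken from the poster's data: the variable names (x1, x2, y), the coefficient values, and the sample size of 200,000. It compares two nested logistic models with a likelihood ratio X^2 test, then reports the odds ratio for the added covariate and the change in McFadden's pseudo-R^2 as rough effect-size summaries.

```r
# Simulated data only: names and coefficients are hypothetical.
set.seed(1)
n   <- 200000
x1  <- rnorm(n)
x2  <- rnorm(n)                        # covariate given a deliberately tiny effect
eta <- -2.2 + 0.5 * x1 + 0.03 * x2     # true log-odds used for the simulation
y   <- rbinom(n, 1, plogis(eta))

fit0 <- glm(y ~ x1,      family = binomial)
fit1 <- glm(y ~ x1 + x2, family = binomial)

# Likelihood ratio chi-square test for adding x2 to the model.
lrt <- anova(fit0, fit1, test = "Chisq")
print(lrt)

# Effect size: odds ratio per unit (here, per SD) of x2 with a Wald interval.
or_x2 <- exp(cbind(OR = coef(fit1), confint.default(fit1)))["x2", ]
print(or_x2)

# Change in McFadden's pseudo-R^2. For ungrouped binary data the residual
# deviance equals -2 * logLik, so 1 - deviance / null deviance is McFadden's.
null_dev <- fit1$null.deviance
r2_0     <- 1 - deviance(fit0) / null_dev
r2_1     <- 1 - deviance(fit1) / null_dev
delta_r2 <- r2_1 - r2_0
print(delta_r2)
```

With a sample this large the X^2 test will typically flag x2 even though its odds ratio is close to 1 and the pseudo-R^2 barely moves. Whether an effect of that size matters is a subject matter question, not something the test can answer.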
>> A general approach is to simply pre-specify your model based upon
>> simple considerations. Also, you need to determine if your goal for the
>> model is prediction or explanation.
>> What is the incidence of your 'event' in the sample? If it is say
>> then you should have around 20,000 events. The rule of thumb for logistic
>> regression is to have around 20 events per covariate degree of
>> freedom (df)
>> to minimize the risk of over-fitting the model to your dataset. A
>> continuous covariate is 1 df, a k-level factor is k-1 df. So with
>> events, your model could feasibly have 1,000 covariate df's. I am
>> that you don't have that much independent data to begin with.
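
The rule-of-thumb arithmetic above can be sketched in a few lines of R. The counts are the approximate figures from this thread, and df_used() is a hypothetical helper written for this illustration, not a function from any package. Using the smaller of the event and non-event counts as the limiting sample size is an added assumption, in line with the usual statement of the rule for binary outcomes.

```r
# Approximate figures from the discussion; replace with your own counts.
n_cases       <- 200000   # total sample size
n_events      <- 20000    # number of events
events_per_df <- 20       # rule-of-thumb value quoted above

# Limiting sample size: the smaller of the event and non-event counts.
limiting_n <- min(n_events, n_cases - n_events)

# Covariate degrees of freedom the rule of thumb would allow.
max_df <- floor(limiting_n / events_per_df)
max_df   # 1000

# Hypothetical helper: df used by a candidate model, counting 1 per
# continuous covariate and k - 1 per k-level factor. Interactions and
# nonlinear terms are not counted here.
df_used <- function(n_continuous, factor_levels) {
  n_continuous + sum(factor_levels - 1)
}

# Example: 3 continuous covariates plus factors with 2, 4 and 5 levels.
df_used(n_continuous = 3, factor_levels = c(2, 4, 5))   # 11
```

A pre-specified model using 11 df is far below the 1,000 df ceiling, so over-fitting in the events-per-df sense is unlikely to be the binding constraint here.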
>> So, pre-specify your model on the full dataset and stick with it.
>> with subject matter experts on the interpretation of the model.
>> BTW, this question is really about statistical modeling generally, not
>> really R specific. Such queries are best posed to general statistical
>> lists/forums such as Stack Exchange. I would also point you to Frank
>> Harrell's book, Regression Modeling Strategies.
>> Marc Schwartz
> M Pomati
> University of Bristol
David Winsemius, MD
Alameda, CA, USA