[R] min frequencies of categorical predictor variables in GLM

Mon Aug 3 16:46:20 CEST 2009

On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:

> Hi,
>
> Suppose a binomial GLM with both continuous and categorical
> predictors (sometimes referred to as GLM-ANCOVA, if I remember
> correctly). For the categorical predictors (= indicator variables), is
> there a suggested minimum frequency for each level? Would such
> a rule/recommendation depend on the y-side too?
>
> Example: N is quite large, a bit > 100. Observed however are only  
> 0/1s (so Bernoulli random variables, not Binomial, because the  
> covariates are from observations and in general always different  
> between observations). There are two categorical predictors, each  
> with 2 levels. It would structurally probably also make sense to  
> allow an interaction between those, yielding de facto a single  
> categorical predictor with 4 levels. Is there then a minimum number of
> observations that should fall in each of the 4 categories (either
> absolute or relative), possibly also taking the y-side into account?

Must be the day for sample size questions for logistic regression. A  
similar query is on MedStats today.

The typical minimum sample size recommendation for logistic regression  
is based upon covariate degrees of freedom (or columns in the model  
matrix). The guidance is that there should be 10 to 20 *events* per  
covariate degree of freedom.

So if you have 2 factors, each with two levels, that gives you two
covariate degrees of freedom total (two columns in the model matrix,
not counting the intercept). At the high end of the above range, you
would need 40 events in your sample.

If the event incidence in your sample is 10%, you would need 400 cases  
to observe 40 events to support the model with the two two-level  
covariates (Y ~ X1 + X2).
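
The arithmetic above can be sketched in a few lines of R. This is only an
illustration of the rule of thumb; required_cases is a hypothetical helper
written for this post, not a function from any package, and the default of
20 events per degree of freedom is simply the high end of the range quoted
above.

```r
# Rule of thumb: events needed = covariate df * events per df,
# cases needed = events / event incidence in the sample.
# `required_cases` is a made-up helper name, used for illustration only.
required_cases <- function(covariate_df, incidence, events_per_df = 20) {
  events <- covariate_df * events_per_df
  cases  <- ceiling(events / incidence)
  c(events = events, cases = cases)
}

required_cases(covariate_df = 2, incidence = 0.10)                      # high end of the range
required_cases(covariate_df = 2, incidence = 0.10, events_per_df = 10)  # low end of the range
```

With two covariate degrees of freedom and a 10% incidence, the first call
reproduces the 40 events and 400 cases worked out above.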

An interaction term (in addition to the 2 main effect terms, Y ~ X1 *  
X2) in this case would add another column to the model matrix, thus,  
you would need an additional 20 events, or another 200 cases in your  
sample.

So you could include the two two-level factors and the interaction  
term if you have 60 events, or in my example, about 600 cases.
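
If in doubt about the covariate degrees of freedom, the columns can be
counted directly with model.matrix(). A minimal sketch, assuming two
two-level factors with made-up level names; covariate_df is a hypothetical
helper, not part of base R.

```r
# One row per combination of the two illustrative factors
d <- expand.grid(X1 = factor(c("a", "b")), X2 = factor(c("c", "d")))

# Covariate columns = model matrix columns minus the intercept column.
# `covariate_df` is a made-up helper name, used for illustration only.
covariate_df <- function(formula, data) {
  ncol(model.matrix(formula, data)) - 1
}

covariate_df(~ X1 + X2, d)   # main effects only: 2
covariate_df(~ X1 * X2, d)   # main effects plus interaction: 3
```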

If you include the interaction term only, in the absence of the main
effects (Y ~ X1:X2), R builds one indicator column per cell, 4 in
all, but one of them is aliased with the intercept. The fit is then
the same as for Y ~ X1 * X2, with 3 covariate degrees of freedom,
so the requirement stays at 60 events, or about 600 cases. Without
more details (e.g., your underlying hypothesis), it is not clear to
me that you gain anything here as compared to the use of the main
effects and, potentially, the interaction term together, and you
certainly lose in terms of model interpretation.
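
The column count for the interaction-only formula can be checked the same
way. A minimal sketch, again with made-up level names; it prints the model
matrix column names and the matrix rank, the latter being the number of
coefficients that can actually be estimated.

```r
# One row per combination of the two illustrative factors
d <- expand.grid(X1 = factor(c("a", "b")), X2 = factor(c("c", "d")))

mm_int_only <- model.matrix(~ X1:X2, d)
colnames(mm_int_only)   # intercept plus one indicator column per X1-by-X2 cell
qr(mm_int_only)$rank    # number of linearly independent columns

# For comparison: main effects plus interaction
qr(model.matrix(~ X1 * X2, d))$rank
```

A rank lower than the number of columns means glm() will report NA for the
aliased coefficient.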

As for a minimum sample size for each level of the factor-based
covariates, I am not aware of any specific guidance, short of
dealing with empty cells at the extreme. However, there are methods
to assess covariate complexity and to decide whether factor levels
should be collapsed. For more details on these issues, I would
refer you to Frank Harrell's book, Regression Modeling Strategies,
specifically to chapters 4 and 10-12. The former focuses on general
multivariable strategies and the latter on logistic regression. More
information here:

   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
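
To spot empty or sparse cells of the kind mentioned above, a quick
cross-tabulation of the two factors, with and without the response, is
usually enough. A minimal sketch on simulated data; the data frame, level
names and probabilities are invented purely for illustration.

```r
set.seed(1)  # simulated data, not from the original question
n <- 120
d <- data.frame(
  X1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  X2 = factor(sample(c("c", "d"), n, replace = TRUE, prob = c(0.85, 0.15))),
  Y  = rbinom(n, 1, 0.3)
)

cell_n      <- xtabs(~ X1 + X2, data = d)    # observations in each of the 4 cells
cell_events <- xtabs(Y ~ X1 + X2, data = d)  # events (Y == 1) in each cell

cell_n
cell_events
sum(d$Y)   # total events, the quantity the rule of thumb counts
```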

HTH,

Marc Schwartz