# [R] Question about data used to fit the mixed model

Douglas Bates bates at stat.wisc.edu
Tue Aug 1 00:28:53 CEST 2006

On 7/29/06, Nantachai Kantanantha <kantanantha at hotmail.com> wrote:
> Hi everyone,
>
> I would like to ask a question regarding the data used to fit the mixed
> model.
>
> I wonder whether, for the response variable data used to fit the mixed model
> (either via "spm" or "lme"), we must have several observations per subject
> (i.e. Yij,  i = 1,...,M,  j = 1,...,ni) or it can be just one observation per
> subject (i.e. Yi,  i = 1,...,M). Since we have to specify the groups for
> random effect components, if we have only one observation per subject, then
> each group will have only one observation.

As Harold Doran mentioned in his earlier reply, if you only have one
observation in each group you can't estimate the parameters in a mixed
model because the random effect for a group is completely confounded
with the per-observation noise term for the observation.  The model
would be of the form

y = X\beta + Z b + \epsilon

for which you would estimate the variance of the components of b and
the variance of the components of \epsilon.  However, with only one
observation per group the number of components in b and in \epsilon
would be the same and, by suitably reordering the observations, the
matrix Z could be made to be an identity matrix.  Thus the model
reduces to

y = X\beta + (b + \epsilon)

and the elements of b are confounded with those of \epsilon.
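
A small base-R sketch may make the confounding concrete.  It assumes a
scalar random intercept per group with independent b and \epsilon, so
that the marginal covariance of the response is var_b * Z Z' + var_eps * I.
The helper marginal_cov and the variance values are made up for
illustration only.

```r
# Marginal covariance of the response, assuming a scalar random intercept:
# Var(y) = var_b * Z Z' + var_eps * I   (hypothetical helper)
marginal_cov <- function(Z, var_b, var_eps) {
  var_b * tcrossprod(Z) + var_eps * diag(nrow(Z))
}

M <- 5  # number of groups (illustrative)

# One observation per group: after reordering, Z is the identity matrix.
Z_single <- diag(M)
V1 <- marginal_cov(Z_single, var_b = 0.75, var_eps = 0.25)
V2 <- marginal_cov(Z_single, var_b = 0.25, var_eps = 0.75)
same_single <- isTRUE(all.equal(V1, V2))

# Two observations per group: each column of Z has two ones.
Z_pairs <- kronecker(diag(M), matrix(1, nrow = 2, ncol = 1))
W1 <- marginal_cov(Z_pairs, var_b = 0.75, var_eps = 0.25)
W2 <- marginal_cov(Z_pairs, var_b = 0.25, var_eps = 0.75)
same_pairs <- isTRUE(all.equal(W1, W2))

print(c(single = same_single, pairs = same_pairs))
```

With one observation per group only the sum var_b + var_eps enters the
covariance, so the two splits cannot be told apart.  With two
observations per group the within-group covariance equals var_b, and
the two splits give different covariance matrices.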

A different version of this question is to ask whether some of the
groups can have only a single observation while others have more than
one observation.  The answer to that is a qualified "yes".

An example of data with different numbers of observations per group is
the star data that Harold mentioned.  The "student" identifier in this
data set is named "id".  If we table the number of observations per
student then table that result we get a table of the number of
students with 1, 2, 3 or 4 observations.

> data("star", package = 'mlmRev')
> table(table(star$id))

   1    2    3    4
4314 2455 1744 3085
> length(unique(star$id))
[1] 11598
> 4314/11598
[1] 0.3719607

This shows that more than a third of the students have data from only
a single year.

It is possible to include such students in a mixed model with a random
effect for student.  It is even possible to include such students in a
mixed model with a random intercept and a random slope with respect to
time for student.  However, such students contribute very little
information to the model fit and the "estimates" (actually
"predictors") of the random effects for such students are artificially
small because they are confounded with the per-observation noise term.
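
To give a rough sense of how strongly those predictors are pulled
toward zero, here is a sketch for the simplest case only: a
random-intercept model with the variances treated as known.  In that
case the predictor of the random effect for a group with n_i
observations is the group's mean residual multiplied by
n_i * var_b / (n_i * var_b + var_eps).  The function name and the
variance values below are hypothetical, not estimates from the star data.

```r
# Shrinkage factor applied to a group's mean residual in a
# random-intercept model with known variances (hypothetical helper).
shrinkage_factor <- function(n_i, var_b, var_eps) {
  n_i * var_b / (n_i * var_b + var_eps)
}

n_obs <- 1:4  # the star data have 1 to 4 observations per student

# Illustrative assumption: equal between-student and residual variance.
shrink <- shrinkage_factor(n_obs, var_b = 1, var_eps = 1)

print(round(shrink, 3))
```

Under these assumed variances a student with a single observation has
the mean residual halved, while a student with four observations
retains 80% of it.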

So while it can be attractive when designing an experiment or
planning an observational study to have many groups and few
observations per group, such experiments or studies provide very
sparse information.  Using a mixed model on such data doesn't
magically add information to the data.  Mixed models are statistical
models, not magic.