[R] A troubled state of freedom: generalized linear models where number of parameters > number of samples

Sat Aug 21 16:44:58 CEST 2004

Good morning,

Thank you all for your help so far. I really appreciate it.

The crux of my problem is that I am fitting a generalized linear
model with 1 dependent variable, approximately 50 training samples and
100 predictors (gene expression levels).

Essentially, if I have 100 genes and 50 samples, the fit returns
coefficients for the first 49 genes, and NAs for the rest, with an
extremely low residual deviance (usually approx. 10^-27). This seems to
be a matter of degrees of freedom: as the number of genes increases
to 49, the number of residual degrees of freedom drops to 0, so no
further coefficients can be estimated.
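
As a rough illustration of the counting argument above, here is a
minimal sketch in Python (not output from an R session). It assumes the
model includes an intercept and that the gene columns are linearly
independent until the samples run out; the function name residual_df
is hypothetical and used only for this example.

```python
def residual_df(n_samples, n_genes, intercept=True):
    """Count residual degrees of freedom and inestimable coefficients.

    Assumption: the design columns (intercept plus genes) are linearly
    independent up to the number of samples, so the rank of the design
    matrix is min(n_samples, n_columns).
    """
    n_columns = n_genes + (1 if intercept else 0)
    rank = min(n_samples, n_columns)
    df_residual = n_samples - rank        # 0 means the fit is saturated
    n_inestimable = n_columns - rank      # coefficients that would be reported as NA
    return df_residual, n_inestimable


# 50 samples, 49 genes: intercept + 49 genes uses up all 50 degrees of freedom
print(residual_df(50, 49))
# 50 samples, 100 genes: the genes beyond the 49th cannot be estimated
print(residual_df(50, 100))
```

Under these assumptions a saturated fit reproduces the training data
exactly, which would explain a residual deviance that is zero up to
rounding error.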

What kind of methods can I use to make sense of this? 

I have a subsequent set of samples to work on to validate the results
of this glm, so I am not sure if overfitting is really a problem.

Background: this is a microarray study, where I have divided the
samples in the training set into 2 groups, and generated a number of
genes to differentiate between both groups. I am going to use the GLM
in a subsequent regression analysis to determine survival. For this
purpose, I need to generate some kind of score for each individual
case by multiplying each gene's coefficient by that gene's expression
level and summing over the genes.
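
The score described above can be sketched as follows, again in Python
and only as an illustration. The gene names, coefficient values and
expression values are made up, the function name linear_score is
hypothetical, and treating NA coefficients (shown here as None) as
contributing nothing to the score is an assumption, not something the
model fit guarantees is sensible.

```python
def linear_score(coefficients, expression, intercept=0.0):
    """Return intercept + sum of coefficient * expression level over genes.

    Genes whose coefficient is None (standing in for NA) are skipped;
    this is an assumption made for the sketch.
    """
    score = intercept
    for gene, coef in coefficients.items():
        if coef is None:
            continue  # coefficient could not be estimated
        score += coef * expression[gene]
    return score


# Hypothetical coefficients and one case's expression levels
coefficients = {"geneA": 0.5, "geneB": -1.0, "geneC": None}
expression = {"geneA": 2.0, "geneB": 1.5, "geneC": 7.0}

score = linear_score(coefficients, expression, intercept=0.25)
print(score)  # 0.25 + 0.5*2.0 - 1.0*1.5 = -0.25
```

Note that this score is on the scale of the linear predictor; turning
it into a probability or other response would need the model's link
function.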

I am not a statistician (but a clinician) - many apologies if I am not
expressing myself very clearly here!

Thanks. 

Min-Han Tan