[BioC] GOStat and multiple testing

Sat Aug 7 07:24:53 CEST 2004

Hi,
I have been thinking about the problem a little differently, and I am
working on a write-up. 

The first approach is, rather than testing each node independently, to
fit the whole GO tree as a logistic regression, with the response a {0,1}
indicator for each gene (i.e. expressed or not) and the predictors the
terms of the whole GO tree. Obviously that alone is not the solution, but
if we treat this as a spatial problem (the DAG being the spatial join
matrix) we can create a set of difference penalties on the coefficients
that are dictated by the DAG structure. Then we can do inference on the
betas to see which terms are influential. One would fit a separate model
for each category: Cellular Component, Biological Process, Molecular
Function.
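
To make the penalty idea concrete, here is a rough sketch in Python/NumPy.
It is only an illustration under my own assumptions, not the model from the
write-up: the function name, the toy DAG, the simulated annotation matrix,
the squared-difference form of the penalty and the small ridge term are all
made up for the example. Each parent-child edge in the DAG contributes a
penalty lam * (beta_child - beta_parent)^2, and the fit uses plain Newton
steps.

```python
import numpy as np

def fit_dag_penalized_logistic(X, y, edges, lam=1.0, n_iter=50):
    """Logistic regression with squared-difference penalties along DAG edges.

    X     : (n_genes, n_terms) 0/1 matrix, X[i, j] = 1 if gene i has term j
    y     : (n_genes,) 0/1 response (e.g. expressed or not)
    edges : list of (parent, child) column-index pairs taken from the DAG
    lam   : weight of the difference penalty (hypothetical tuning parameter)
    """
    n, p = X.shape
    # One row per edge, so that D @ beta = beta_child - beta_parent.
    D = np.zeros((len(edges), p))
    for k, (parent, child) in enumerate(edges):
        D[k, parent] = -1.0
        D[k, child] = 1.0
    # Penalty matrix; the intercept (index 0) is left unpenalized and a tiny
    # ridge keeps the Newton system well conditioned.
    P = np.zeros((p + 1, p + 1))
    P[1:, 1:] = lam * (D.T @ D) + 1e-6 * np.eye(p)
    Xd = np.column_stack([np.ones(n), X])
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        w = mu * (1.0 - mu)
        grad = Xd.T @ (y - mu) - P @ beta        # penalized score
        hess = Xd.T @ (Xd * w[:, None]) + P      # penalized information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta[0], beta[1:]

# Toy data (all simulated): term 0 is the parent of terms 1 and 2,
# and term 1 is the parent of term 3.
rng = np.random.default_rng(0)
edges = [(0, 1), (0, 2), (1, 3)]
X = (rng.random((400, 4)) < 0.3).astype(float)
# Propagate annotations upward, deepest edges first, so that a gene
# annotated to a child term is also annotated to its parent.
for parent, child in reversed(edges):
    X[:, parent] = np.maximum(X[:, parent], X[:, child])
true_beta = np.array([0.0, 1.0, -0.5, 1.2])
prob = 1.0 / (1.0 + np.exp(-(-0.5 + X @ true_beta)))
y = (rng.random(400) < prob).astype(float)

intercept, beta_hat = fit_dag_penalized_logistic(X, y, edges, lam=1.0)
print(intercept, beta_hat)
```

A larger lam pulls the coefficients of neighbouring terms toward each
other, which is the smoothing-over-the-DAG behaviour I am after; how to
choose lam and how to do inference on the penalized betas is left open
here.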

The second approach, which I haven't developed as far, is to think of
each gene as starting at the root and to look at its survival along
paths in the DAG, so that the aggregate is some sort of branching
process. I don't know yet if this model is useful.

Also inherent in the analysis is a potentially huge bias due to
unannotated genes. I was thinking of approaching this using a
mark-recapture approach on the terms, much like the stochastic
abundance models used in ecology to predict the number of species in
a community. We can come up with a bias correction if we have a term
abundance distribution for the GO classes.
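
As an example of the kind of ecological estimator I have in mind, here is
a small Python sketch of the bias-corrected Chao1 richness estimator. I am
using Chao1 only as a familiar stand-in; whether it is the right estimator
for GO terms is an open question. The input is assumed to be a list giving,
for each observed term, the number of genes annotated to it, and the counts
in the example are made up.

```python
def chao1(abundances):
    """Bias-corrected Chao1 estimate of the total number of classes.

    abundances : iterable of non-negative counts, one per class
                 (here, hypothetically, genes annotated per GO term)
    """
    observed = [a for a in abundances if a > 0]
    s_obs = len(observed)                         # classes seen at least once
    f1 = sum(1 for a in observed if a == 1)       # singletons
    f2 = sum(1 for a in observed if a == 2)       # doubletons
    # S_est = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))

# Made-up term abundances: 7 observed terms, 3 singletons, 2 doubletons.
print(chao1([1, 1, 1, 2, 2, 5, 10]))  # prints 8.0
```

The gap between the estimate and the observed count (here one extra term)
is the sort of quantity a bias correction could be built on.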

The logistic model is something I am actively working on; the rest are
half-baked thoughts I have been toying with and haven't had time to
chase too far.

Nicholas