[BioC] limma eBayes: how to determine goodness of fit?

Tue Jun 5 20:54:59 CEST 2007

Hi Mark,

Thanks very much for your reply of a couple of days ago.  You kindly ask:

  > Can you give an example of what you mean here?  I'm not sure what "Very
  > nicely modeled by different coefficients" and "variety of model formulae
  > based on biological intuition" are really referring to.  Do you have a
  > designed experiment where you wish to look for differences between, say
  > treatments A, B, C, etc.?

In brief:  with lm (which I turned to at Gordon's suggestion) I can trawl my
data for specific expression patterns, using different model formulae and
stringent goodness-of-fit measures.  But I am stumped when I try to do
the same with limma.  I'd like to benefit from the refinements offered by limma,
eBayes, in particular.

Here's the longer story.

The experiment is a direct 2-color design.  There are 9 reference samples (call them
J1-J9) and 3 'signal' samples (M1-M3).  A brief description of the actual study may be
helpful:

  Malaria parasites exhibit a 'binding phenotype' when infecting primigravida
  (women pregnant for the first time).  The parasites attach to CSA exposed
  in the placental lining and cause lots of harm to mother and fetus.  We hope to
  elucidate the mechanisms of that binding by finding genes strongly up-regulated in
  primigravida.  So our basic question is: what genes behave markedly
  differently in primigravid parasites when compared to juvenile parasites?

The 28 microarray slides consist of 14 dye-swap pairs, for a total of 28 slides:

    M      J
    ---  ----------
    1    1 2 3 4 5 
    2    1 3 5 6 7 9
    3    4 7 8

My simplest linear model treats every M/J comparison identically.  And in this simple
comparison we found two genes (call them gene1 and gene2), each with a logFC ratio ~= 6,
and with very small residuals.  A great signal: these genes were uniformly up-regulated in
all comparisons.

I then examined genes a little further down in the limma topTable results.  I found a
gene3 whose statistics weren't bad (adj.P.Val=e-7, B=7.9), but a plot showed a very
interesting pattern: like gene1 and gene2, in had a great signal in all comparisons except
those involving M3.  This doesn't seem like random variation; this seems like another gene
which might be contributing (in 2 primigravida out of 3) to the binding phenotype.

So here's where my biological intuition come in: multiple mechanisms may be involved in
CSA binding.  gene1 and gene2 are perhaps always involved.  Parasites from M3 do not seem
to require gene3 to bind to the placenta, but maybe M1 and M2 do.  My evidence so far is
scant, but I wish to explore the data to test the hypothesis.

I followed Gordon suggestion and used bare-bones lm () to model each row in my MA object.
I was happy to see that I found small residuals, and a high R-squared when I modeled gene3
like this:

  gene3.lm <- lm (ratio ~ 1 + M3)  

My simpler model for gene1 and gene2 would be simply

  gene12.lm <- lm (ratio ~ 1)

Now I want to explore all the genes in the 28 comparisons to find (possibly) other
striking and biologically suggestive behavior.  Perhaps there are good R-squared values
for

    lm (ratio ~ 1 + M1)   or 
    lm (ratio ~ 1 + M2)

which might (if we're lucky) reveal other pieces of the multiple, conditional
contributors to the binding phenotype.

I'd like to do this sort of exploring in limma itself, rather than stick with what is, for
me, the more transparent but less rich lm.  I sense that a judicious use of makeContrasts,
contrast.fit, and decideTests may allow me to do this kind of exploration.  But I pore
over the manual, experiment for hours, and (it's embarrassing, but true) end up confused
and feeling rather a dunce.

Got any advice?  

Cheers!

 - Paul