[BioC] Re: "validity" of p-values

Gordon Smyth smyth at wehi.edu.au
Mon Sep 29 14:03:13 MEST 2003

At 06:27 AM 29/09/2003, Jenny Drnevich wrote:
>See below...
> >>However, have you seen: Chu, Weir, & Wolfinger.  A systematic
> >> statistical linear modeling approach to oligonucleotide array
> >> experiments MATH BIOSCI 176 (1): 35-51 Sp. Iss. SI MAR 2002
> >>They advocate using the probe-level data in a linear mixed model.
> >> Assuming that each probe is an independent measure (which I know is not
> >> true because many of them overlap, but I'm ignoring this for now),
> >> using probe-level data gives 14-20 "replicates" per chip. We've based
> >> our analysis methods on this, and with two biological replicates per
> >> genetic line, and three genetic lines per phenotypic group, we've been
> >> able to detect as little as a 15% difference in gene expression at
> >> p=0.0001 (we only expect 2 FP and get 60 genes with p=0.0001).
> >
> > Mmmm. Getting very low p-values from just two biological replicates
> > doesn't  lead you to question the validity of the p-values?? :)
>But we don't just have two biological replicates. We're interested in
>consistent gene expression differences between phenotype 1 and phenotype
>2. We looked at three different genetic lines showing phenotype 1 and
>three other lines that had phenotype 2.

If I understand correctly, you have 6 arrays on each phenotype, all 
biologically independent.

>  We made two biological replicates
>of each line, and the expression level of each gene was estimated by 14
>probes. By running a mixed-model ANOVA separately for each gene with
>phenotype, line (nested within phenotype), probe, and all second-order
>interactions, the phenotype comparison has around 120 df (or so, off the
>top of my head).

There are only 2 phenotypes, so the phenotype comparison has 1 df. I think 
what you mean is that you have something like 120 df for estimating the 
variability of repeated measurements at the probe level. But this isn't the 
most important variance component for comparing phenotypes. Your model, if 
I understand it, neglects any variance component at the array level even 
though your treatments (the phenotypes) are applied at the array level. You 
are in a way treating the probes as if they were separate arrays, and one 
doesn't have to be a mathematical statistician to question the validity of 
the resulting p-values.

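To make the pseudo-replication point concrete, here is a small simulation of my own (not your actual analysis; the variance components are assumed for illustration). When the array-level variance dominates, pooling the 14 probes per array as if they were independent replicates gives a false positive rate far above the nominal level, while a test on one summary value per array holds its level:

```python
# Sketch (illustrative simulation, assumed variance components): probe-level
# values from the same array are correlated, so treating them as independent
# replicates inflates significance when treatments are applied per array.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, arrays_per_group, probes = 500, 6, 14
array_sd, probe_sd = 1.0, 0.3  # assumed: array-level variance dominates

naive_hits = array_hits = 0
for _ in range(n_sims):
    # No true difference between the two phenotype groups.
    def group():
        arr_eff = rng.normal(0, array_sd, arrays_per_group)
        return arr_eff[:, None] + rng.normal(
            0, probe_sd, (arrays_per_group, probes))
    g1, g2 = group(), group()
    # Naive: pool all 6*14 probe values as independent replicates.
    if stats.ttest_ind(g1.ravel(), g2.ravel()).pvalue < 0.05:
        naive_hits += 1
    # Array-level: test on one mean per array (6 vs 6).
    if stats.ttest_ind(g1.mean(axis=1), g2.mean(axis=1)).pvalue < 0.05:
        array_hits += 1

naive_fpr = naive_hits / n_sims   # far above the nominal 0.05
array_fpr = array_hits / n_sims   # close to the nominal 0.05
print(f"naive probe-pooled FPR: {naive_fpr:.2f}")
print(f"array-level FPR:        {array_fpr:.2f}")
```

The naive test's t-statistic uses the small probe-level noise as its error estimate while the group means actually vary with the much larger array-level component, which is exactly why its p-values are too small.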
>  That's how we can detect a 15% difference in gene
>expression. As long as the statistical model is set up correctly, I never
>"question" the validity of p-values, although I might question the
>biological significance... :)

You should! A famous and true saying (due to George Box) goes "All 
statistical models are wrong, but some are useful." It is incumbent on you 
to understand how the 
assumptions of your statistical model relate to reality and how sensitive 
your conclusions are to these assumptions.

There are actually deep reasons why, in my opinion, none of the statistical 
methods for small numbers of arrays can produce p-values which are 
believable in an absolute sense (and this includes my own methods in the 
limma package).

The real test would be to try out your method on some data sets where the 
answers are known, for example to apply the methods to some replicate arrays 
hybridized with RNA from the same source. My guess is that the method would 
detect a lot of spurious differential expression.
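That null check can be sketched as follows, again on simulated stand-in data (since I don't have replicate arrays to hand, and the variance components are assumed): all arrays share one RNA source, so every gene called significant is a false positive, and the count can be compared with the nominal expectation.

```python
# Sketch of the suggested null check on simulated stand-in data: all arrays
# come from one RNA source, so any "differential expression" is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, arrays_per_group, probes = 2000, 6, 14
alpha = 0.001
array_sd, probe_sd = 1.0, 0.3  # assumed variance components

detected = 0
for _ in range(n_genes):
    def arrays():
        eff = rng.normal(0, array_sd, arrays_per_group)
        return eff[:, None] + rng.normal(
            0, probe_sd, (arrays_per_group, probes))
    # Arbitrary split of identical-source arrays into two pseudo-groups.
    g1, g2 = arrays(), arrays()
    # Naive probe-pooled test, as in the criticized analysis.
    if stats.ttest_ind(g1.ravel(), g2.ravel()).pvalue < alpha:
        detected += 1

expected = n_genes * alpha  # ~2 false positives expected at alpha = 0.001
print(f"expected ~{expected:.0f} false positives, detected {detected}")
```

If the detected count greatly exceeds the expectation on data with no real differences, the method's p-values cannot be trusted in an absolute sense on the real experiment either.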
