[BioC] Expected number of DE genes?

Wed Jul 16 08:12:02 CEST 2014

Dear Jessica

On 16/07/14 02:15, Jessica Perry Hekman wrote:
> Thanks, Tom. Yes, you summarized my dilemma well, although I am more
> concerned with false negatives right now than false positives (as we do
> intend to do PCR to validate any positives we get, but any false
> negatives lost are lost forever :).

I wonder whether you might have fallen for a fundamental but quite 
common misunderstanding here, because false positives and false 
negatives are not treated equally in a hypothesis test.

In both edgeR and DESeq, you choose a false discovery rate (FDR); in the 
examples of the vignette, we use 10%, but this is by no means the only 
useful value. This means that you ask DESeq2 to give you a list of genes 
that are differentially expressed and that this list should, on average, 
not contain more than 10% false positives, and that you are willing to 
accept as many false negatives as it takes to ensure that.
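
To make the asymmetry concrete, here is a minimal sketch of 
Benjamini-Hochberg adjustment, which to my knowledge is the default 
multiple-testing adjustment in both packages. It is plain Python, not 
the packages' own code; the function names bh_adjust and 
call_significant are made up for this illustration.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in input order."""
    m = len(pvalues)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_significant(pvalues, fdr=0.1):
    """Flag genes whose adjusted p value is at or below the chosen FDR."""
    return [q <= fdr for q in bh_adjust(pvalues)]
```

Note that the threshold only bounds the expected share of false 
positives among the genes that are called. Nothing in the procedure 
limits how many truly affected genes end up uncalled.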

More succinctly: If a gene is not called significant, this does not mean 
that the algorithm thinks that it is not differentially expressed but 
merely that it cannot say whether it is.

One other important issue is: What does "significantly differentially 
expressed" actually mean? In biological systems, all components are so 
highly interconnected that it seems implausible to think that there are 
any genes which are not at all affected by your treatment, not even 
slightly. I would argue that, in typical experiments, most if not all 
genes change their expression strength at least a tiny bit in reaction 
to treatment. The question is whether the difference that you observe 
between the mean expression in treatment and control samples is driven 
by this reaction to treatment, or whether it is mainly driven by random 
fluctuations, i.e., by those differences that you also see when 
comparing samples treated the same way (replicates). When the random 
noise has the stronger effect, then the observed difference (log fold 
change) will be in a random direction and may or may not be in the 
direction that the treatment has affected the gene.
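
A small simulation may help to illustrate this point. The sketch below 
is a toy model under loudly stated assumptions: log expression values 
are drawn from a normal distribution with the same noise level in both 
groups, and the function name fraction_sign_correct and all parameter 
values are invented for the illustration. It is not how DESeq2 or edgeR 
model count data.

```python
import random


def fraction_sign_correct(true_lfc, noise_sd, n_reps, n_sim=20000, seed=1):
    """Estimate how often the observed difference in group means
    has the same sign as the true (non-zero) log fold change."""
    rng = random.Random(seed)  # fixed seed, so the result is reproducible
    correct = 0
    for _ in range(n_sim):
        # Toy assumption: normally distributed log expression values.
        control = [rng.gauss(0.0, noise_sd) for _ in range(n_reps)]
        treated = [rng.gauss(true_lfc, noise_sd) for _ in range(n_reps)]
        observed = sum(treated) / n_reps - sum(control) / n_reps
        if (observed > 0) == (true_lfc > 0):
            correct += 1
    return correct / n_sim


# A weak effect with three replicates: the sign is close to a coin flip.
weak = fraction_sign_correct(true_lfc=0.01, noise_sd=1.0, n_reps=3)
# A strong effect with the same noise: the sign is almost always right.
strong = fraction_sign_correct(true_lfc=3.0, noise_sd=1.0, n_reps=3)
```

In both cases the gene is truly affected by the treatment. Only the 
size of the effect relative to the replicate noise decides whether the 
observed direction can be trusted.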

Hence, my (somewhat personal) opinion on what a significant p value 
means in DE analysis, namely: We got the sign right.

A significant call means that we can have confidence in the observed 
direction of the change. The effect of treatment on this gene was strong 
enough that we can say with confidence whether the gene reacted with up- 
or with down-regulation.

Hence, if you see fewer DE genes than expected, this means: The effect of 
your treatment was too weak to be seen against the noise from random 
sample-to-sample variation (or equivalently: the variation within 
treatment groups was too strong and drowned the treatment signal). It 
does not mean that there was no effect.

To judge whether your results are typical, you would need to tell us 
more about your experiment.

   Simon