[BioC] Using linear models to find differential gene expression (NGS)

Wed Sep 1 16:47:32 CEST 2010

Hi Johnny

On 09/01/2010 02:25 PM, Johnny H wrote:
> I have found some R/Bioconductor/Genominator code on the web (below) and it
> measures differential expression of RNA-seq short read data using a general
> linear model.

The code you quote uses a GLM of the Poisson family to analyse RNA-Seq. 
Please note that this is not appropriate because it cannot and does not 
assess biological variation!

As this might contradict what other people say here, I demonstrate:

Assume you have two conditions ("A" and "B") with two biological 
replicates each. For this example, we further assume that the library 
sizes are, due to some magic, the same in all four samples, so that we 
don't have to worry about normalization.

Let's say we have, for a given gene, these counts:

 > y <- c( A1=180, A2=200, B1=240, B2=260 )
 > group <- factor( c( "A", "A", "B", "B" ) )

Let's see the within group mean and standard deviation:

 > tapply( y, group, mean )
   A   B
190 250
 > tapply( y, group, sd )
        A        B
14.14214 14.14214

 From condition "A" to "B", the counts raise from an average of 190 to 
an average of 250, with a within-group SD of below 15. This should be 
significant, and it is:

 > fit0 <- glm( y ~ 1, family=poisson() )
 > fit1 <- glm( y ~ group, family=poisson() )

 > anova( fit0, fit1, test="Chisq" )
Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ group
   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         3    18.2681
2         2     1.8533  1   16.415 5.089e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, for comparison, look at this data:

 > y2 <- c( A1=100, A2=280, B1=150, B2=350 )
 > group <- factor( c( "A", "A", "B", "B" ) )

The means are the same, but the SD is much larger:

 > tapply( y2, group, mean )
   A   B
190 250
 > tapply( y2, group, sd )
        A        B
127.2792 141.4214

An increase by 60 (from 190 to 250) should not be significant, if the 
standard deviation within group is more than twice as much.

But as a Poisson regression does not care about this, we get a very low 
p value:

 > fit0 <- glm( y2 ~ 1, family=poisson() )
 > fit1 <- glm( y2 ~ group, family=poisson() )
 > anova( fit0, fit1, test="Chisq" )
Analysis of Deviance Table

Model 1: y2 ~ 1
Model 2: y2 ~ group
   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         3     187.48
2         2     171.06  1   16.415 5.089e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The proper thing to do is to use a family that allows for 
overdispersion, typically the negative binomial family, not the Poisson 
family. For a detailed discussion of this issue, see our preprint: 
http://precedings.nature.com/documents/4282/

For a solution of your problem that works for simple comparisons (one 
condition against another one), see the our "DESeq" package (or Robinson 
et al.'s "edgeR" package).

If you need to do something more sophisticated, like a two-way anova: we 
are working on it.

   Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de