[BioC] Using linear models to find differential gene expression (NGS)

Simon Anders anders at embl.de
Wed Sep 1 16:47:32 CEST 2010

Hi Johnny

On 09/01/2010 02:25 PM, Johnny H wrote:
> I have found some R/Bioconductor/Genominator code on the web (below) and it
> measures differential expression of RNA-seq short read data using a general
> linear model.

The code you quote uses a GLM of the Poisson family to analyse RNA-Seq.
Please note that this is not appropriate because it cannot and does not
assess biological variation!

As this might contradict what other people say here, I demonstrate:

Assume you have two conditions ("A" and "B") with two biological
replicates each. For this example, we further assume that the library
sizes are, due to some magic, the same in all four samples, so that we
don't have to worry about normalization.

Let's say we have, for a given gene, these counts:

> y <- c( A1=180, A2=200, B1=240, B2=260 )
> group <- factor( c( "A", "A", "B", "B" ) )

Let's see the within group mean and standard deviation:

> tapply( y, group, mean )
A   B
190 250
> tapply( y, group, sd )
A        B
14.14214 14.14214

From condition "A" to "B", the counts raise from an average of 190 to
an average of 250, with a within-group SD of below 15. This should be
significant, and it is:

> fit0 <- glm( y ~ 1, family=poisson() )
> fit1 <- glm( y ~ group, family=poisson() )

> anova( fit0, fit1, test="Chisq" )
Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ group
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         3    18.2681
2         2     1.8533  1   16.415 5.089e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, for comparison, look at this data:

> y2 <- c( A1=100, A2=280, B1=150, B2=350 )
> group <- factor( c( "A", "A", "B", "B" ) )

The means are the same, but the SD is much larger:

> tapply( y2, group, mean )
A   B
190 250
> tapply( y2, group, sd )
A        B
127.2792 141.4214

An increase by 60 (from 190 to 250) should not be significant, if the
standard deviation within group is more than twice as much.

p value:

> fit0 <- glm( y2 ~ 1, family=poisson() )
> fit1 <- glm( y2 ~ group, family=poisson() )
> anova( fit0, fit1, test="Chisq" )
Analysis of Deviance Table

Model 1: y2 ~ 1
Model 2: y2 ~ group
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         3     187.48
2         2     171.06  1   16.415 5.089e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The proper thing to do is to use a family that allows for
overdispersion, typically the negative binomial family, not the Poisson
family. For a detailed discussion of this issue, see our preprint:
http://precedings.nature.com/documents/4282/

For a solution of your problem that works for simple comparisons (one
condition against another one), see the our "DESeq" package (or Robinson
et al.'s "edgeR" package).

If you need to do something more sophisticated, like a two-way anova: we
are working on it.

Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de