[BioC] Using linear models to find differential gene expression (NGS)

Wed Sep 1 14:58:45 CEST 2010

This code is from the paper

Bullard et al. Evaluation of statistical methods for normalization and
differential expression in mRNA-Seq experiments. BMC Bioinformatics
(2010) vol. 11 (1) pp. 94

which extensively discusses why to use a GLM and why to include an
offset.  Most of the questions you have below are essentially basic
statistics questions and are not - in my opinion - very suitable for a
discussing on an email list.  I suggest you find a local statistician
or someone who is familiar with GLMs in order to explain this (they
don't need to know anything about RNA-Seq).  But inline below there
are a few quick answers.

On Wed, Sep 1, 2010 at 8:25 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> Hi.
> I have found some R/Bioconductor/Genominator code on the web (below) and it
> measures differential expression of RNA-seq short read data using a general
> linear model.
>
> Can someone explain some basic questions I have?
>
> 1) What is the reason for using 2 glm's for measuring differential
> expression?

You (always) compare two models in statistical test (whether or not it
is explicit).  One model says there is a group specific effect and the
other model says there is no difference between the two models.

> 2) In the function(y) there are two linear models ran; one with argument y ~
> groups and the other with argument y ~ 1. Why do this?

See 1.

> 3) What does the offset do?

It controls for the sequencing effort, more or less like dividing with
the total number of mapped reads when computing RPKM.  But if you just
straight up divide, you don't have integer values anymore.

> 4) Why use ANOVA; is to compare the linear models?

In this case it computes a likelihood ratio test statistic between the
two models and computes asymptotic p-values based on a Chisq
approximation.

> 5) What can we say about results, if adjusted for multiple testing; how
> would you explain a significant result?

Usually people want to control the FDR.  You can do that using for
example rawp2adj2 from the multtest package.  You need the full vector
of p-values.  Note that it has been shown that this poisson test is
terrible for biological replicates (or at least terrible in the sense
of giving bad p-values).

> 6) Would an adjusted p-value of <= 0.05 be significant?

See 5), but that would generally be the aim of using adjusted
p-values.  However, adjustment procedures assume you have valid raw
p-values before adjusting.

>
> Basically, I don't know much about the statistics done below and any advice
> or pointers to good literature for this would be a great help. Thank you.
>
>> head(geneCountsUI)
>        mut_1_f mut_2_f wt_1_f wt_2_f
> YAL069W       0       0      0      0
> YBL049W      19      18     10      4
>
> # Normalisation of RNA-seq lanes
> notZero <- which(rowSums(geneCountsUI) != 0)
> upper.quartiles <- apply(geneCountsUI[notZero, ], 2, function(x) quantile(x,
> 0.75))
> uq.scaled <- upper.quartiles/sum(upper.quartiles) * sum(laneCounts)
>
> # Calculating differential expression
> groups <- factor(rep(c("mut", "wt"), times = c(2, 2)))
>
> pvalues <- apply(geneCountsUI[notZero, ], 1,
>  function(y) {
>    fit <- glm(y ~ groups, family = poisson(), offset = log(uq.scaled))
>    fit0 <- glm(y ~ 1, family = poisson(), offset = log(uq.scaled))
>    anova(fit0, fit, test = "Chisq")[2, 5]
> })
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>