[BioC] finding a very large number of false positives using edgeR

Thu Jan 16 18:39:35 CET 2014

Hi,

Comments inline:

On Wed, Jan 15, 2014 at 4:58 PM, Blum, Charles <CBlum at mednet.ucla.edu> wrote:
> Hi,
>
> I did run the same analysis using edgeR (glm), edgeR (as below) and  DESeq. All had very similar results.

OK, since the DESeq results are apparently concordant with your edgeR
results, this means we can likely assume that the answer to your "Is
there any known problems with the newest version of edgeR?" would be
no ;-)

> Yes, I am simply testing between biological replicates with the exact same treatment only from 2 different Illumina runs.
>
> This  simple example edgeR code also gave similar results:
>
>> group
> S180_Total_30_r1 S180_Total_30_r2 S180_Total_30_r4  S437_Total_30_1  S437_Total_30_2
>             S180             S180             S180             S437             S437
>  S437_Total_30_3
>             S437
> Levels: S180 S437
>
> y <- DGEList(counts=A, group=group, genes=genes)
>  y <- calcNormFactors(y)
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> fit <- exactTest(y)

Sorry -- I'm still having a problem understanding the association of
samples to treatment / Illumina run. Which samples are the controls?
Which are the treated? Which belong to which "run"?

It'd be more helpful if you create a data.frame that has as many rows
as samples with, at least, the following columns, and show us that:

* run: This indicates which "Illumina run" the sample is from and
would be a factor with two levels: "run1" and "run2"

* treatment: A factor with two levels: "WT" and "treatment" indicating
the grouping.
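
Something along these lines is what I have in mind. This is only a
sketch: the sample names and the run/treatment assignments below are
placeholders I made up, since I don't know your actual layout.

```r
# Hypothetical sample sheet: the names and the run/treatment
# assignments are placeholders, not the real experimental layout.
samples <- data.frame(
  sample    = paste0("sample", 1:6),
  run       = factor(c("run1", "run1", "run1", "run2", "run2", "run2")),
  treatment = factor(c("WT", "WT", "treatment", "WT", "treatment", "treatment"),
                     levels = c("WT", "treatment"))
)
print(samples)

# Cross-tabulate to see at a glance whether run and treatment are confounded
print(table(run = samples$run, treatment = samples$treatment))
```

The cross-tabulation at the end is worth showing too: if every treated
sample sits in one run and every control in the other, no model can
separate the two effects.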

Since you are in the edgeR universe, I'd then see how these
samples "cluster" together via an MDS plot (`plotMDS`). There are
several examples of how to use it in the edgeR User's Guide. However,
the example in the RNA-seq case study in the limma User's Guide
(the voom section, around page 117) might be more informative, as it
shows you how to color and label each point differently -- if I were
you I'd label each point with the "run" factor, and color by treatment.

If the points are clustering together by `run` instead of `treatment`,
then you have found the problem: a batch effect between the runs.
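
A rough sketch of that plot, labelling by run and coloring by
treatment. The counts here are simulated and the `run` / `treatment`
vectors are made-up placeholders, so substitute your own `DGEList`
and factors:

```r
library(edgeR)

set.seed(1)
# Simulated counts stand in for the real count matrix (assumption)
counts <- matrix(rnbinom(1000 * 6, mu = 50, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 paste0("sample", 1:6)))
# Hypothetical run/treatment assignments
run       <- factor(c("run1", "run1", "run1", "run2", "run2", "run2"))
treatment <- factor(c("WT", "WT", "treatment", "WT", "treatment", "treatment"),
                    levels = c("WT", "treatment"))

y <- DGEList(counts = counts, group = treatment)
y <- calcNormFactors(y)

# One color per treatment level; each point is labelled with its run
cols <- c(WT = "blue", treatment = "red")
point_cols <- unname(cols[as.character(treatment)])
plotMDS(y, labels = as.character(run), col = point_cols)
legend("topleft", legend = names(cols), text.col = cols, bty = "n")
```

With real data, labels of the same run sitting together regardless of
color would point to a run effect.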

You could, in principle, use something like sva/ComBat to remove this
batch effect:
http://bioconductor.org/packages/release/bioc/html/sva.html

However, there are some sample size considerations for it to be
used reliably:
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053098.html

As Gordon points out in that same thread I just linked to, the best
you might be able to do is just adjust for batch (`run`) in the linear
model. In fact, section 4.5 (RNA-Seq of pathogen inoculated
Arabidopsis with batch effects) of the edgeR User's Guide shows you
exactly how you can proceed.
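
For orientation, adjusting for `run` in the model could look roughly
like the following. This is my own minimal sketch, not a copy of the
guide's example: the counts are simulated, the `run` / `treatment`
assignments are hypothetical, and it only works when treatment is not
completely confounded with run (otherwise the design matrix is not of
full rank).

```r
library(edgeR)

set.seed(1)
# Simulated counts stand in for the real count matrix (assumption)
counts <- matrix(rnbinom(1000 * 6, mu = 50, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 paste0("sample", 1:6)))
# Hypothetical assignments: both treatments appear in both runs
run       <- factor(c("run1", "run1", "run1", "run2", "run2", "run2"))
treatment <- factor(c("WT", "WT", "treatment", "WT", "treatment", "treatment"),
                    levels = c("WT", "treatment"))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)

# Additive model: the run term absorbs the batch effect
design <- model.matrix(~ run + treatment)

y   <- estimateDisp(y, design)
fit <- glmFit(y, design)

# The last column of the design is the treatment effect, adjusted for run
lrt <- glmLRT(fit, coef = ncol(design))
print(topTags(lrt))
```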

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech