[BioC] finding a very large number of false positives using edgeR

Steve Lianoglou lianoglou.steve at gene.com
Thu Jan 16 00:59:16 CET 2014


Hi,

On Wed, Jan 15, 2014 at 3:07 PM, Blum, Charles <CBlum at mednet.ucla.edu> wrote:
> Hi,
>
> I am running edgeR on 6 RNAseq samples that were generated using the exact same protocol but are from different Illumina project runs.
> In theory, no genes should be differentially expressed. Nevertheless, edgeR identifies almost 7,000 genes as DE at an FDR of 0.1. This is very puzzling.
>
> I ran edgeR using the classic approach (exactTest) and the glm approach.
>
> To get an idea of sequencing depth:
> Sample                Total unique annotated read counts
> Project1_sample1      41,440,190
> Project1_sample2      26,429,859
> Project1_sample3      29,655,944
> Project2_sample1      25,423,167
> Project2_sample2      30,914,059
> Project2_sample3      35,41,714
>
> Could it be due to the variability in sequencing depth between projects?

Shouldn't be such a big issue -- even the differences in library size
you see here are not very large.

> Could there be anything else in the data or analysis that could violate any assumptions made by edgeR?
> Are there any known problems with the newest version of edgeR?

My guess would be "no" -- you could, of course, try the same analysis
with limma::voom or DESeq2 to see whether they agree.
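
If it helps, a minimal voom cross-check might look something like the
sketch below -- `counts` is a placeholder for your gene-by-sample count
matrix, and I'm assuming the two projects are the groups being compared:

  ## Sketch of a limma::voom cross-check (placeholder object names)
  library(limma)
  library(edgeR)

  group  <- factor(rep(c("Project1", "Project2"), each = 3))
  design <- model.matrix(~ group)

  dge <- DGEList(counts = counts, group = group)  # `counts`: your count matrix
  dge <- calcNormFactors(dge)

  v   <- voom(dge, design)
  fit <- eBayes(lmFit(v, design))
  topTable(fit, coef = 2, number = 20)            # compare with your edgeR hits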

Anyway, could you show us the code you used to do the analysis -- the
design matrix would be of particular interest, along with the
coefs/contrasts you are testing, but the whole (relevant) code would
be good (i.e. from DGEList -> dispersion estimation -> design matrix
setup -> the various *fit + *table functions).
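
For reference, the kind of run I have in mind looks roughly like this
(a sketch only -- `counts` is again a placeholder, and I'm guessing at
the grouping):

  ## Sketch of a typical edgeR GLM workflow (placeholder object names)
  library(edgeR)

  group  <- factor(rep(c("Project1", "Project2"), each = 3))
  design <- model.matrix(~ group)

  y <- DGEList(counts = counts, group = group)
  y <- calcNormFactors(y)                 # TMM normalization
  y <- estimateGLMCommonDisp(y, design)
  y <- estimateGLMTrendedDisp(y, design)
  y <- estimateGLMTagwiseDisp(y, design)

  fit <- glmFit(y, design)
  lrt <- glmLRT(fit, coef = 2)            # Project2 vs Project1
  topTags(lrt)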

Are you simply testing differential expression between the replicates
of Project1 and those of Project2? Presumably your issue is that these
are libraries sequenced from what you expect to be the same type of
sample/tissue/cell-line/whatever?

Perhaps encoding the "batch" (projectID) as another covariate in
your design could help mitigate these issues, but I'm not sure what
samples you're testing against what, so I can't say anything for sure.
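
If there is some other factor of interest besides the project itself,
that could look roughly like the sketch below; `treatment` here is a
purely hypothetical second factor, and if project is the only grouping
you have, batch and group are completely confounded, so a batch term
won't rescue the comparison:

  ## Sketch of a design with projectID as a batch covariate.
  ## `treatment` is hypothetical; adjust to your actual factor of interest.
  project   <- factor(c("P1", "P1", "P1", "P2", "P2", "P2"))
  treatment <- factor(c("ctl", "trt", "ctl", "trt", "ctl", "trt"))  # hypothetical
  design    <- model.matrix(~ project + treatment)

  y   <- estimateDisp(y, design)          # re-estimate dispersions for this design
  fit <- glmFit(y, design)
  lrt <- glmLRT(fit, coef = "treatmenttrt")
  topTags(lrt)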

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech


