[BioC] edgeR: very low p-value and very high variance within the group of replicates

Wed Jun 19 20:33:06 CEST 2013

Hi Valentina,

We have seen a few instances like this. Despite the high variance of this gene, it's dispersion is "shrunk" towards the dispersion for genes of similar average abundance.  This makes it a bit sensitive to outliers, but in general this sharing of information has positive effects on the power.

A few thoughts/suggestions:

1. Does a plotMDS(y) suggest that 'sample2_2' is much different than the other replicates?  The two genes you show appear to be quite different for that sample.

2. I would suggest upgrading to R-3.0.x and edgeR 3.2.x; I believe some defaults have changed; in particular, the prior.df parameter (that controls the degree of moderation) in estimateTagwiseDisp() has been decreased, which will dampen the effect of these "outlier" observations somewhat.  You may wish to lower prior.df even further.

3. A PhD student of mine has a promising solution for this that offers a good intermediate between resistance to outliers and power, but it's not quite ready for prime time.  Stay tuned, it should be ready soon.

4. Consider adding estimateTrendedDisp() to your pipeline, like:

y <- estimateCommonDisp(y)
y <- estimateTrendedDisp(y)
y <- estimateTagwiseDisp(y)

Best regards, Mark

----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://tiny.cc/mrobin

On 19.06.2013, at 15:50, valentina [guest] <guest at bioconductor.org> wrote:

> 
> I have a question about you edgeR, in order  to fix a problem that may be very simple for you.
> 
> I'm using `edgeR`  to perform differential expression analysis from RNA-seq experiment.
> 
> I have 6 samples of tumor cell, same tumor and same treatment: 3 patient with good prognosis and 3 patient with bad prognosis. I want to compare the gene expression among the two groups.
> 
> I ran the `edgeR` pakage like follow:
> 
> x <- read.delim("my_reads_count.txt", row.names="GENE")
> group <- factor(c(1,1,1,2,2,2))
> y <- DGEList(counts=x,group=group)
> y <- calcNormFactors(y)
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> et <- exactTest(y)
> 
> I obtained a very odd results: in some cases I had a very low p-value and FDR but looking at the raw data it is obvious that the difference between the two groups can't be significant.
> 
> This is an example for `my_reads_count.txt`:
> 
> GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3
> ENSG00000198842 0 3 2 2 6666 3
> ENSG00000257017 3 3 25 2002 29080 4
> 
> And for `my_edgeR_results.txt`:
> 
> GENE logFC logCPM PValue FDR
> ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04
> ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07
> 
> It seems very odd that "0 3 2" versus "2 6666 3" is significant....
> I would like that the variance within the group is considered. Can you help me? Some suggestion?
> Are there some alternative function in edgeR that is capable to fix my problem??
> 
> -- output of sessionInfo(): 
> 
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-redhat-linux-gnu (64-bit)
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] edgeR_3.0.8              limma_3.12.3             tweeDEseqCountData_1.0.8 Biobase_2.16.0           BiocGenerics_0.2.0       BiocInstaller_1.8.3     
> 
> loaded via a namespace (and not attached):
> [1] tools_2.15.0
> 
> --
> Sent via the guest posting facility at bioconductor.org.
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor