[BioC] edgeR - estimateGLMCommonDisp - warnings - huge logFC

Gordon K Smyth smyth at wehi.EDU.AU
Sat Jul 9 01:11:45 CEST 2011


Dear Ioannis,

It is asking quite a lot of a statistical procedure, to perform an 
analysis of statistical significance for your data, in the absence of any 
replication.  The short answer is that you are probably asking too much!

In the absence of replicates your choices are:

1. Be satisfied with an MDS plot and by looking at fold changes.  Do not 
attempt a significance analysis.

2. Estimate the dispersion by removing one or both of your factors from 
the model.  You could remove the factor with smallest fold changes for 
example.  This will only be successful if the number DE genes is 
relatively small.

3. You could try

   estimateGLMCommonDisp(method="deviance",robust=TRUE,subset=NULL)

This is our current best attempt at an automatic method to estimate 
dispersion without replicates.  It will only give good results when the 
counts are not too small and the number of DE genes is not too large.

4. You could identify a number of transcripts that should not be DE, and 
estimate the dispersion from them:

   d0 <- estimateCommonDisp(d[housekeeping,])

5. Simply pick a reasonable dispersion value (dispersion=0.3 say), based 
on your experience with similar data, and use that.  Although subjective, 
this is much better and more defensible than assuming Poission variation.


As far as fold-changes are concerned, I am currently working with a PhD 
student on the problem of shrinking fold changes to be more reliable, but 
the simple approach of adding a constant to your counts is not 
statistically meaningful in the context of an edgeR analysis.  Adding a 
constant may be a useful thing to do, and I often recommend:

   y <- t(log2( t(counts+0.5) / (lib.size+0.5))

for descriptive purposes, but the resulting log-abundances should be 
treating as roughly normal (eg analysed in limma) rather than analysed in 
edgeR.

Obviously, you can't creat a volcano plot unless you have a significance 
analysis as well.

Best wishes
Gordno

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Fri, 8 Jul 2011, Filippis, Ioannis wrote:

Dear Gordon,

thank you very much for the reply.

The problem in using as common dispersion the dispersion among the 
samples, is that this dispersion is too high leading to 0 DE genes. And I 
expect to have such high diversion, it is not some artifact of the data.

I also wonder whether it is statistically wrong to add 1 to the counts 
matrix before any edgeR analysis. This would lead to "meaningfull" logFC 
values and logical volcano plots.

Many thanks for your help.

Best regards,
Ioannis
________________________________________
From: Gordon K Smyth [smyth at wehi.EDU.AU]
Sent: 08 July 2011 12:16
To: Filippis, Ioannis
Cc: Bioconductor mailing list
Subject: edgeR - estimateGLMCommonDisp - warnings - huge logFC

Dear Ioannis,

If the counts are zeros for some libraries for some genes, then it should
be no surprise that some of the logFC might be very large.  The raw fold
changes are infinite.

The real problem though is that running estimateGLMCommonDisp() without
replicates is meaningless, since the dispersion is not actually estimable
without replicates.  The function will probably just return a dispersion
of zero in this case.

If you must analyse RNA-Seq data without replicates, you could estimate
the dispersion very roughly by treating all the libraries as if they were
replicates, by

    d2 <- estimateCommonDisp(d), or
    d2 <- estimateGLMCommonDisp(d)

and then proceed using this conservative dispersion estimate.

Best wishes
Gordon

> Date: Fri, 8 Jul 2011 08:18:56 +0000
> From: "Filippis, Ioannis" <i.filippis at imperial.ac.uk>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: [BioC] edgeR - estimateGLMCommonDisp - warnings - huge logFC
> Content-Type: text/plain
>
> Hi,
>
> I am using edgeR for a 2x2 factorial design (Strain*Treatment) without
> any replicates and the estimateGLMCommonDisp and glmFit functions.
>
> When I run estimateGLMCommonDisp, I get warnings
> 1: In optimize(f = fun, interval = interval^0.25, y = y,  ... :
>  NA/Inf replaced by maximum positive value
> and when I run glmFit and then glmLRT, I get huge fold change values for some genes.
>
> However, if I do a pairwise exactTest for the samples examined for the 
> above contrast, the fold change for that genes is high but normal.
>
> I would really appreciate any feedback on the cause of warnings and huge logFC.
>
> Many thanks for your help.
>
> Best,
> Ioannis

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list