[BioC] How to use DESeq to normalize and estimate variance in a RNAseq timecourse analysis

Simon Anders anders at embl.de
Fri May 11 16:23:19 CEST 2012


Hi

On 05/11/2012 11:31 AM, Marie Sémon wrote:
> The size factors estimated on the complete table are the following:
>
> Ctrl1 0.811399035473249
> Ctrl2 0.858304900598826
> Ctrl3 0.959802357106788
> T1 0.947672016144435
> T2 1.05315240155981
> T3 1.13022212977686
> T4 1.22731615452888
> T5 1.19028477928069
>
> The size factors estimated on a partial table (restricted to Controls +
> T5) are the following:
>
> Ctrl1 0.868784756382365
> Ctrl2 0.918880737221278
> Ctrl3 1.020617176156
> T5 1.2627166738945
>
> As you can see, they seem to be quite different.

Actually, not at all. The ratio of Ctrl1 to T5, for example, is 
0.81/1.19=0.68 in the full data and 0.87/1.26=0.69 in the reduced case.

Note that the absolute values of the size factors are meaningless and 
arbitrary (and DESeq simply sets them to have geometric mean 1). only 
the ratios are important, and they are the same.

 > This seem to translate
> in different numbers of significant genes (between Ctr and T5) for the
> two cases (2755 genes with padj<0.001 when the complete table is taken
> into account, and 2976 genes with padj <0.001 for the partial table is
> taken into account). Furthermore, the lists do not overlap completely:
>
> FALSE TRUE
> FALSE 18135 303
> TRUE 82 2673

I'm not sure if 303/2673 is a large or a small change. Maybe many genes 
were close to the significance threshold and just stumbled to the other 
side. This is why a scatter plot of log p values (non-adjusted) is 
usually more informative to check whether two different ways to test for 
the same null hypothesis give similar results.

By the way, cutting at 0.001 is extremely stringent. Commonly, people 
cut at 0.05 or 0.1. Remember that the threshold is the false discovery 
rate (FDR), i.e., you decide which fraction of false positives you are 
willing to tolerate in your result list.

> We picked up randomly two genes (gene A and gene B) and show DEseq
> results comparing Ctrls and T5, after normalizing using the Partial or
> Complete table
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1129.345865 965.1611989 1621.899863 1.680444536 0.748842926
> 0.000170905 1.19E-03
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1203.113666 1030.619339 1720.596647 1.669478324 0.739397362
> 8.74E-06 9.09E-05
>
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 16.32456228 3.55138732 54.64408717 15.38668758 3.943610779
> 4.28E-05 3.47E-04
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 17.33523065 3.79053399 57.96932062 15.29318053 3.934816571
> 0.001910755 1.06E-02

You are right, the difference in the p values is appreciable. The log 
fold changes are similar, though, so normalization is not the issue. 
Maybe compare the dispersion fit plots (see vignette), and the fitted 
dispersion values (accessible via 'fitInfo').

I am a bit confused how this can happen. Maybe post your full commands, 
with sessionInfo etc. I wonder if something might be wrong there (unless 
you may have uncovered a bug.)

   Simon



More information about the Bioconductor mailing list