[BioC] DESeq normalisation strategy

Simon Anders anders at embl.de
Wed May 29 11:46:10 CEST 2013


Hi Davide

On 29/05/13 10:58, Davide Cittaro wrote:
> I've been reading about the DESeq normalization strategy and, as far as I understand, it works on a per-sample basis: counts for each sample are normalized according to a factor calculated using the geometric mean of the counts.
> Three questions:
> - is this strategy robust when comparing samples with extremely different library sizes?

Sure, why shouldn't it be?
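The size factor for a sample is essentially the median of the ratios
of that sample's counts to the per-gene geometric means, so a large
difference in sequencing depth simply shifts all the ratios and is
absorbed by the factor. Roughly, the estimator looks like this
(plain R; the toy count matrix is made up for illustration):

    ## rows = genes, columns = samples (toy numbers)
    counts <- matrix(c( 10,  50,  20,
                       100, 480, 210,
                         5,  30,   9), nrow = 3, byrow = TRUE)
    ## per-gene log geometric mean across samples
    loggeomeans <- rowMeans(log(counts))
    ## size factor per sample: median of the ratios count/geomean,
    ## using only genes whose geometric mean is finite (no zeros)
    sf <- apply(counts, 2, function(cnts)
        exp(median((log(cnts) - loggeomeans)[is.finite(loggeomeans)])))
    sf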

> - If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?

Actually, no. I assume that by "cpm", you mean "Counts per million", 
which is a terse phrase meaning "number of reads mapped to the feature 
per one million aligned reads". As such, "cpm" is _defined_ to mean 
the quantity that you get by dividing the counts for your feature by the 
number of aligned reads and multiplying by one million.

The notion of "calculating cpm on normalized counts" is hence a 
contradiction in terms.
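To make the definition concrete (the variable names here are purely
illustrative):

    cpm <- feature_count / total_aligned_reads * 1e6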

The whole point of DESeq's library size normalization is, of course, 
that simply dividing by the number of aligned reads is not a good 
strategy to get numbers which can be compared across samples, and that 
hence cpm, RPKM, FPKM or any of the other variations on the "per 
million" scheme are not useful quantities for differential analyses.

> - counts are calculated on genomic intervals; would the same approach make sense if I used counts on single nucleotides?

In principle, yes. The problem is that once your features are very small, 
very many of the counts may be zero, and the geometric mean of any set 
of numbers containing at least one zero is zero. Hence, you can only use 
features with sufficiently high counts to get a stable estimate, and you 
may not have enough of these.
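
To see the problem (plain R, toy numbers):

    x <- c(0, 12, 37, 5)
    exp(mean(log(x)))   # log(0) is -Inf, so the geometric mean is 0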


   Simon


