[BioC] DESeq normalisation strategy

Davide Cittaro cittaro.davide at hsr.it
Thu May 30 08:08:22 CEST 2013


Hi Simon, 

On May 29, 2013, at 11:46 AM, Simon Anders <anders at embl.de> wrote:

> Hi Davide
> 
> On 29/05/13 10:58, Davide Cittaro wrote:
>> I've been reading about the DESeq normalization strategy and, as far as I understand, it works on a per-sample basis: the counts for each sample are normalized by a factor calculated using the geometric mean of the counts.
>> Three questions:
>> - is this strategy robust when comparing samples with extremely different library sizes?
> 
> Sure, why shouldn't it be?
> 

You know, just a check :-) 
In a small dataset I artificially reduced the counts for one sample by different factors and checked the ratios between the counts of that sample and an invariant one. Indeed, they are different, but the RMS is really small.
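
A minimal sketch of that kind of check, in plain Python rather than R. It assumes the size factor of a sample is the median, over features, of the ratio between its count and the per-feature geometric mean across samples. The function names (size_factors, downscale) and the toy counts are made up for illustration; this is not the DESeq code itself.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of features, each a list of per-sample counts.
    Returns one size factor per sample: the median ratio of the sample's
    count to the per-feature geometric mean (computed in log space)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero, so skip this feature
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def downscale(sample, factor):
    """Artificially reduce a sample's counts, rounding to integers."""
    return [round(c * factor) for c in sample]

# Hypothetical toy counts for one "invariant" sample.
a = [120, 45, 800, 33, 1500, 260]

# Exact (unrounded) scaling: the size factors recover the scale exactly.
b_exact = [c * 0.1 for c in a]
sf_exact = size_factors([list(pair) for pair in zip(a, b_exact)])

# Integer counts: rounding makes the normalized ratios deviate slightly.
b = downscale(a, 0.1)
sf = size_factors([list(pair) for pair in zip(a, b)])
ratios = [(cb / sf[1]) / (ca / sf[0]) for ca, cb in zip(a, b)]
rms = math.sqrt(sum((r - 1.0) ** 2 for r in ratios) / len(ratios))

print("size factors:", sf)
print("RMS deviation of normalized ratios from 1:", rms)
```

With exact scaling the ratio of the two size factors equals the scaling factor; with rounded counts the normalized ratios scatter around 1, and the RMS of that scatter is the quantity looked at above.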
> 
> The notion of "calculating cpm on normalized counts" is hence a 
> contradiction in terms.

I somewhat agree with you, but I'm a bit puzzled that I've seen this in other packages (such as edgeR, but that may be another story). 
> 
>> - counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?
> 
> In principle, yes. The problem is that once your features are very small, 
> very many of the counts may be zero, and the geometric mean of any set 
> of numbers containing at least one zero is zero. Hence, you can only use 
> features with sufficiently high counts to get a stable estimate, and you 
> may not have enough of these.

Well, that also happens with intervals, especially if you deal with some kinds of ChIP-seq experiments. The way you calculate the factors goes through log(counts), and you exclude intervals with at least one zero count. I tried estimating the size factors by subsampling my dataset to 1/10 of it, and the factor estimates are quite robust. 
My problem, if that was not clear, is that I would like to have a normalization strategy for signals across the genome. Typically these are at the small-interval level (less than 200 bp).
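
For what it's worth, the zero handling and the 1/10 subsampling test described above can be sketched as follows, again in plain Python and on simulated counts. The helper size_factors_nonzero, the relative depths and the noise model are all assumptions made for illustration, not taken from DESeq or from my data.

```python
import math
import random
from statistics import median

def size_factors_nonzero(counts):
    """Median-of-ratios size factors using only the features that have
    no zero count in any sample. Returns (factors, n_features_used)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    used = 0
    for row in counts:
        if any(c == 0 for c in row):
            continue  # log(0) is undefined; the geometric mean is zero
        log_gm = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_gm)
        used += 1
    if used == 0:
        raise ValueError("no feature has non-zero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios], used

rng = random.Random(0)
depth = [1.0, 0.5, 2.0]  # hypothetical relative sequencing depths

# Simulated small intervals: low signal, so some counts end up at zero.
counts = []
for _ in range(5000):
    base = rng.randint(0, 40)
    counts.append([int(base * d * rng.uniform(0.8, 1.2)) for d in depth])

full, n_full = size_factors_nonzero(counts)

# Re-estimate from a random tenth of the intervals.
subset = rng.sample(counts, len(counts) // 10)
sub, n_sub = size_factors_nonzero(subset)

print("usable intervals:", n_full, "of", len(counts))
print("size factors (all intervals): ", full)
print("size factors (1/10 subsample):", sub)
```

The smaller the intervals, the fewer of them survive the zero filter, so the thing to watch is how many usable intervals remain, not only whether the two sets of estimates agree.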

Thanks

d

