[BioC] Normalization by DEseq

Wed Oct 20 11:20:57 CEST 2010

Hi

On 10/19/2010 03:10 PM, Wolfgang Huber wrote:
> Normalisation: Briefly, the normalisation works as follows: if k_ij is
> the count of the i-th gene (or in your case, I guess, taxon) in the j-th
> sample, then we compute f_i as the geometric mean of these values across
> samples. The normalised count is k_ij / f_i.

Wolfgang forgot to mention the median, so let me try again:

First consider two replicate samples, indexed with j=1 and j=2. As they 
are replicates, we expect the counts for a given gene i to always have 
the same ratio k_i1 / k_i2. Of course, it will not be exactly the same 
ratio, but if you plot a histogram of these ratios, you will see that 
there is a clear, narrow peak, and its median (or its mode) should be a 
good estimate for the actual sequencing depth ratio of the two samples. 
Even if they are not replicates, the median should still be a reasonable 
estimator as long as not more than half of the genes are differentially 
expressed.

In order to deal with more than two samples, we define a fictive 
"reference sample" against which to compare everything. This reference 
sample has, as value for gene i, the geometric mean f_i over all samples 
j of the counts k_ij.

Then, to get a size factor for sample j, we divide the counts for the 
genes, k_ij, by the respective reference values f_j, and use the median 
of all those ratios k_ij/f_i as the size factor:
    s_j = median_j k_ij / f_i

>> I am wondering if I want to do differential expression analysis for
>> these two groups, should I filter out the group specific transcripts
>> before
>> putting into DEseq? Will this affect the normalization step?

Normally not, unless the H. pylori transcripts (which, I understand, you 
expect to turn up only in the patients, not in the healthy control) make 
up more than around half of the transcripts.

A useful diagnostic to double-check the size factors is a histogram of 
the ratios. You can do this as follows:

   library(DESeq)

   # get an example count data set -- or use your data:
   cds <- makeExampleCountDataSet()

   # estimate the size factors:
   cds <- estimateSizeFactors( cds )

   # calculate the gene-wise geometric means
   geomeans <- exp( rowMeans( log( counts(cds) ) ) )

   # choose a sample whose size factor estimate we want to check
   j <- 1

   # just for clarity: the size factor was estimated as described above,
   # so the following two lines give the same result
   print( sizeFactors(cds)[j] )
   print( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] )  )

   # Plot a histogram of the ratios of which we have just taken
   # the median:
   hist( log2( counts(cds)[,j] / geomeans ), breaks=100 )

   # This histogram should be unimodal, with a clear peak at the value
   # of the size factor. Mark it in red:
   abline( v=log2( sizeFactors(cds)[ j ] ), col="red" )

If the histogram looks fine, the normalization has succeeded. If the 
size factor does not hit the median, you can either filter down to genes 
which are present in control and patient, as you suggested, or you could 
also try to replace the median with another mode estimator, say, the 
shorth (defined in the genefilter package). (I had once a data set where 
the shorth seemed to work better than the median, but in general, I'd 
stick to the median, unless the histogram seems weird.)

Best regards

   Simon