[BioC] edgeR normalization factors

Simon Anders anders at embl.de
Wed Jun 30 08:54:51 CEST 2010


Hi

On Tue, 29 Jun 2010 21:53:18 +0800 (CST), 王喆 <zhedianyou at yahoo.cn> wrote:
> I disagree with Naomi.
> 
> First, for a differential expression analysis, we prefer to use the counts
> as is, and use the normalization factors as offsets in the statistical
> modeling.  So, these normalization factors actually DO NOT change the data
> (this is unlike microarray data normalization).
> 
> Second, for clustering, visualization etc. you may want to calculate a
> normalized expression value.  Using the normalization factors that you
> calculate using calcNormFactors() multiplied by the library size (See
> Section 6 of the manual), you could DIVIDE your raw counts by this number
> for each library.  Maybe also multiply by 10M so you have counts per 10M?
> 
> I think what Naomi is talking about (highly expressed genes depressing the
> expression of other genes) is covered in our paper:
> http://genomebiology.com/2010/11/3/R25

For visualization, the normalized values should do the job. For
clustering, however, you may still run into problems, because count data,
normalized or not, are heteroskedastic: if you feed such data to a
typical distance function such as R's 'dist', the result will depend
almost entirely on the most strongly expressed genes, as they have the
largest variance.
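
To illustrate the point, here is a minimal sketch in base R. The count
values are made up for illustration (three genes, two samples) and do not
come from any real data set; the names 'counts', 'sq_diff' and 'share' are
likewise just placeholders.

```r
# Hypothetical normalized counts: rows are genes, columns are samples.
counts <- matrix(c(   10,    20,    # weakly expressed gene, 2-fold change
                     100,   130,    # moderately expressed gene, 30% change
                   10000, 10300),   # strongly expressed gene, 3% change
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("gene_low", "gene_mid", "gene_high"),
                                 c("sample_A", "sample_B")))

# Each gene's contribution to the squared Euclidean distance between samples.
sq_diff <- (counts[, "sample_A"] - counts[, "sample_B"])^2
share   <- sq_diff / sum(sq_diff)
print(round(share, 4))

# dist() works on rows, so transpose to get a sample-to-sample distance.
d <- dist(t(counts))
print(d)
```

In this toy example the strongly expressed gene accounts for about 99% of
the squared distance, although its relative change is by far the smallest.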

Hence, you should perform a variance-stabilizing transformation (VST) on
the data before handing it to dist (or to any other statistical function
that is designed for homoskedastic data).

Our 'DESeq' package (another tool for the same use case as edgeR, using a
different way to estimate variance) has such a function
('getVarianceStabilizedData'), but it assumes that you use DESeq's variance
estimation scheme. The vignette explains how to use it, e.g. for
clustering.

If you prefer to stick to edgeR: To my knowledge, it does not have this
functionality, but you could add it yourself with a one-liner as follows.
edgeR's variance-mean relationship is

   variance = mean + common_dispersion * mean^2

and from such a function, the VST is obtained by integrating variance^(-1/2)
w.r.t. the mean. According to Wolfram Alpha, this gives

  transformed_data =
     2 * asinh( sqrt( common_dispersion * normalized_count ) ) /
     sqrt( common_dispersion )

but you may want to double-check this.
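
One way to double-check it numerically is the following base-R sketch. The
function name 'vst_nb' is a hypothetical helper (not part of edgeR or
DESeq), and the dispersion value is an arbitrary assumption; in practice
you would plug in the common dispersion estimated by edgeR. The check
compares the numerical slope of the transformation with variance^(-1/2).

```r
# Hypothetical helper implementing the formula above; not an edgeR function.
vst_nb <- function(normalized_count, common_dispersion) {
  2 * asinh(sqrt(common_dispersion * normalized_count)) /
    sqrt(common_dispersion)
}

common_dispersion <- 0.1              # assumed value, for illustration only
mu <- c(1, 10, 100, 1000, 10000)      # a range of mean counts to test

# Slope of the transformation, estimated by a central difference.
h <- 1e-4 * mu
numeric_slope <- (vst_nb(mu + h, common_dispersion) -
                  vst_nb(mu - h, common_dispersion)) / (2 * h)

# Slope a VST should have: variance^(-1/2),
# with variance = mean + common_dispersion * mean^2.
expected_slope <- 1 / sqrt(mu + common_dispersion * mu^2)

slopes_agree <- isTRUE(all.equal(numeric_slope, expected_slope,
                                 tolerance = 1e-6))
print(slopes_agree)
```

If the two slopes agree, the transformed values could then be passed to
dist, e.g. dist(t(vst_nb(norm_counts, common_dispersion))), where
'norm_counts' stands for your own genes-by-samples matrix of normalized
counts.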

  Simon


