[BioC] The difference between three methods in calcNormFactors() in edgeR

Gordon K Smyth smyth at wehi.EDU.AU
Sun Jul 6 04:38:06 CEST 2014


Dear Zhan Tianyu,

The edgeR authors obviously recommend TMM.  It is the default and is used 
in all the edgeR examples and case studies.
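
As a minimal sketch of what "default" means here, the snippet below checks that calling calcNormFactors() with no method argument gives the same factors as asking for "TMM" explicitly. The count matrix is simulated and purely hypothetical (it is not the data from the original post), and the object names are chosen only for illustration.

```r
library(edgeR)

set.seed(1)
# Hypothetical counts: 1000 genes x 8 samples (not the poster's data)
counts <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000, ncol = 8)
grp <- factor(rep(0:1, each = 4))
d <- DGEList(counts = counts, group = grp)

# Normalization factors with the default method and with TMM requested explicitly
nf_default <- calcNormFactors(d)$samples$norm.factors
nf_tmm     <- calcNormFactors(d, method = "TMM")$samples$norm.factors

# TRUE if the default method is TMM
all.equal(nf_default, nf_tmm)
```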

I don't know of any published comparative study showing better performance 
for the other methods.

TMM is not however designed to work well with very small numbers of genes 
(such as your toy example with 10 genes).  Actually, your toy example does 
not fit the assumptions of any of the normalization methods because the 
majority of the genes (all but four in fact) are differentially expressed. 
I don't think you can learn much about the performance of the different 
methods on real data from this example.
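
To illustrate a setting that is closer to those assumptions, here is a rough sketch that simulates many genes with only a minority differentially expressed and then tabulates the factors from the three methods side by side. Everything about the simulation (10000 genes, 10% doubled in the first group, the gamma and negative binomial parameters) is an assumption made for illustration, not a recommendation and not a substitute for a real dataset.

```r
library(edgeR)

set.seed(2014)
n_genes <- 10000
n_per_group <- 4

# Hypothetical baseline expression levels (illustrative only)
mu0 <- rgamma(n_genes, shape = 2, rate = 0.01)

# Double the expected count of the first 10% of genes in the first group only
de <- seq_len(n_genes) <= n_genes / 10
mu1 <- ifelse(de, 2 * mu0, mu0)

# Expected counts: columns 1-4 are the first group, columns 5-8 the second
mu <- cbind(matrix(mu1, n_genes, n_per_group),
            matrix(mu0, n_genes, n_per_group))
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = n_genes)
colnames(counts) <- paste0("Sample", 1:8)

grp <- factor(rep(0:1, each = n_per_group))
d <- DGEList(counts = counts, group = grp)

# One column of normalization factors per method
methods <- c("TMM", "RLE", "upperquartile")
nf <- sapply(methods, function(m) {
  calcNormFactors(d, method = m)$samples$norm.factors
})
rownames(nf) <- colnames(counts)
round(nf, 3)
```

Within each method the factors are scaled to multiply to one, so what matters is how they differ between samples rather than their absolute values.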

If you think that TMM has given an incorrect result for a real dataset 
then I suggest that you send your data example offline to the TMM author, 
Mark Robinson, so that he can trouble-shoot.

There was no attachment with your email, and I don't think that you have 
examined the right thing to judge which is the better normalization.

Best wishes
Gordon

> Date: Fri, 4 Jul 2014 09:32:07 -0400
> From: Zhan Tianyu <sewen67 at gmail.com>
> To: bioconductor <bioconductor at r-project.org>
> Subject: [BioC] The difference between three methods in
> 	calcNormFactors() in edgeR
>
> Hello all,
>
>      I have a question concerning calcNormFactors() in edgeR. There
> are three methods that I could choose from: "TMM", "RLE", and
> "upperquartile". I am wondering how I could decide which one to use.
>
>      For example, consider a simple example like this: there are 10 genes
> in total, and 4 individuals in each of two groups. Therefore, the count
> data would be a 10*8 matrix, where each row is a gene, each column is an
> individual, columns 1-4 are the first group, and columns 5-8 are the second
> group. Among the 10 genes, 60% are differentially expressed: the counts of
> genes No. 3,4,5,6,8,9 in the first group are doubled, while the others are
> the same. Please see the attachments for this count data.
>
>      Then I generated the "group" factor via this command:
>      > grp <- as.factor(rep(0:1, each = 8/2))
>
>      After that, I generated the DGEList by:
>      > d <- DGEList(counts = counts, group = grp )
>
>       Then I calculated the normalization factor by edgeR:
>      >  n <- calcNormFactors(d)
>
>       By default, this function uses the "TMM" method. However, the
> normalization factors look like this:
>
>         group  lib.size     norm.factors
> Sample1     0   5062446  1.1195829383593
> Sample2     0   5062340  0.8154739771400
> Sample3     0   5062444  1.1195827474525
> Sample4     0   5062466  1.1403164060313
> Sample5     1   3000123  0.9624162935534
> Sample6     1   2999992  0.9624163157255
> Sample7     1   2999977  0.9624169648716
> Sample8     1   3000156  0.9624160077253
>
>        I think it is weird, because normalization factors for individuals
> 1 and 2 are quite different (1.11958, and 0.81547). However, from the
> counts data, their counts are generally the same (Please see the attachment
> for counts data).
>
>        Then I tried the RLE method:
>        n <- calcNormFactors(d,method="RLE")
>
>         The results are:
>
> $samples
>         group  lib.size     norm.factors
> Sample1     0   5062446  1.0886765699045
> Sample2     0   5062340  1.0886508565338
> Sample3     0   5062444  1.0886766741626
> Sample4     0   5062466  1.0886750099086
> Sample5     1   3000123  0.9185446848068
> Sample6     1   2999992  0.9185578680804
> Sample7     1   2999977  0.9185624609049
> Sample8     1   3000156  0.9185437155777
>
>          I think this time the results are more reasonable. My question is:
> how do I decide which method to use? Why does TMM give a weird result?
>
>         Thank you.
>
>
> Best regards,
>
> sewen67
