[BioC] Normalization

Gordon K Smyth smyth at wehi.EDU.AU
Fri Mar 1 08:52:05 CET 2013


Hi Ryan,

Everything else you say is correct, but the pseudo counts are not linearly 
related to counts-per-million, even when the norm factors are all 1. 
Their definition and purpose are described in Robinson and Smyth 
(Biostatistics, 2008).

Pseudo counts are used internally by edgeR to estimate the dispersions and 
to compute the exact tests.  They do not have a simple interpretation as 
normalized counts because they depend on the experimental design as well 
as on the library sizes.  We do not recommend using them for other purposes.

For descriptive purposes, users should use cpm() or similar.
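
For example (a minimal sketch with a made-up toy counts matrix; the object and
file names here are arbitrary, not taken from this thread):

    library(edgeR)

    ## Toy data standing in for a real counts table: 10 genes, 6 samples.
    set.seed(1)
    counts <- matrix(rpois(60, lambda = 50), nrow = 10,
                     dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))
    group <- rep(c("CONTROL", "TREATED"), each = 3)
    d <- DGEList(counts = counts, group = group)

    ## Descriptive, normalized expression values: counts per million and log2-CPM.
    norm.cpm  <- cpm(d)
    norm.lcpm <- cpm(d, log = TRUE)

    ## Export these, rather than d$pseudo.counts, for tables, heatmaps or plots.
    write.table(norm.cpm, file = "normalized_cpm.txt", sep = "\t", quote = FALSE)

If calcNormFactors() has been run first, cpm() automatically uses the adjusted
(effective) library sizes.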

Best wishes
Gordon

> Date: Wed, 27 Feb 2013 23:48:34 -0800
> From: "Ryan C. Thompson" <rct at thompsonclan.org>
> To: Vittoria Roncalli <roncalli at hawaii.edu>
> Cc: bioconductor <Bioconductor at r-project.org>
> Subject: Re: [BioC] Normalization
>
> Hi Vittoria,
>
> Please use "Reply All" so that your reply also goes to the mailing list.
>
> The normalization factors are used to adjust the library sizes (I forget 
> the details, I believe they are given in the User's Guide), and then the 
> pseudo counts are obtained by normalizing the counts to the adjusted 
> library sizes. Since you have not used any normalization factors (i.e. 
> all norm factors = 1), the pseudo counts will simply be some constant 
> factor of counts-per-million, if I'm not mistaken. If you want 
> absolutely no normalization, you would have to set both the 
> normalization factors and library sizes to 1, I think.
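
To make the library-size adjustment concrete, a short sketch assuming a DGEList
d (the arithmetic below is how the effective library sizes used by cpm() are
formed):

    ## Effective library sizes: raw column totals scaled by the norm factors.
    ## With norm.factors all equal to 1 the library sizes are unchanged.
    eff.lib.size <- d$samples$lib.size * d$samples$norm.factors

    ## Counts per million relative to those effective library sizes.
    manual.cpm <- t(t(d$counts) / eff.lib.size) * 1e6
    all.equal(manual.cpm, cpm(d))   # should agree with edgeR's own cpm()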
>
> In any case, the pseudo counts are only for descriptive purposes. The 
> statistical testing in edgeR happens using the raw integer counts.
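
As a sketch of that last point, using the standard classic-edgeR calls (the
CONTROL vs LD comparison is only an example, taken from the groups that appear
later in this thread):

    ## Testing works on the DGEList of raw integer counts; the user never
    ## passes in normalized or pseudo counts.  (Internally the exact test uses
    ## the pseudo counts Gordon describes above, which edgeR derives itself.)
    d  <- estimateCommonDisp(d)
    d  <- estimateTagwiseDisp(d)
    et <- exactTest(d, pair = c("CONTROL", "LD"))
    topTags(et)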
>
> On 02/27/2013 10:12 PM, Vittoria Roncalli wrote:
>> Hi Ryan,
>>
>> thanks for your reply.
>> I obtain pseudo.counts with the following commands:
>>
>> "
>>
>>> raw.data <- read.table("counts 2.txt", sep="\t", header=T)
>>> d <- raw.data[, 2:10]
>>> d[is.na(d)] <- 0
>>> rownames(d) <- raw.data[, 1]
>>> group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD")
>>> d <- DGEList(counts = d, group = group)
>> Calculating library sizes from column totals.
>>> keep <- rowSums(cpm(d) > 1) >= 3
>>> d <- d[keep,]
>>> dim(d)
>> [1] 28755 9
>>> d <- DGEList(counts = d, group = group)
>> Calculating library sizes from column totals.
>>> d <- estimateCommonDisp(d)
>>
>>
>> After estimating the common dispersion, I get the following components in the DGEList:
>>
>> $counts
>> $samples
>> $common.dispersion
>> $pseudo.counts
>> $logCPM
>> $pseudo.lib.size
>>
>> Then I write a table of the pseudo.counts and I will continue with
>> those for the DGE analysis.
>>
>> Considering that I did not normalize the libraries, what are the
>> different counts in the pseudo.counts output?
>>
>>
>> Thanks so much
>>
>>
>> Vittoria
>> On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson
>> <rct at thompsonclan.org> wrote:
>>
>>     To answer your first question, when you first create a DGEList
>>     object, all the normalization factors are initially set to 1 by
>>     default. This is equivalent to no normalization. Once you use
>>     calcNormFactors, the normalization factors will be set appropriately.
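
For instance (TMM is the default method of calcNormFactors(); it is written out
here only for clarity):

    d$samples$norm.factors                  # all 1 straight after DGEList()
    d <- calcNormFactors(d, method = "TMM")
    d$samples$norm.factors                  # TMM factors; lib.size itself is unchanged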
>>
>>     I'm not sure about the second question. Could you provide an
>>     example of how you are obtaining pseudocounts with edgeR?
>>
>>
>>     On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote:
>>
>>         Hi, I am an edgeR user and I am a little bit confused about the
>>         normalization topic.
>>         I am using edgeR to identify differentially expressed genes across 3
>>         conditions (RNA-Seq) with 3 replicates each.
>>         I am following the user guide steps:
>>
>>         - prepare the counts file (from mapping against the reference transcriptome)
>>         - filter out low-count genes (< 1 cpm)
>>         - reassess the library sizes
>>         - estimate the common dispersion
>>
>>         My first question is about normalization. Why, after I import my
>>         file, is there a norm.factors column next to the library size?
>>
>>         $samples
>>
>>                          group lib.size norm.factors
>>         X48h_C_r1.sam  CONTROL 10898526            1
>>         X48h_C_r2.sam  CONTROL  7176817            1
>>         X48h_C_r3.sam  CONTROL  9511875            1
>>         X48h_LD_r1.sam      LD 11350347            1
>>         X48h_LD_r2.sam      LD 14836541            1
>>         X48h_LD_r3.sam      LD 12635344            1
>>         X48h_HD_r1.sam      HD 11840963            1
>>         X48h_HD_r2.sam      HD 17335549            1
>>         X48h_HD_r3.sam      HD 10274526            1
>>
>>
>>
>>         Is the normalization automated? How does it differ from
>>         calcNormFactors()?
>>
>>         Moreover, if I do not run calcNormFactors, what is in the
>>         pseudo.counts output?
>>
>>
>>         I am very confused about those points.
>>
>>
>>         Thanks in advance for your help.
>>
>>
>>         Looking forward to hearing from you.
>>
>>
>>         Vittoria
>>
>>
>>
>>
>> --
>>
>> Vittoria Roncalli
>>
>> Graduate Research Assistant
>> Békésy Laboratory of Neurobiology
>> Pacific Biosciences Research Center
>> University of Hawaii at Manoa
>> 1993 East-West Road
>> Honolulu, HI 96822 USA
>>
>> Tel: 808-4695693
>>
