[BioC] Normalization

Thu Feb 28 20:44:04 CET 2013

Oh, I just realized you are using the non-GLM-based mode of operation 
for edgeR. I am much more familiar with the GLM workflow, and I believe 
that the GLM-based workflow is now preferred over the exactTest-based 
one. In fact, I'm not even sure how to do an ANOVA-style comparison of 
3 or more groups using exactTest.
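
For reference, here is a minimal sketch of what an ANOVA-style test could look like in the GLM workflow. The counts are simulated stand-ins (not your data), the object names are only for illustration, and the group labels are the ones from your earlier message. With CONTROL as the reference level, testing coefficients 2 and 3 together asks whether LD or HD differs from CONTROL, i.e. whether there is any difference among the three groups.

```r
library(edgeR)

# Simulated stand-in for a real count table: 1000 genes x 9 samples.
set.seed(1)
counts <- matrix(rnbinom(9000, mu = 50, size = 10), ncol = 9)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# CONTROL is the reference level, so the remaining coefficients are
# LD-vs-CONTROL and HD-vs-CONTROL.
group <- factor(rep(c("CONTROL", "LD", "HD"), each = 3),
                levels = c("CONTROL", "LD", "HD"))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Design matrix with an intercept plus one column per non-reference group.
design <- model.matrix(~ group)

# GLM-based dispersion estimates.
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

# Fit the model, then test both non-intercept coefficients at once.
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2:3)
top <- topTags(lrt)
```

Passing a single coefficient (e.g. coef = 2) instead would give the LD-vs-CONTROL pairwise comparison from the same fit.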

In any case, the best way to describe what you are trying to do is to 
show the code you are using. The answers could depend on what options 
you are using, how you are calculating dispersions, and many other 
small factors. Also, please tell us which versions of R and edgeR you 
are using.
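
One way to collect that version information, as a small sketch using standard R calls (run it in the same session you use for the analysis and paste the output into your reply):

```r
# Version of R itself.
R.version.string

# Version of the installed edgeR package.
packageVersion("edgeR")

# Everything at once: R version, platform, and all attached packages.
sessionInfo()
```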

On Thu 28 Feb 2013 11:38:04 AM PST, Ryan C. Thompson wrote:
> Hi Vittoria,
>
> It would be best if you could show the code that gave you a list of
> differentially expressed genes and the code that gave you an empty
> list. Whether you are doing a pairwise comparison or a multi-way
> "ANOVA-style" comparison, edgeR is actually performing the same test.
> In general, if all three pairwise comparisons are yielding significant
> hits, I would expect some significant hits in the three-way comparison
> as well.
>
> -Ryan
>
> On Thu 28 Feb 2013 11:26:17 AM PST, Vittoria Roncalli wrote:
>> Hi Ryan,
>>
>> Thanks again for your explanation, you saved my day!
>> Considering your expertise, I would ask you another question.
>> I ran a simple one-way ANOVA on the raw count data (I have 3
>> treatments with 3 reps each) and found no significant difference
>> between them. Then, with edgeR, I was able to extract a list of
>> differentially expressed genes for each pairwise comparison. Is this
>> because the ANOVA is calculated on the overall library (total # genes)
>> while the DGE comes from a t-test for each individual gene? I found
>> this explanation in Bullard et al. 2010, but I am not sure whether I
>> have misunderstood something.
>>
>> Does it make sense to you?
>>
>> Have a good day, and thanks again for your help.
>>
>> Vittoria
>>
>> On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson
>> <rct at thompsonclan.org> wrote:
>>
>>     Hi Vittoria,
>>
>>     Please use "Reply All" so that your reply also goes to the mailing
>>     list.
>>
>>     The normalization factors are used to adjust the library sizes (I
>>     forget the details, I believe they are given in the User's Guide),
>>     and then the pseudo counts are obtained by normalizing the counts
>>     to the adjusted library sizes. Since you have not used any
>>     normalization factors (i.e. all norm factors = 1), the pseudo
>>     counts will simply be some constant factor of counts-per-million,
>>     if I'm not mistaken. If you want absolutely no normalization, you
>>     would have to set both the normalization factors and library sizes
>>     to 1, I think.
>>
>>     In any case, the pseudo counts are only for descriptive purposes.
>>     The statistical testing in edgeR happens using the raw integer
>>     counts.
>>
>>
>>     On 02/27/2013 10:12 PM, Vittoria Roncalli wrote:
>>>     Hi Ryan,
>>>
>>>     thanks for your reply.
>>>     I obtain pseudo.counts with the following commands:
>>>
>>>     > raw.data <- read.table("counts 2.txt",sep="\t",header=T)
>>>     > d <- raw.data[, 2:10]
>>>     > d[is.na(d)] <- 0
>>>     > rownames(d) <- raw.data[, 1]
>>>     > group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD")
>>>     > d <- DGEList(counts = d, group = group)
>>>     Calculating library sizes from column totals.
>>>     > keep <- rowSums(cpm(d) > 1) >= 3
>>>     > d <- d[keep,]
>>>     > dim(d)
>>>     [1] 28755 9
>>>     > d <- DGEList(counts = d, group = group)
>>>     Calculating library sizes from column totals.
>>>     > d <- estimateCommonDisp(d)
>>>
>>>
>>>     After estimating the common dispersion, the DGEList contains:
>>>
>>>     $counts
>>>     $samples
>>>     $common.dispersion
>>>     $pseudo.counts
>>>     $logCPM
>>>     $pseudo.lib.size
>>>
>>>     Then I write the pseudo.counts to a table, and I will continue
>>>     with those for the DGE.
>>>
>>>     Considering that I did not normalize the libraries, what do the
>>>     counts in the pseudo.counts output represent?
>>>
>>>
>>>     Thanks so much
>>>
>>>
>>>     Vittoria
>>>     On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson
>>>     <rct at thompsonclan.org> wrote:
>>>
>>>         To answer your first question, when you first create a
>>>         DGEList object, all the normalization factors are initially
>>>         set to 1 by default. This is equivalent to no normalization.
>>>         Once you use calcNormFactors, the normalization factors will
>>>         be set appropriately.
>>>
>>>         I'm not sure about the second question. Could you provide an
>>>         example of how you are obtaining pseudocounts with edgeR?
>>>
>>>
>>>         On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote:
>>>
>>>             Hi, I am an edgeR user and I am a little confused about
>>>             normalization.
>>>             I am using edgeR to find differentially expressed genes
>>>             among 3 conditions (RNA-Seq) with 3 replicates each.
>>>             I am following the steps in the user guide:
>>>
>>>             - update counts file (from mapping against reference
>>>               transcriptome)
>>>             - filter out the low-count reads (1 cpm)
>>>             - reassess library size
>>>             - estimate common dispersion
>>>
>>>             My first question is about normalization. Why, after I
>>>             import my file, is there a norm.factors column next to
>>>             the library size?
>>>
>>>             $samples
>>>                              group lib.size norm.factors
>>>             X48h_C_r1.sam  CONTROL 10898526            1
>>>             X48h_C_r2.sam  CONTROL  7176817            1
>>>             X48h_C_r3.sam  CONTROL  9511875            1
>>>             X48h_LD_r1.sam      LD 11350347            1
>>>             X48h_LD_r2.sam      LD 14836541            1
>>>             X48h_LD_r3.sam      LD 12635344            1
>>>             X48h_HD_r1.sam      HD 11840963            1
>>>             X48h_HD_r2.sam      HD 17335549            1
>>>             X48h_HD_r3.sam      HD 10274526            1
>>>
>>>
>>>
>>>             Is the normalization automated? How does it differ from
>>>             calcNormFactors?
>>>
>>>             Moreover, if I do not run calcNormFactors, what is in the
>>>             pseudo.counts output?
>>>
>>>
>>>             I am very confused about those points.
>>>
>>>
>>>             Thanks in advance for your help.
>>>
>>>
>>>             Looking forward to hearing from you.
>>>
>>>
>>>             Vittoria
>>>
>>>
>>>     --
>>>
>>>     Vittoria Roncalli
>>>
>>>     Graduate Research Assistant
>>>     Center Békésy Laboratory of Neurobiology
>>>     Pacific Biosciences Research Center
>>>     University of Hawaii at Manoa
>>>     1993 East-West Road
>>>     Honolulu, HI 96822 USA
>>>
>>>     Tel: 808-4695693
>>>