[BioC] Normalization

Ryan C. Thompson rct at thompsonclan.org
Thu Feb 28 22:50:34 CET 2013


If you want to do an ANOVA-style test for differences across all three 
groups, I highly recommend that you look into learning how to use edgeR 
in its GLM mode. The edgeR User's Guide has all the information you 
need for this.

If you are doing ANOVA using another software, then it is not doing the 
same tests as edgeR, so you should expect different results. edgeR is 
designed more specifically to operate on the kind of data produced by 
RNA-seq. It makes certain assumptions based on the fact that you are 
dealing with discrete count data rather than a continuous measure of 
expression. A full discussion of the differences is too long to include 
here. The edgeR User's Guide has some information, and the edgeR 
publications have a more complete discussion.

-Ryan

On Thu 28 Feb 2013 12:04:26 PM PST, Vittoria Roncalli wrote:
> Hi Ryan,
>
> I am using R version 2.15.1 on Mac.
>
> Actually I have run the ANOVA using another software, because I am not
> familiar with the GLM mode.
> The codes I used to get the exact test and the DGE are the following
>
>
> "# Estimate common dispersion
>
> d <- estimateCommonDisp(d)
>
>
> pseudo_counts<- data.frame(rownames(d$pseudo.alt),d$pseudo.alt)
>
> names(pseudo_counts) = c("A1","A2","A3","B1","B2","B3","C1","C2","C3")
>
> write.table(pseudo_counts,"pseudo_counts.txt",sep="\t",row=F,quote=F)
>
>
> RPD2_vs_UPD2_de.com<- exactTest(d, pair=c("A","B"))
>
>
> # Check to see if any NA values are present in results
>
> RPD2_vs_UPD2_de.com<- exactTest(d, pair=c("A","B"))
>
> dim(subset(RPD2_vs_UPD2_de.com$table,is.na
> <http://is.na>(RPD2_vs_UPD2_de.com$table$p.value)))
>
>
> # Top Tags
>
> RPD2_vs_UPD2_results<- topTags(RPD2_vs_UPD2_de.com, n=26788)
>
>
> RPD2_vs_UPD2_results.output<-
> merge(RPD2_vs_UPD2_results$table,d$pseudo.counts,by.x=0,by.y=0)
>
> names(RPD2_vs_UPD2_results.output) <-
> c("ID","logConc","logFC","p.val","adj.P.val","H1L","H2L","H8L","H9L","C4L","C5L","C8L","C9L")
>
>
> sum(p.adjust(RPD2_vs_UPD2_de.com$table$PValue, method="BH") < 0.05)
>
>
> #1038
>
>
> top.com <http://top.com><- topTags(RPD2_vs_UPD2_de.com, n=1038)
>
>
> sum(top.com <http://top.com>$table$logFC> 0)
>
> # 638
>
>
> sum(top.com <http://top.com>$table$logFC< 0)
>
> # 400 "
>
>
> Any ideas on how I can get the list of DGE genes even if I can't find
> significance with ANOVA?
>
>
>
>
>
>
> On Thu, Feb 28, 2013 at 9:44 AM, Ryan C. Thompson
> <rct at thompsonclan.org <mailto:rct at thompsonclan.org>> wrote:
>
>     Oh, I just realized you are using the non-GLM-based mode of
>     operation for edgeR. I am much more familiar with the GLM
>     workflow, and I believe that the GLM-based workflow is now
>     preferred over the exactTest-based one. In fact, I'm not even sure
>     how to do an ANOVA-style comparision of 3 or more groups using
>     exactTest.
>
>     In any case, the best way to describe what you are tyring to do is
>     to is to show the code you are using. The answers could depend on
>     what options you are using, how you are calculating dispersions,
>     and many other small factors. Also please tell us which versions
>     of R, and edgeR you are using.
>
>
>     On Thu 28 Feb 2013 11:38:04 AM PST, Ryan C. Thompson wrote:
>
>         Hi Vittoria,
>
>         It would be best if you could show code examples of what gave
>         you an
>         empty list and what gave you a list of differentially
>         expressed genes
>         and what code didn't. Whether you you are doing a pairwise
>         comparison
>         or a multi-way "ANOVA-style" comparison, edgeR is actually
>         performing
>         the same test. In general, if all three pairwise comparisons are
>         yielding significant hits, I would expect some significant
>         hits in the
>         three-way comparison as well.
>
>         -Ryan
>
>         On Thu 28 Feb 2013 11:26:17 AM PST, Vittoria Roncalli wrote:
>
>             Hi Ryan,
>
>             Thanks again for your explanation, you saved my day!
>             Considering your expertise, I would ask you another question.
>             I run on the raw data counts a simple one way anova (I have 3
>             treatments with 3 reps each) and I found out that there is no
>             significant difference between them. Then, with EdgeR I
>             was able, to
>             extract a list of DGE fro each pairwise comparison. Is
>             this because
>             the ANOVA is calculated on the overall library (total #
>             genes) while
>             the DGE comes from a t-test for each individual gene? I
>             found this
>             explanation on Bullard et al 2010, but I am not sure if I have
>             misunderstood something.
>
>             Does it make sense to you?
>
>             Have a good day,and thanks again for your help.
>
>             Vittoria
>
>             On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson
>             <rct at thompsonclan.org <mailto:rct at thompsonclan.org>
>             <mailto:rct at thompsonclan.org
>             <mailto:rct at thompsonclan.org>>> wrote:
>
>                 Hi Vittoria,
>
>                 Please use "Reply All" so that your reply also goes to
>             the mailing
>                 list.
>
>                 The normalization factors are used to adjust the
>             library sizes (I
>                 forget the details, I believe they are given in the
>             User's Guide),
>                 and then the pseudo counts are obtained by normalizing
>             the counts
>                 to the adjusted library sizes. Since you have not used any
>                 normalization factors (i.e. all norm factors = 1), the
>             pseudo
>                 counts will simply be some constant factor of
>             counts-per-million,
>                 if I'm not mistaken. If you want absolutely no
>             normalization, you
>                 would have to set both the normalization factors and
>             library sizes
>                 to 1, I think.
>
>                 In any case, the pseudo counts are only for
>             descriptive purposes.
>                 The statistical testing in edgeR happens using the raw
>             integer
>             counts.
>
>
>                 On 02/27/2013 10:12 PM, Vittoria Roncalli wrote:
>
>                     Hi Ryan,
>
>                     thanks for your reply.
>                     I obtain pesudo.counts with the following commands
>
>                     "
>
>                     > raw.data <- read.table("counts
>                 2.txt",sep="\t",header=T)
>
>                     > d <- raw.data[, 2:10]
>
>                     > d[is.na <http://is.na> <http://is.na>(d)] <- 0
>
>                     > rownames(d) <- raw.data[, 1]
>
>                     > group <-
>                 c("CONTROL","CONTROL","__CONTROL","LD","LD","LD","HD","__HD","HD")
>
>                     > d <- DGEList(counts = d, group = group)
>
>                     Calculating library sizes from column totals.
>
>                     > keep <- rowSums (cpm(d)>1) >=3
>
>                     > d <- d[keep,]
>
>                     > dim(d)
>
>                     [1] 28755 9
>
>                     > d <- DGEList(counts = d, group = group)
>
>                     Calculating library sizes from column totals.
>
>                     > d <- estimateCommonDisp(d)
>
>
>                     After the common dispersion, I get in the DGE list
>
>                     $counts
>
>                     $samples
>
>                     $commondispersion
>
>                     $pseudo.counts
>
>                     $logCPM
>
>                     $pseudo.lib.size
>
>
>
>                     Then I write a table for the pseudo.counts and I
>                 will continue
>                     with those for the DGE.
>
>                     Considering that I did non normalize the
>                 libraries, what are the
>                     different counts in the pseudo.counts output?
>
>
>                     Thanks so much
>
>
>                     Vittoria
>                     On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson
>                     <rct at thompsonclan.org
>                 <mailto:rct at thompsonclan.org>
>                 <mailto:rct at thompsonclan.org
>                 <mailto:rct at thompsonclan.org>>> wrote:
>
>                         To answer your first question, when you first
>                 create a
>                         DGEList object, all the normalization factors
>                 are initially
>                         set to 1 by default. This is equivalent to no
>                 normalization.
>                         Once you use calcNormFactors, the
>                 normalization factors will
>                         be set appropriately.
>
>                         I'm not sure about the second question. Could
>                 you provide an
>                         example of how you are obtaining pseudocounts
>                 with edgeR?
>
>
>                         On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria
>                 Roncalli wrote:
>
>                             Hi, I am a edgeR user and I am a little
>                 bit confused on
>                             the normalization
>                             topic.
>                             I am using EdgeR to get different
>                 expressed genes within
>                             3 conditions
>                             (RnaSeq) with 3 replicates each.
>                             I am following the user guide step:
>
>                             -update counts file (from mapping against
>                 reference
>                             transcriptome)
>                             - filter the low counts reads (1cpm)
>                             - reassess library size
>                             - estimate common dispersion
>
>                             Mi first question is related to the
>                 normalization. Why,
>                             after I import my
>                             file, next to the library size there is
>                 then column with
>                             norm.factors?
>
>                             $samples
>
>                                               group lib.size norm.factors
>
>                             X48h_C_r1.sam  CONTROL 10898526            1
>
>                             X48h_C_r2.sam  CONTROL  7176817            1
>
>                             X48h_C_r3.sam  CONTROL  9511875            1
>
>                             X48h_LD_r1.sam      LD 11350347            1
>
>                             X48h_LD_r2.sam      LD 14836541            1
>
>                             X48h_LD_r3.sam      LD 12635344            1
>
>                             X48h_HD_r1.sam      HD 11840963            1
>
>                             X48h_HD_r2.sam      HD 17335549            1
>
>                             X48h_HD_r3.sam      HD 10274526            1
>
>
>
>                             Is the normalization automated? What is
>                 the difference
>                             with the
>                             "calNormFactors?"
>
>                             Moreover, if I do not run the
>                 calNormFactors, what is
>                             into the
>                             pseudo.counts output?
>
>
>                             I am very confused about those points.
>
>
>                             Thanks in advance for your help.
>
>
>                             Looking forward to hearing from you.
>
>
>                             Vittoria
>
>
>
>                 _________________________________________________
>                             Bioconductor mailing list
>                 Bioconductor at r-project.org
>                 <mailto:Bioconductor at r-project.org>
>                             <mailto:Bioconductor at r-__project.org
>                 <mailto:Bioconductor at r-project.org>>
>                 https://stat.ethz.ch/mailman/__listinfo/bioconductor
>                 <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>                             Search the archives:
>
>                 http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>                 <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>                     --
>
>                     Vittoria Roncalli
>
>                     Graduate Research Assistant
>                     Center Békésy Laboratory of Neurobiology
>                     Pacific Biosciences Research Center
>                     University of Hawaii at Manoa
>                     1993 East-West Road
>                     Honolulu, HI 96822 USA
>
>                     Tel: 808-4695693 <tel:808-4695693>
>                 <tel:808-4695693 <tel:808-4695693>>
>
>
>
>
>
>             --
>
>             Vittoria Roncalli
>
>             Graduate Research Assistant
>             Center Békésy Laboratory of Neurobiology
>             Pacific Biosciences Research Center
>             University of Hawaii at Manoa
>             1993 East-West Road
>             Honolulu, HI 96822 USA
>
>             Tel: 808-4695693 <tel:808-4695693>
>
>
>
>
> --
>
> Vittoria Roncalli
>
> Graduate Research Assistant
> Center Békésy Laboratory of Neurobiology
> Pacific Biosciences Research Center
> University of Hawaii at Manoa
> 1993 East-West Road
> Honolulu, HI 96822 USA
>
> Tel: 808-4695693
>



More information about the Bioconductor mailing list