[BioC] Single nucleotide based RNAseq normalization with edgeR

Mark Robinson mrobinson at wehi.EDU.AU
Wed Feb 9 23:16:35 CET 2011


Hi Sridhara.

On 2011-02-10, at 4:34 AM, Sridhara Gupta Kunjeti wrote:

> Hello Mark,
> This continues the discussion about normalizing the counts.
> Did you mean
> 
> (count / library size) * Norm.factor
> 
> Can I take the library size and Norm.factor values from edgeR?


No.  I mean what I wrote in both previous posts.  I'll repeat it once more, hopefully third time lucky:

rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6

So, this translates to:

count / (lib.size*Norm.factor)

... and you may multiply by a constant to put it on a different scale (e.g. by 1e6, as I've done above, to get reads per million).  Remember the caveats I've mentioned previously: there is no need to do this for a differential expression analysis, since edgeR already builds this in, and it doesn't account for other biases such as gene length.
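
To make the direction of the division concrete, here is a minimal sketch in base R. It assumes a toy object `d` that only mimics the fields of a DGEList used above (`counts`, `samples$lib.size`, `samples$norm.factors`); the counts and normalization factors are invented for illustration and `eff.lib` and `rpm2` are names introduced here.

```r
# Toy stand-in for an edgeR DGEList: a plain list with the same fields
# used above. All numbers are made up for illustration.
counts <- matrix(c(10, 20, 30, 40,
                   30, 60, 90, 120),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("A", "B")))
d <- list(counts = counts,
          samples = data.frame(lib.size = colSums(counts),
                               norm.factors = c(0.8, 1.25),
                               row.names = colnames(counts)))

# Effective library size = library size * normalization factor
eff.lib <- d$samples$lib.size * d$samples$norm.factors

# Divide each column (sample) by its effective library size,
# then scale to reads per million
rpm <- t(t(d$counts) / eff.lib) * 1e6

# Same calculation with sweep(), which makes the column-wise division explicit
rpm2 <- sweep(d$counts, 2, eff.lib, "/") * 1e6
stopifnot(isTRUE(all.equal(rpm, rpm2)))

print(round(rpm))
```

With a real DGEList, `calcNormFactors()` would fill in `norm.factors`; the values 0.8 and 1.25 above are placeholders only. The double transpose works because R recycles `eff.lib` down the rows of `t(d$counts)`, so each sample is divided by its own effective library size.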

Hope that helps.
Mark




> Thanks,
> Sridhara
> 
> 
> On Mon, Feb 7, 2011 at 5:11 PM, Mark Robinson <mrobinson at wehi.edu.au> wrote:
> Hi Jens/Sridhara.
> 
> A few thoughts below.
> 
> On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote:
> 
> > Hi Gordon,
> > First I would like to thank Jens for asking the questions that I asked
> > a few days ago.
> > In addition to Jens's question, I have one more question about my RNA-seq
> > data:
> > 1. I would like to know if I can multiply the counts for each gene by the
> > norm.factor (calculated by the "calcNormFactors( )" function)
> 
> 
> Sridhara, you've asked this exact question before and I answered (short answer is: NO to multiplying ... instead, divide by [library size]*[normalization factor]):
> 
> https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
> https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html
> 
> Perhaps you can clarify what you don't understand.
> 
> 
> > On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg <
> > jens.georg at biologie.uni-freiburg.de> wrote:
> >
> >> Hi Gordon,
> >> thank you for your reply. The resolution of our ~100nt solexa reads is too
> >> small to detect individual processing sites, so we want to investigate every
> >> single nucleotide individually ("single nucleotide based normalization").
> >> That means that we count how often an individual nucleotide is covered by
> >> sequence reads. Of course, this approach will virtually increase the
> >> lib.size by a factor which depends on the length of the solexa reads. As the
> >> lib.size is critical for the normalization, I am not sure if I should use
> >> the original read numbers for each library or the read numbers multiplied
> >> by the read length to adjust for the single nucleotide investigation.
> 
> 
> So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped.  While I think this will work out ok (the normalization calculation should be fine), this coverage calculation does impose a (strong?) dependence between adjacent nucleotides.  One alternative would be to count the reads that *begin* at a given nucleotide and only consider these.  Then your library sizes are as normal.
> 
> 
> >> I have two more questions regarding the normalization:
> >> 1. Are the norm factors calculated by the calcNormFactors( ) function
> >> automatically used for further steps like the estimateCommonDisp( )
> >> function?
> 
> Yes.
> 
> 
> >> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized
> >> readcounts?
> 
> Yes, but this is only accounting for overall depth and potential composition biases, not for length biases (or any others).  It is with the intention of making inferences of a given gene across conditions.  The inferences for differential expression are still done on the raw counts.
> 
> Hope that helps.
> Mark
> 
> 
> 
> 
> >>
> >> Many thanks
> >>
> >> Jens
> >>
> >> Hi Jens,
> >>>
> >>> I don't know what you mean by single nucleotide based normalization,
> >>> however the following comments may be helpful.
> >>>
> >>> edgeR automatically adjusts for library sizes, whether you include an
> >>> explicit normalization step or not.  Normalization is a separate issue, and
> >>> is intended to deal with more subtle issues.
> >>>
> >>> Normalization, as edgeR does it, does not require replicates.
> >>>
> >>> Best wishes
> >>> Gordon
> >>>
> >>> Date: Fri, 04 Feb 2011 11:28:15 +0100
> >>>> From: Jens Georg <jens.georg at biologie.uni-freiburg.de>
> >>>> To: bioconductor at r-project.org
> >>>> Subject: [BioC] Single nucleotide based RNAseq normalization with
> >>>>   edgeR?
> >>>> Message-ID: <4D4BD4BF.4010009 at biologie.uni-freiburg.de>
> >>>> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> >>>>
> >>>>
> >>>>
> >>>> Dear edgeR users and developers,
> >>>>
> >>>> we used Solexa sequencing in order to detect RNase E processing sites.
> >>>> Therefore we split an RNA sample and treated one half with RNase E
> >>>> prior to cDNA synthesis and sequencing. The libraries differ in size
> >>>> (1.918.953 and 1.208.586 reads respectively) which clearly necessitates
> >>>> a normalization step. Furthermore we expect site specific differences
> >>>> rather than differences in the accumulation of the full length RNAs.
> >>>>
> >>>> So I want to ask whether it is appropriate to do a single nucleotide based
> >>>> normalization with edgeR, and whether you think a reliable basic normalization
> >>>> is possible without replicates?
> >>>>
> >>>> Thank you for your comments.
> >>>>
> >>>> Best regards
> >>>>
> >>>> Jens
> >>>>
> >>>
> >>>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> >
> >
> > --
> > Sridhara G Kunjeti
> > PhD Candidate
> > University of Delaware
> > Department of Plant and Soil Science
> > email- sridhara at udel.edu
> > Ph: 832-566-0011
> >
> 
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: mrobinson at wehi.edu.au
> e: m.robinson at garvan.org.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> ------------------------------
> 
> 
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the addressee.
> You must not disclose, forward, print or use it without the permission of the sender.
> ______________________________________________________________________
> 
> 
> 



