[BioC] Single nucleotide based RNAseq normalization with edgeR

Wed Feb 9 23:16:35 CET 2011

Hi Sridhara.

On 2011-02-10, at 4:34 AM, Sridhara Gupta Kunjeti wrote:

> Hello Mark,
> Continuing the discussion on normalizing the counts:
> did you mean
> 
> (count / library size) * Norm.factor
> Can I take the library size and Norm.factor values from edgeR?

No.  Actually, I mean what I wrote in both previous posts.  I'll repeat it once more; hopefully third time lucky:

rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6

So, this translates to:

count / (lib.size*Norm.factor)

... and you may multiply by a constant to put it on a different scale (e.g. by 1e6, as I've done above, to get reads per million).  Remember the caveats I've mentioned previously: there is no need to do this for a differential expression analysis, since edgeR already builds this in, and it doesn't account for other biases such as gene length.
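
In case a toy example helps, here is a minimal sketch in base R. The list `d` below only mimics the `counts` and `samples` components of a DGEList, and all numbers are made up for illustration; with real data you would use the object returned by calcNormFactors() instead.

```r
# Toy stand-in for an edgeR DGEList; every value here is hypothetical.
counts <- matrix(c(10, 20, 30, 40,
                   30, 60, 90, 120),
                 ncol = 2,
                 dimnames = list(paste0("gene", 1:4), c("A", "B")))
d <- list(
  counts  = counts,
  samples = data.frame(lib.size     = colSums(counts),
                       norm.factors = c(0.8, 1.25),
                       row.names    = colnames(counts))
)

# Effective library size per sample: lib.size * norm.factor
eff.lib <- d$samples$lib.size * d$samples$norm.factors

# t(t(x) / v) divides column j of x by v[j]; a plain x / v would
# recycle v down the rows instead of across the columns.
rpm <- t(t(d$counts) / eff.lib) * 1e6

# The same calculation written with sweep() over the columns
rpm2 <- sweep(d$counts, 2, eff.lib, "/") * 1e6
```

The double transpose is there only because R recycles vectors column-wise; sweep() says the same thing more explicitly.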

Hope that helps.
Mark

> Thanks,
> Sridhara
> 
> 
> On Mon, Feb 7, 2011 at 5:11 PM, Mark Robinson <mrobinson at wehi.edu.au> wrote:
> Hi Jens/Sridhara.
> 
> A few thoughts below.
> 
> On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote:
> 
> > Hi Gordon,
> > First I would like to thank Jens for asking the questions that I had asked
> > a few days ago.
> > In addition to Jens's question, I have one more question on my RNA-seq
> > data:
> > 1. I would like to know if I can multiply the counts for each gene by the
> > norm.factor (calculated by the "calcNormFactors( )" function)
> 
> 
> Sridhara, you've asked this exact question before and I answered (short answer is: NO to multiplying ... instead, divide by [library size]*[normalization factor]):
> 
> https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
> https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html
> 
> Perhaps you can clarify what you don't understand.
> 
> 
> > On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg <
> > jens.georg at biologie.uni-freiburg.de> wrote:
> >
> >> Hi Gordon,
> >> thank you for your reply. The resolution of our ~100nt solexa reads is too
> >> small to detect individual processing sites, so we want to investigate every
> >> single nucleotide individually ("single nucleotide based normalization").
> >> That means that we count how often an individual nucleotide is covered by
> >> sequence reads. Of course, this approach will virtually increase the
> >> lib.size by a factor which depends on the length of the solexa reads. As the
> >> lib.size is critical for the normalization, I am not sure if I should use
> >> the original read numbers for each library or the read numbers multiplied
> >> by the read length to adjust for the single nucleotide investigation.
> 
> 
> So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped.  While I think this will work out ok (the normalization calculation should be fine), this coverage calculation does impose a (strong?) dependence between adjacent nucleotides.  One alternative would be to count only the reads that *begin* at a given nucleotide.  Then your library sizes are as normal.
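
To make the difference between the two counting schemes concrete, here is a small base R sketch. The read positions, read length and reference length are hypothetical toy values, not anything taken from Jens's data.

```r
# Hypothetical toy data: 5 reads of length 4 on a 12-nt reference.
genome.len <- 12
read.len   <- 4
starts     <- c(1, 1, 3, 6, 9)   # 1-based start position of each read

# (a) Per-nucleotide coverage: each read adds 1 to every position it spans.
coverage <- integer(genome.len)
for (s in starts) {
  idx <- s:(s + read.len - 1)
  coverage[idx] <- coverage[idx] + 1L
}

# (b) Per-nucleotide start counts: each read is counted once, at its first base.
start.counts <- tabulate(starts, nbins = genome.len)

sum(coverage)      # equals length(starts) * read.len
sum(start.counts)  # equals length(starts)
```

With coverage counting the total is inflated by the read length and neighbouring positions share most of their reads; with start counting each read contributes exactly once.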
> 
> 
> >> I have two more questions regarding the normalization:
> >> 1. Are the norm factors calculated by the calcNormFactors( ) function
> >> automatically used for further steps like the estimateCommonDisp( )
> >> function?
> 
> Yes.
> 
> 
> >> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized
> >> readcounts?
> 
> Yes, but this only accounts for overall depth and potential composition biases, not for length biases (or any others).  The intention is to allow inferences about a given gene across conditions.  The inferences for differential expression are still done on the raw counts.
> 
> Hope that helps.
> Mark
> 
> 
> 
> 
> >>
> >> Many thanks
> >>
> >> Jens
> >>
> >>> Hi Jens,
> >>>
> >>> I don't know what you mean by single nucleotide based normalization,
> >>> however the following comments may be helpful.
> >>>
> >>> edgeR automatically adjusts for library sizes, whether you include an
> >>> explicit normalization step or not.  Normalization is a separate issue, and
> >>> is intended to deal with more subtle issues.
> >>>
> >>> Normalization, as edgeR does it, does not require replicates.
> >>>
> >>> Best wishes
> >>> Gordon
> >>>
> >>>> Date: Fri, 04 Feb 2011 11:28:15 +0100
> >>>> From: Jens Georg <jens.georg at biologie.uni-freiburg.de>
> >>>> To: bioconductor at r-project.org
> >>>> Subject: [BioC] Single nucleotide based RNAseq normalization with
> >>>>   edgeR?
> >>>> Message-ID: <4D4BD4BF.4010009 at biologie.uni-freiburg.de>
> >>>> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> >>>>
> >>>>
> >>>>
> >>>> Dear edgeR users and developers,
> >>>>
> >>>> we used Solexa sequencing in order to detect RNase E processing sites.
> >>>> Therefore we split an RNA sample and treated one half with RNase E
> >>>> prior to cDNA synthesis and sequencing. The libraries differ in size
> >>>> (1,918,953 and 1,208,586 reads respectively) which clearly necessitates
> >>>> a normalization step. Furthermore we expect site specific differences
> >>>> rather than differences in the accumulation of the full length RNAs.
> >>>>
> >>>> So I want to ask whether it is appropriate to do a single nucleotide based
> >>>> normalization with edgeR, and whether you think a reliable basic normalization
> >>>> is possible without replicates.
> >>>>
> >>>> Thank you for your comments.
> >>>>
> >>>> Best regards
> >>>>
> >>>> Jens
> >>>>
> >>>
> >>>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> >
> >
> > --
> > Sridhara G Kunjeti
> > PhD Candidate
> > University of Delaware
> > Department of Plant and Soil Science
> > email- sridhara at udel.edu
> > Ph: 832-566-0011
> >
> 
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: mrobinson at wehi.edu.au
> e: m.robinson at garvan.org.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> ------------------------------
> 
> 
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the addressee.
> You must not disclose, forward, print or use it without the permission of the sender.
> ______________________________________________________________________
> 
> 
> 
