[BioC] Single nucleotide based RNAseq normalization with edgeR

Mon Feb 7 23:11:49 CET 2011

Hi Jens/Sridhara.

A few thoughts below.

On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote:

> Hi Gordon,
> First I would like to thank Jens for asking the questions that I had asked
> a few days ago.
> In addition to Jens's question, I have one more question about my RNA-seq
> data
> 1. I would like to know if I can multiply the counts for each gene by the
> norm.factor (calculated by the "calcNormFactors( )" function).

Sridhara, you've asked this exact question before and I answered it. The short answer is NO: do not multiply the counts by the normalization factor. Instead, divide them by [library size]*[normalization factor]:

https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html

Perhaps you can clarify what you don't understand.
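
To make the arithmetic concrete, here is a minimal sketch in plain Python (not edgeR code). The library sizes are the two quoted further down in this thread; the counts and normalization factors are made-up values for illustration, and the function name `normalized_abundance` is hypothetical. It only shows the division by [library size]*[normalization factor], scaled to counts per million for readability.

```python
# Illustration only: plain Python arithmetic, not an edgeR call.
# Counts and normalization factors below are invented for the example.

def normalized_abundance(counts, lib_size, norm_factor, per=1_000_000):
    """Divide raw counts by the effective library size.

    effective library size = lib_size * norm_factor
    The result is scaled to 'per' reads (counts per million by default).
    """
    effective_lib_size = lib_size * norm_factor
    return [c * per / effective_lib_size for c in counts]


# Two libraries with the sizes mentioned in the thread.
lib_sizes = {"untreated": 1_918_953, "treated": 1_208_586}
# Hypothetical normalization factors (assumed, not computed here).
norm_factors = {"untreated": 1.1, "treated": 0.9}
# Hypothetical raw counts for three features.
raw_counts = {"untreated": [10, 200, 35], "treated": [8, 90, 40]}

for sample in lib_sizes:
    scaled = normalized_abundance(
        raw_counts[sample], lib_sizes[sample], norm_factors[sample]
    )
    print(sample, [round(x, 2) for x in scaled])
```

The raw counts themselves are left untouched; only the denominator changes with the normalization factor.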

> On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg <
> jens.georg at biologie.uni-freiburg.de> wrote:
> 
>> Hi Gordon,
>> thank you for your reply. The resolution of our ~100nt solexa reads is too
>> small to detect individual processing sites, so we want to investigate every
>> single nucleotide individually ("single nucleotide based normalization").
>> That means that we count, how often an individual nucleotide is covered by
>> sequence reads. Of course, this approach will virtually increase the
>> lib.size by a factor which depends on the length of the solexa reads. As the
>> lib.size is critical for the normalization, I am not sure if I should use
>> the original read numbers for each library or the read numbers multiplied
>> by the read length to adjust for the single nucleotide investigation.

So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped.  While I think this will work out OK (the normalization calculation should be fine), this coverage calculation does impose a (possibly strong) dependence between adjacent nucleotides, because each read adds to the count at every position it covers.  One alternative would be to count only the reads that *begin* at a given nucleotide.  Then your library sizes are as normal.
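
A toy sketch of the two counting schemes, in plain Python with made-up data. The function names `coverage_counts` and `start_counts` are hypothetical, positions are assumed to be 0-based, and strand is ignored. It shows that per-nucleotide coverage sums to (number of reads) x (read length), whereas read-start counts sum to the number of reads.

```python
# Illustration only: toy data, 0-based positions, single strand assumed.

def coverage_counts(read_starts, read_length, region_length):
    """Count how many reads cover each position in the region."""
    counts = [0] * region_length
    for start in read_starts:
        # Each read contributes to every position it overlaps,
        # which is what ties adjacent positions together.
        for pos in range(start, min(start + read_length, region_length)):
            counts[pos] += 1
    return counts


def start_counts(read_starts, region_length):
    """Count only the reads that begin at each position."""
    counts = [0] * region_length
    for start in read_starts:
        counts[start] += 1
    return counts


reads = [0, 2, 2, 5]  # hypothetical read start positions
cov = coverage_counts(reads, read_length=4, region_length=10)
starts = start_counts(reads, region_length=10)

print("coverage:", cov, "total =", sum(cov))
print("starts:  ", starts, "total =", sum(starts))
```

With four reads of length 4, the coverage total is 16 while the start-count total is 4, which is the inflation of the library size described above.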

>> I have two more questions regarding the normalization:
>> 1. Are the norm factors calculated by the calcNormFactors( ) function
>> automatically used for further steps like the estimateCommonDisp( )
>> function?

Yes.

>> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized
>> readcounts?

Yes, but this only accounts for overall depth and potential composition biases, not for length biases (or any others).  The intention is to make inferences about a given gene across conditions.  The inferences for differential expression are still done on the raw counts.

Hope that helps.
Mark

>> 
>> Many thanks
>> 
>> Jens
>> 
>> Hi Jens,
>>> 
>>> I don't know what you mean by single nucleotide based normalization,
>>> however the following comments may be helpful.
>>> 
>>> edgeR automatically adjusts for library sizes, whether you include an
>>> explicit normalization step or not.  Normalization is a separate issue, and
>>> is intended to deal with more subtle issues.
>>> 
>>> Normalization, as edgeR does it, does not require replicates.
>>> 
>>> Best wishes
>>> Gordon
>>> 
>>> Date: Fri, 04 Feb 2011 11:28:15 +0100
>>>> From: Jens Georg <jens.georg at biologie.uni-freiburg.de>
>>>> To: bioconductor at r-project.org
>>>> Subject: [BioC] Single nucleotide based RNAseq normalization with
>>>>   edgeR?
>>>> Message-ID: <4D4BD4BF.4010009 at biologie.uni-freiburg.de>
>>>> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
>>>> 
>>>> 
>>>> 
>>>> Dear edgeR users and developers,
>>>> 
>>>> we used Solexa sequencing in order to detect RNase E processing sites.
>>>> Therefore we split an RNA sample and treated one half with RNase E
>>>> prior to cDNA synthesis and sequencing. The libraries differ in size
>>>> (1.918.953 and 1.208.586 reads respectively) which clearly necessitates
>>>> a normalization step. Furthermore we expect site specific differences
>>>> rather than differences in the accumulation of the full length RNAs.
>>>> 
>>>> So I want to ask, if it is appropriate to do a single nucleotide based
>>>> normalization with edgeR and do you think a reliable basic normalization
>>>> is possible without replicates?
>>>> 
>>>> Thank you for your comments.
>>>> 
>>>> Best regards
>>>> 
>>>> Jens
>>>> 
>>> 
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:6}}
>>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 
> 
> -- 
> Sridhara G Kunjeti
> PhD Candidate
> University of Delaware
> Department of Plant and Soil Science
> email- sridhara at udel.edu
> Ph: 832-566-0011
> 
> 

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: mrobinson at wehi.edu.au
e: m.robinson at garvan.org.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------
