[BioC] normalization factors for ChIP/RNA-IP-seq data

Sun Jan 8 17:20:34 CET 2012

Hi Mali,

> OK, if updating DGEList$samples$lib.size is the way of using original
> library sizes, then I know how to do it, but still I'm not sure if this is
> the right way to go with this kind of IP-Input normalization

Yes, you can manually modify the lib.size and norm.factors elements. The product of these two is used as the "effective" library size, which plays a role similar to DESeq's sizeFactors.
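A minimal sketch of what that could look like, assuming edgeR is installed. The counts, sample names and aligned-read totals below are made up for illustration and are not the data from this thread:

```r
library(edgeR)

# Hypothetical toy counts (100 peaks x 4 samples), purely for illustration.
set.seed(1)
counts <- matrix(rnbinom(400, mu = 50, size = 5), ncol = 4,
                 dimnames = list(paste0("peak", 1:100),
                                 c("IP_1", "IP_2", "Input_1", "Input_2")))
group <- factor(c("IP", "IP", "Input", "Input"))

# By default lib.size is the column sum of the count matrix,
# i.e. the number of reads falling in peaks.
d <- DGEList(counts = counts, group = group)
d <- calcNormFactors(d, method = "TMM")

# Hypothetical totals of aligned reads per sample.
aligned <- c(IP_1 = 2.4e7, IP_2 = 1.7e7, Input_1 = 1.3e7, Input_2 = 3.4e7)

# Alternative: use the aligned-read totals as library sizes and
# reset the normalization factors to 1 so nothing else rescales them.
d2 <- d
d2$samples$lib.size     <- unname(aligned[colnames(counts)])
d2$samples$norm.factors <- rep(1, ncol(counts))

# The effective library size is the product of the two columns.
eff  <- d$samples$lib.size  * d$samples$norm.factors
eff2 <- d2$samples$lib.size * d2$samples$norm.factors
print(data.frame(sample = colnames(counts), in_peak_TMM = eff, aligned_total = eff2))
```

The same library sizes could also be supplied up front through the lib.size argument of DGEList(). Whether fixing the factors at 1 is appropriate is exactly the open question here, so treat this as a way to compare the two choices rather than a recommendation.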

I'd be inclined to look at M-vs-A / "smear" plots -- plotSmear() or maPlot() or similar -- to get a feel for what the normalization factors are actually doing.  Have you done this?
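To make the idea concrete, here is a rough base-R sketch of the M and A values for one IP sample against one Input sample, computed under two choices of library size. The counts and the aligned-read totals are hypothetical, and ma_values() is a helper written for this illustration, not an edgeR function; plotSmear() and maPlot() give a more polished version of the same picture.

```r
# Hypothetical counts for 200 peaks in one IP and one Input sample.
set.seed(2)
ip    <- rnbinom(200, mu = 80, size = 5)
input <- rnbinom(200, mu = 20, size = 5)

# Illustrative helper: log2 counts-per-million for each sample, then
# M = log-ratio and A = average log-abundance. A small prior avoids log(0).
ma_values <- function(x, y, lib.x = sum(x), lib.y = sum(y), prior = 0.5) {
  lx <- log2((x + prior) / lib.x * 1e6)
  ly <- log2((y + prior) / lib.y * 1e6)
  data.frame(A = (lx + ly) / 2, M = lx - ly)
}

ma_peak  <- ma_values(ip, input)                                # in-peak totals
ma_total <- ma_values(ip, input, lib.x = 2.4e7, lib.y = 3.4e7)  # hypothetical aligned totals

# Changing the library sizes shifts every M value by the same constant,
# so the question is where the bulk of the cloud should be centred.
plot(ma_peak$A, ma_peak$M, pch = 16, cex = 0.5,
     xlab = "A (average log2 CPM)", ylab = "M (log2 IP / Input)")
abline(h = 0, col = "red")
abline(h = median(ma_peak$M), col = "blue", lty = 2)
```

If the cloud sits far from M = 0 under one scaling and close to it under another, that tells you what the normalization factors are compensating for.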

>> As you can see, DESeq and edgeR are weighting-up Input samples and
>> weighting-down IP. I suppose this is due to the fact that far fewer Input
>> reads are found in peak regions compared to IP, which makes DESeq and edgeR
>> think that the Input library size is much lower than IP.

My interpretation of this is that the Input-seq populations are more diverse, so you are sequencing them to a lower depth (on average, relative to total).

> In fact, the
> original library size of Input samples is in most cases larger than the IP.

How was the peak detection done?  That may have an influence too.

Anyway, I don't think you can decide on the "right way" without a serious look at the data.

Regards,
Mark

----------
Prof. Dr. Mark Robinson
Bioinformatics
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich
Switzerland

v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y32-J-34
w: http://tiny.cc/mrobin

On 08.01.2012, at 16:58, mali salmon wrote:

> OK, if updating DGEList$samples$lib.size is the way of using original
> library sizes, then I know how to do it, but still I'm not sure if this is
> the right way to go with this kind of IP-Input normalization
> Mali
> 
> On Sun, Jan 8, 2012 at 5:43 PM, mali salmon <shalmom1 at gmail.com> wrote:
> 
>> Dear List
>> I have peak counts from RNA-IP samples and corresponding inputs, for two
>> different conditions.
>> I would like to find DE-binding between the two IP conditions after
>> removing the differential expression effect.
>> In a previous post (titled "differential binding question") Mark Robinson
>> suggested doing a GLM analysis.
>> Before doing the DE analysis I have to normalize the data.
>> 
>> Using DESeq "estimateSizeFactors" function I get the following sizeFactors
>> 
>>> sizeFactors( cds )
>>      cond1_IP    cond1_IP.1   cond1_Input cond1_Input.1      cond2_IP
>>     6.3672619     6.1015548     0.3209480     0.2553967     3.2300114
>>    cond2_IP.1    cond2_IP.2   cond2_Input cond2_Input.1
>>     1.7808445     1.7027369     0.2480639     0.2530747
>> 
>> With edgeR, these are the normalization factors I get using the TMM and RLE
>> methods
>>> dTMM$samples
>>               group lib.size norm.factors
>> cond1_IP          H  8345160    0.9916792
>> cond1_IP.1        H  9395446    1.2221615
>> cond1_Input       H  1126656    0.4489350
>> cond1_Input.1     H   219823    2.1955057
>> cond2_IP          S  5707895    0.8339317
>> cond2_IP.1        S  5914904    0.5014391
>> cond2_IP.2        S  5602070    0.5043970
>> cond2_Input       S   223442    1.9909578
>> cond2_Input.1     S   226840    1.9934207
>> 
>>> dRLE$samples
>>               group lib.size norm.factors
>> cond1_IP          H  8345160    1.2656111
>> cond1_IP.1        H  9395446    1.0772223
>> cond1_Input       H  1126656    0.4725259
>> cond1_Input.1     H   219823    1.9271892
>> cond2_IP          S  5707895    0.9386643
>> cond2_IP.1        S  5914904    0.4994138
>> cond2_IP.2        S  5602070    0.5041749
>> cond2_Input       S   223442    1.8415393
>> cond2_Input.1     S   226840    1.8505947
>> 
>> 
>> The "real" library sizes (number of reads that have been successfully
>> aligned in each sample) are
>> cond1_IP     24055908
>> cond1_IP     16654296
>> cond1_Input  12919153
>> cond1_Input  33778948
>> cond2_IP     17340233
>> cond2_IP     29284664
>> cond2_IP     27788144
>> cond2_Input  33477921
>> cond2_Input  33980303
>> 
>> As you can see, DESeq and edgeR are weighting-up Input samples and
>> weighting-down IP. I suppose this is due to the fact that far fewer Input
>> reads are found in peak regions compared to IP, which makes DESeq and edgeR
>> think that the Input library size is much lower than IP. In fact, the
>> original library size of Input samples is in most cases larger than the IP.
>> 
>> What do you think, shall I use the original library sizes as normalization
>> factors instead of the calculated ones? I know this is possible with DESeq,
>> but I couldn't find how to do it with edgeR.
>> 
>> Thanks
>> Mali
>> 
>> 