[BioC] "Normalising" GRanges coverage object

Ryan rct at thompsonclan.org
Fri Aug 8 20:40:26 CEST 2014


Hi Aliaksei,

First of all, dividing by the raw library size alone is not the way to 
go. It's wrong in ChIP-seq for the same reason it's wrong in RNA-seq: 
compositional bias. To normalize properly for sequencing depth, I would 
recommend the strategy described here: 
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/

You just need to look at the short section about calculating 
normalization factors from 10kb bins. Once you have the normalization 
factors, multiply them by the nominal library sizes to get the 
normalized (effective) library sizes. Then divide the raw coverage by 
the normalized library sizes to get the normalized coverage, which is 
adjusted for both sequencing depth and compositional bias. You should 
be able to normalize input and ChIP samples together in this way as 
long as your ChIP enrichment does not cover most of the genome.
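
In rough (untested) R code, that might look something like the sketch 
below. I'm assuming a named list of GAlignments objects called 'galns', 
one per sample, as returned by readGAlignments(); the names are just 
placeholders, and I'm using edgeR's calcNormFactors() for the TMM step 
described in the paper. Scaling to reads per million is just my choice 
to keep the numbers readable.

library(GenomicRanges)
library(GenomicAlignments)
library(edgeR)

## Count reads in 10kb bins genome-wide for every sample
bins <- tileGenome(seqlengths(galns[[1]]), tilewidth = 10000,
                   cut.last.tile.in.chrom = TRUE)
bin.counts <- sapply(galns, function(ga) countOverlaps(bins, ga))

## TMM normalization factors computed from the binned counts
lib.sizes <- sapply(galns, length)
norm.factors <- calcNormFactors(bin.counts, lib.size = lib.sizes)

## Normalized ("effective") library sizes
eff.lib <- norm.factors * lib.sizes

## Normalized coverage: raw coverage divided by the effective library
## size in millions
cov.norm <- mapply(function(ga, el) coverage(ga) / (el / 1e6),
                   galns, eff.lib, SIMPLIFY = FALSE)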

From there, you could theoretically subtract from each sample the 
corresponding input (or maybe the average of all inputs, since they 
should all be equivalent) to get enrichment above input, or divide to 
get fold enrichment over input. You are correct in your supposition 
that arithmetic operations on Rle objects (i.e. the type of object 
returned by coverage) work element-wise, just like arithmetic on 
ordinary R numeric vectors.
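
For example (continuing the sketch above, with made-up sample names 
like "chip1" and "input"):

## Element-wise operations on the normalized coverage (RleList objects)
input <- cov.norm[["input"]]

## Enrichment above input (can go negative where input exceeds ChIP)
enrich.diff <- cov.norm[["chip1"]] - input

## Fold enrichment over input (beware division by zero, see below)
enrich.fold <- cov.norm[["chip1"]] / input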

Of course you are correct that if the input coverage is zero at any 
point, then dividing by it will result in an infinite ratio, which you 
probably don't want. You could take another page out of the RNA-seq 
playbook and add a small constant (a pseudocount) to each coverage 
object before dividing to regularize the ratios. You could also smooth 
out the input 
coverage by replacing each coverage value with a mean over a window 
centered at that position. You can do this with the runmean function 
from the caTools package (from CRAN, not Bioconductor), using options 
endrule="mean" and align="center". Depending on the sparseness of your 
coverage, sufficient smoothing might eliminate zero values.
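
Something along these lines, again untested; the pseudocount of 0.5 and 
the 501 bp window are arbitrary values that you would want to tune:

library(caTools)

## (1) Add a small pseudocount to both coverages before the ratio
pseudo <- 0.5
enrich.fold <- (cov.norm[["chip1"]] + pseudo) / (input + pseudo)

## (2) Smooth the input with a running mean. caTools::runmean operates
##     on plain numeric vectors, so coerce each chromosome's Rle to
##     numeric and back.
smooth.input <- endoapply(input, function(x)
    Rle(caTools::runmean(as.numeric(x), k = 501,
                         endrule = "mean", align = "center")))
enrich.fold.smooth <- cov.norm[["chip1"]] / smooth.input

Keep in mind that expanding a whole chromosome's Rle into an ordinary 
numeric vector can use a lot of memory, so you may want to restrict the 
smoothing to the regions you actually plan to plot in Gviz.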

Anyway, that's about all I have to suggest. I hope you find something 
useful in there.

-Ryan

On Fri Aug  8 10:19:39 2014, Aliaksei Holik wrote:
> Dear Bioconductors and especially *Ranges team,
>
> I would like your expert opinion on whether what I'm doing is a
> sensible thing to do.
>
> I wish to plot some ChIP read coverage data in Gviz. The way I do it
> is by reading a number of bam files with readGAlignments() and
> calculating the coverage with the likewise named function. I would
> also like to normalise the coverage for each bam file for library size
> and the Input sample. I do so by simply dividing the coverage objects
> by the respective library size and by the coverage object for the
> Input sample. It all works beautifully, and the graphs look ok. But
> I'm left with a nagging feeling, that it's too easy to be true. So I
> guess my question is: am I right to assume that by dividing one
> coverage object by another, the coverage at each position in one
> object would be divided by the coverage at the very same position in
> the other object? I'm almost expecting something to go wrong if, for
> instance, I have 0 coverage at some position in my reference object.
> But perhaps I should have faith in Bioconductor.
>
> Many thanks,
>
> Aliaksei.


