[BioC] "Normalising" GRanges coverage object
Ryan
rct at thompsonclan.org
Fri Aug 8 20:40:26 CEST 2014
Hi Aliaksei,
First of all, dividing by library size is not the way to go. It's wrong
in ChIP-seq for the same reason it's wrong in RNA-seq: compositional
bias. In order to properly normalize for sequencing depth, I would
recommend the strategy described here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/
You just need to look at the small section about calculating
normalization factors using 10kb bins. Once you calculate the
normalization factors, multiply them by the nominal library sizes to
get the normalized library sizes. Then divide the raw coverage by the
normalized library sizes to get the normalized coverage, which has been
adjusted for both sequencing depth and compositional bias. You should
be able to normalize input and ChIP samples together in this way as
long as your ChIP peaks are not omnipresent throughout the entire
genome.
From there, you could theoretically subtract from each sample the
corresponding input (or maybe the average of all inputs, since they
should all be equivalent) to get enrichment above input, or divide to
get fold enrichment over input. You are correct in your supposition
that arithmetic operations on Rle objects (i.e. the type of object
returned by coverage) work element-wise, just like arithmetic on
ordinary R numeric vectors.
Of course you are correct that if the input coverage is zero at any
point, then dividing by it will result in an infinite ratio, which you
probably don't want. You could take another page out of the RNA-seq
playbook and add a small constant to each coverage object before
dividing to regularize the ratios. You could also smooth out the input
coverage by replacing each coverage value with a mean over a window
centered at that position. You can do this with the runmean function
from the caTools package (from CRAN, not Bioconductor), using options
endrule="mean" and align="center". Depending on the sparseness of your
coverage, sufficient smoothing might eliminate zero values.
Anyway, that's about all I have to suggest. I hope you find something
useful in there.
-Ryan
On Fri Aug 8 10:19:39 2014, Aliaksei Holik wrote:
> Dear Bioconductors and especially *Ranges team,
>
> I would like your expert opinion on whether what I'm doing is a
> sensible thing to do.
>
> I wish to plot some ChIP read coverage data in Gviz. The way I do it
> is by reading a number of bam files with readGAlignments() and
> calculating the coverage with the likewise named function. I would
> also like to normalise the coverage for each bam file for library size
> and the Input sample. I do so by simply dividing the coverage objects
> by the respective library size and by the coverage object for the
> Input sample. It all works beautifully, and the graphs look ok. But
> I'm left with a nagging feeling, that it's too easy to be true. So I
> guess my question is: am I right to assume that by dividing one
> coverage object by another, the coverage at each position in one
> object would be divided by the coverage at the same very position in
> the other object. I'm almost expecting something to go wrong, if, for
> instance, I have 0 coverage at some position in my reference object.
> But perhaps I should have faith in Bioconductor.
>
> Many thanks,
>
> Aliaksei.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list