[BioC] "Normalising" GRanges coverage object

Aliaksei Holik salvador at bio.bsu.by
Sat Aug 9 01:48:27 CEST 2014


That's great. Thanks Ryan!

On 09/08/2014 04:40, Ryan wrote:
> Hi Aliaksei,
>
> First of all, dividing by library size is not the way to go. It's wrong
> in ChIP-seq for the same reason it's wrong in RNA-seq: compositional
> bias. In order to properly normalize for sequencing depth, I would
> recommend the strategy described here:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/
>
> You just need to look at the small section about calculating
> normalization factors using 10kb bins. Once you calculate the
> normalization factors, multiply them by the nominal library sizes to get
> the normalized library sizes. Then divide the raw coverage by the
> normalized library sizes to get the normalized coverage, which has been
> adjusted for both sequencing depth and compositional bias. You should be
> able to normalize input and ChIP samples together in this way as long as
> your ChIP peaks are not omnipresent throughout the entire genome.
>
>  From there, you could theoretically subtract from each sample the
> corresponding input (or maybe the average of all inputs, since they
> should all be equivalent) to get enrichment above input, or divide to
> get fold enrichment over input. You are correct in your supposition that
> arithmetic operations on Rle objects (i.e. the type of object returned
> by coverage) work element-wise, just like arithmetic on ordinary R
> numeric vectors.
>
> Of course you are correct that if the input coverage is zero at any
> point, then dividing by it will result in an infinite ratio, which you
> probably don't want. You could take another page out of the RNA-seq
> playbook and add a small constant to each coverage object before
> dividing to regularize the ratios. You could also smooth out the input
> coverage by replacing each coverage value with a mean over a window
> centered at that position. You can do this with the runmean function
> from the caTools package (from CRAN, not Bioconductor), using options
> endrule="mean" and align="center". Depending on the sparseness of your
> coverage, sufficient smoothing might eliminate zero values.
>
> Anyway, that's about all I have to suggest. I hope you find something
> useful in there.
>
> -Ryan
>
> On Fri Aug  8 10:19:39 2014, Aliaksei Holik wrote:
>> Dear Bioconductors and especially *Ranges team,
>>
>> I would like your expert opinion on whether what I'm doing is a
>> sensible thing to do.
>>
>> I wish to plot some ChIP read coverage data in Gviz. The way I do it
>> is by reading a number of bam files with readGAlignments() and
>> calculating the coverage with the likewise named function. I would
>> also like to normalise the coverage for each bam file for library size
>> and the Input sample. I do so by simply dividing the coverage objects
>> by the respective library size and by the coverage object for the
>> Input sample. It all works beautifully, and the graphs look ok. But
>> I'm left with a nagging feeling, that it's too easy to be true. So I
>> guess my question is: am I right to assume that by dividing one
>> coverage object by another, the coverage at each position in one
>> object would be divided by the coverage at the same very position in
>> the other object. I'm almost expecting something to go wrong, if, for
>> instance, I have 0 coverage at some position in my reference object.
>> But perhaps I should have faith in Bioconductor.
>>
>> Many thanks,
>>
>> Aliaksei.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list