[BioC] DEseq for chip-seq data normalisation

Giuseppe Gallone giuseppe.gallone at dpag.ox.ac.uk
Tue Nov 5 13:57:11 CET 2013


Hello Lucia

This is all great info! Thank you very much for taking the time to share
your findings. I am indeed using DiffBind and have found some interesting
results, but I now need direct access to my downsampled mapped
reads.

With DiffBind, I was basically feeding in my mapped reads and my peak
intervals and getting a differential analysis out (after setting up a
contrast).

Now I'd like to try something slightly different, and for that I need to
work with normalised BAMs of my samples. The problem is that my 10 BAM
files have wildly varying numbers of mapped reads. I would like to
downsample them all to a common minimum depth before examining
quantitative differences in the peak signals across them.
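
To make "downsampling to a common minimum" concrete, here is a minimal
Python sketch. The sample names and read counts are made up, and
keep_fractions and keep_read are hypothetical helpers, not functions from
DiffBind, DESeq or Picard. It computes the fraction of reads each sample
would keep to end up at the depth of the smallest sample, and decides per
read name whether to keep a read, so that both mates of a pair are kept
or dropped together.

```python
import hashlib

# Made-up mapped-read counts per sample (hypothetical names and numbers);
# real values would come from the aligner or a BAM statistics tool.
mapped_reads = {"s1": 40_000_000, "s2": 25_000_000, "s3": 10_000_000}


def keep_fractions(counts):
    """Fraction of reads to keep per sample so that every sample ends up
    at the depth of the smallest one."""
    target = min(counts.values())
    return {sample: target / n for sample, n in counts.items()}


def keep_read(read_name, fraction, seed=42):
    """Deterministically keep roughly `fraction` of all read names.

    Hashing the name (rather than drawing a random number per record)
    means both mates of a pair get the same decision.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return value < int(fraction * 2**64)


fractions = keep_fractions(mapped_reads)
for sample, frac in fractions.items():
    print(f"{sample}: keep {frac:.3f} of reads")
```

The per-sample fraction is the kind of value one would then hand to a
downsampling tool as its keep probability; the exact option name depends
on the tool and should be checked in its documentation.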

I was hoping I could do this with DESeq: feed it some BAMs and obtain
normalised versions of them. But I understand this is not possible?

I guess I will try to downsample my BAMs myself, for example with
Picard, and take it from there. Are there any alternatives you'd
suggest? I know MACS can also downsample BAMs. Thanks!

Giuseppe

On 11/04/13 18:13, Lucia Peixoto wrote:
> Hi Giuseppe,
> Unfortunately there is not much available for doing statistics on
> ChIP-seq data. It is my experience that the data show exactly the same
> overdispersion problem that is seen in RNA-seq, so using edgeR, DESeq or
> DESeq2 to analyze ChIP-seq data is the way to go. There are a couple of
> challenges along the way that make this undertaking not quite
> straightforward. The only Bioconductor package I know of that tries to
> tackle these issues is DiffBind, so you can give it a try.
>
> One of the main differences is that, unlike gene or exon coordinates,
> peaks in your individual replicates will not be in exactly the same
> place. If you are working with TF data this will not be too bad, but
> anything nucleosome-associated will have a considerable phase shift from
> replicate to replicate. So you first have to do some sort of merging of
> reproducible peaks into regions.
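
A minimal sketch of the merging step described above, assuming peaks are
given as BED-style (chrom, start, end) tuples with 0-based, half-open
coordinates. merge_peaks is a hypothetical helper, not part of DiffBind or
any other package; it joins overlapping peaks from all replicates into
regions and records which replicates contributed to each region.

```python
from collections import defaultdict


def merge_peaks(peaks_by_replicate):
    """Merge overlapping peaks across replicates.

    peaks_by_replicate: {replicate_name: [(chrom, start, end), ...]}
    Returns a list of (chrom, start, end, set_of_replicate_names).
    """
    by_chrom = defaultdict(list)
    for rep, peaks in peaks_by_replicate.items():
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end, rep))

    regions = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        cur_reps = set()
        for start, end, rep in sorted(by_chrom[chrom]):
            if cur_start is not None and start < cur_end:
                # Overlaps the region being built: extend it.
                cur_end = max(cur_end, end)
                cur_reps.add(rep)
            else:
                # No overlap: close the previous region, start a new one.
                if cur_start is not None:
                    regions.append((chrom, cur_start, cur_end, cur_reps))
                cur_start, cur_end, cur_reps = start, end, {rep}
        if cur_start is not None:
            regions.append((chrom, cur_start, cur_end, cur_reps))
    return regions


# Made-up example peaks for three replicates.
example = {
    "rep1": [("chr1", 100, 200), ("chr1", 500, 600)],
    "rep2": [("chr1", 150, 250)],
    "rep3": [("chr1", 900, 950)],
}
merged = merge_peaks(example)
for chrom, start, end, reps in merged:
    print(chrom, start, end, sorted(reps))
```

The set of contributing replicates is kept so that regions can later be
filtered by how many replicates support them.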
>
> I do not recommend doing the peak calling on the pooled data. After
> doing several ChIP-seq experiments with replicates I have observed that
> a lot of peaks, even ones with high z-scores/low p-values, do not show
> up in more than one replicate (but maybe this is particular to my type
> of experiments). Merging all the peaks leads to a high number of false
> positives. So you need to integrate the peak locations into a single
> file, but make sure each peak has a minimum number of carriers;
> I usually require presence in at least 2 of the replicates.
> You can make a GFF file to feed into HTSeq, in which you define
> the reproducible peak regions in your samples as if it were the GFF with
> the gene models, but making this file takes a little bit of work.
> We are currently preparing a package for CRAN submission to specifically
> integrate the analysis of ChIP-seq data with replicates with edgeR and
> DESeq, addressing most of what I mentioned above and including a peak
> caller for ease of flow of the analysis. I cannot finish the submission
> until the accompanying biological paper is out, so it won't be available
> until next year.
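
The replicate filter and the GFF file mentioned above could be sketched
roughly as follows. This is an illustration under assumptions, not the
method of the package being prepared: regions_to_gtf is a hypothetical
helper, the input regions are assumed to carry a count of supporting
replicates and to use BED-style 0-based, half-open coordinates, and the
feature type "exon" and attribute "gene_id" are chosen because they are
the htseq-count defaults.

```python
def regions_to_gtf(regions, min_support=2, feature="exon", id_attr="gene_id"):
    """Turn (chrom, start, end, n_supporting_replicates) tuples into
    GTF-style lines, keeping only regions seen in >= min_support replicates.

    Input coordinates are BED-style (0-based, half-open); GFF/GTF is
    1-based and inclusive, so only the start is shifted by one.
    """
    lines = []
    for chrom, start, end, support in regions:
        if support < min_support:
            continue  # not reproducible enough, skip
        peak_id = f"peak_{chrom}_{start}_{end}"
        fields = [
            chrom,
            "merged_peaks",       # source column (free text)
            feature,
            str(start + 1),
            str(end),
            ".",                  # score
            ".",                  # strand: peaks are unstranded
            ".",                  # frame
            f'{id_attr} "{peak_id}";',
        ]
        lines.append("\t".join(fields))
    return lines


# Made-up regions: (chrom, start, end, number of supporting replicates).
example_regions = [
    ("chr1", 100, 250, 2),
    ("chr1", 500, 600, 1),
    ("chr2", 50, 80, 3),
]
gtf_lines = regions_to_gtf(example_regions)
print("\n".join(gtf_lines))
```

Because the peaks have no strand, counting against such a file would
presumably need to be done in unstranded mode.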
>
> hope this was helpful
> best
>
> Lucia
>
>
>
> On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone
> <giuseppe.gallone at dpag.ox.ac.uk>
> wrote:
>
>     Hi
>
>     I would like to use DESeq or DESeq2 to normalise the peak signal for
>     some ChIP-seq data across 10 biological replicates.
>
>     I started looking at the DESeq documentation - it seems the program
>     requires a matrix arrangement of raw count data, where each row is a
>     peak and each column is a replicate.
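
To illustrate the layout described above (one row per peak, one column per
replicate, raw counts in the cells), here is a minimal Python sketch. It
is not how DESeq or HTSeq compute counts: count_matrix is a hypothetical
helper, reads are reduced to a single (chrom, position) each, peaks are
assumed to be non-overlapping BED-style intervals, and all data are made
up.

```python
import bisect
from collections import defaultdict


def count_matrix(peaks, reads_by_replicate):
    """Count reads per peak and replicate.

    peaks: list of (chrom, start, end), 0-based half-open, non-overlapping.
    reads_by_replicate: {replicate_name: [(chrom, position), ...]}
    Returns (replicate_names, matrix) where matrix[row][col] is the raw
    count for peaks[row] in replicate_names[col].
    """
    # Index peaks per chromosome, sorted by start, remembering the row.
    index = defaultdict(list)
    for row, (chrom, start, end) in enumerate(peaks):
        index[chrom].append((start, end, row))
    for chrom in index:
        index[chrom].sort()
    starts = {chrom: [p[0] for p in plist] for chrom, plist in index.items()}

    replicates = list(reads_by_replicate)
    matrix = [[0] * len(replicates) for _ in peaks]
    for col, rep in enumerate(replicates):
        for chrom, pos in reads_by_replicate[rep]:
            if chrom not in index:
                continue
            # Last peak whose start is <= pos, if any.
            i = bisect.bisect_right(starts[chrom], pos) - 1
            if i >= 0:
                start, end, row = index[chrom][i]
                if start <= pos < end:
                    matrix[row][col] += 1
    return replicates, matrix


peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
reads = {
    "rep1": [("chr1", 120), ("chr1", 150), ("chr1", 550), ("chr2", 10)],
    "rep2": [("chr1", 199), ("chr1", 200), ("chr2", 60), ("chr3", 5)],
}
names, counts = count_matrix(peaks, reads)
print(names)
for peak, row in zip(peaks, counts):
    print(peak, row)
```

With real data the positions would come from the BAM files and the peaks
from the BED or narrowPeak files.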
>
>     What is the best way to obtain this? I have BAM files for the reads,
>     obtained with BWA, and BED files (or alternatively narrowPeak files)
>     for the peak intervals, obtained using MACS.
>
>     I gather it is possible to use a program called HTSeq to compute
>     these counts; however, this program seems unable to deal with BED
>     files, only with GFF files, and I'd prefer to work directly with my
>     BED files if at all possible. Thank you.
>
>     Best regards
>     Giuseppe
>
> --
> Lucia Peixoto PhD
> Postdoctoral Research Fellow
> Laboratory of Dr. Ted Abel
> Department of Biology
> School of Arts and Sciences
> University of Pennsylvania
>
> "Think boldly, don't be afraid of making mistakes, don't miss small
> details, keep your eyes open, and be modest in everything except your
> aims."
> Albert Szent-Gyorgyi


