[BioC] DiffBind & Chip-exo

Tue Jul 23 20:56:07 CEST 2013


>Hi Giuseppe-
>
>I'm glad you sorted the column thing out; that was what I suspected.
>
>There shouldn't be much problem doing the analysis without a control
>track, particularly if the samples come from the same tissue. The main
>role of the control tracks is for peak calling. The reason the control
>track is less important for differential analysis is that you are looking
>at the relative differences in read density at the same genomic intervals
>across samples, and not comparing read densities across intervals. So if
>the control track were similar at that location for all samples, it will
>not affect the differential analysis. The main issue would be if there
>were something like big copy number differences between samples. Then you
>could get sites that show as differentially bound when the real difference
>was the copy number. But the difference would be real regardless.
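
To make the cancellation argument above concrete, here is a minimal sketch in
base R. All numbers (the counts, the shared background factor, and the copy
number factor) are made up for illustration, and this is plain arithmetic, not
anything DiffBind runs internally.

```r
# Hypothetical read counts at ONE genomic interval in four samples.
# 'signal' is the binding-related count; 'background' is a local
# multiplicative bias that is the same for every sample at this interval.
signal     <- c(s1 = 100, s2 = 110, s3 = 200, s4 = 210)
background <- 1.8

observed <- signal * background

# Log2 fold change of group (s3, s4) over group (s1, s2).
lfc <- function(x) log2(mean(x[c("s3", "s4")]) / mean(x[c("s1", "s2")]))

lfc_true     <- lfc(signal)
lfc_observed <- lfc(observed)   # the shared factor cancels in the ratio

# A copy number gain present in only one group does NOT cancel.
cn_factor   <- c(s1 = 1, s2 = 1, s3 = 1.5, s4 = 1.5)
lfc_with_cn <- lfc(signal * cn_factor)

print(c(true = lfc_true, observed = lfc_observed, with_cn = lfc_with_cn))
```

The first two values agree because the bias is shared across samples; the
third is inflated because the extra factor applies to one group only.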
>
>Regarding sequencing depth, this should be taken care of by the
>normalisation step. It takes the library size (either full library size,
>which is the total number of reads, or the default effective library size,
>the number of reads within peaks for each sample) and adjusts the read
>counts. You can get an idea of how this is working by using the
>dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and
>bNormalized=FALSE to see if things even out. Also, after counting, you can
>look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if samples
>are grouping by sequencing depth -- try doing the same plots with
>different scores, e.g. score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and
>score=DBA_SCORE_TMM_READS_EFFECTIVE or score=DBA_SCORE_TMM_READS_FULL, to
>see which gives the best clustering.
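
As a rough illustration of the difference between full and effective library
size, here is a minimal base R sketch. The count matrix and the full library
sizes are hypothetical, and the simple scale-to-the-smallest-library rule used
here is an assumption for illustration only; it is not DiffBind's actual
normalisation procedure.

```r
# Hypothetical counts: rows are peaks, columns are samples.
# Sample "deep" was sequenced roughly three times deeper than "shallow".
counts <- cbind(shallow = c(10, 40, 25, 5),
                deep    = c(33, 118, 80, 14))

# Full library size: total reads per sample (assumed values).
full_lib <- c(shallow = 1e6, deep = 3e6)

# Effective library size: reads that fall within peaks, per sample.
effective_lib <- colSums(counts)

# Divide each sample's counts by its library size relative to the smallest.
scale_to_min <- function(counts, lib) sweep(counts, 2, lib / min(lib), "/")

norm_full      <- scale_to_min(counts, full_lib)
norm_effective <- scale_to_min(counts, effective_lib)

print(colSums(counts))          # raw totals differ by about 3x
print(colSums(norm_full))       # closer after full-library scaling
print(colSums(norm_effective))  # equal by construction
```

If the normalised columns look comparable while the raw ones do not, depth
differences are being absorbed by the scaling, which is the same thing the
bNormalized=TRUE versus bNormalized=FALSE box plots are meant to show.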
>
>Hope this helps!
>
>Cheers-
>Rory
>
>On 23/07/2013 17:58, "Giuseppe Gallone" <giuseppe.gallone at dpag.ox.ac.uk>
>wrote:
>
>>Hi Rory
>>
>>I figured out what the problem was - the score column in my MACS2 BED
>>files was the 5th, not the 4th (which I had specified in the csv file)
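
For anyone hitting the same error, a quick way to check which column holds the
score is to read a few lines of the peak file and see which columns are
numeric. The sketch below uses made-up peak lines in the standard BED column
order (chrom, start, end, name, score), where column 4 is a name. Pointing
ScoreCol at a non-numeric column like that would be consistent with the
"not meaningful for factors" warning quoted further down.

```r
# A few hypothetical peak lines in BED layout: chrom, start, end, name, score.
bed_text <- "chr1\t1000\t1200\tpeak_1\t57
chr1\t5000\t5300\tpeak_2\t112
chr2\t800\t950\tpeak_3\t23"

peaks <- read.table(text = bed_text, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE)

# Which columns are numeric? ScoreCol has to be one of these.
numeric_cols <- which(vapply(peaks, is.numeric, logical(1)))
print(numeric_cols)

# Inspect the column types directly.
str(peaks)
```

For a real file, replacing the text= argument with the file path and adding
nrows = 100 should give the same kind of overview.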
>>
>>Having said this, do you have specific suggestions on running
>>control-less samples through DiffBind? Is there a mailing list where I
>>could learn to use the program properly / ask questions?
>>
>>A further question - I have sequence depth differences across the
>>samples. Should I manually sub-sample my biggest (in terms of read
>>depth) samples to a small common denominator before plotting
>>correlations in DiffBind - or will the software do it for me?
>>
>>Best regards
>>Giuseppe
>>
>>On 07/23/13 17:17, Rory Stark wrote:
>>> Hello Giuseppe-
>>>
>>> There shouldn't be any problem not having control reads. This looks like
>>> it could be a mismatch with the peak file format. Could you send me
>>>
>>>   * The mysamples.csv file
>>>   * The GM06986_peaks.bed.gz file (or just the first 100 lines or so)
>>>
>>> I'll take a look and let you know what the problem is.
>>>
>>> Cheers-
>>> Rory
>>>
>>> On 23/07/2013 17:11, "Giuseppe Gallone" <giuseppe.gallone at dpag.ox.ac.uk>
>>> wrote:
>>>
>>>     Dear Rory
>>>
>>>     I'm contacting you to ask your thoughts on the possibility of using
>>>     DiffBind with a ChIP-exo dataset. The dataset is composed of
>>>     transcription factor binding data for a large number of HapMap LCLs.
>>>
>>>     I have made a first attempt at using the program, but I am
>>>     experiencing some problems. I called peaks using MACS2 and also have
>>>     raw read data in .bed format. I tried to build an initial .csv file
>>>     in the following format:
>>>
>>>     SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks,PeakCaller,PeakFormat,ScoreCol,LowerBetter
>>>     GM06986,,TF,stimulated,1,d_BED/TF_GM06986_reads.bed,,GM06986_peaks.bed.gz,MACS,raw,4,F
>>>
>>>     The bamControl field is empty, as is the Tissue field - the data is
>>>     not tissue specific, and as you might be aware ChIP-exo data does not
>>>     currently come with background/input control.
>>>
>>>     This is the command I use
>>>     tfr = dba(sampleSheet="mysamples.csv")
>>>
>>>
>>>     and this is the error:
>>>     Error in if (res >= minval) { : missing value where TRUE/FALSE needed
>>>     In addition: There were 30 warnings (use warnings() to see them)
>>>
>>>     with the generic warning being:
>>>     1: In Ops.factor(peaks[, pCol], width) : / not meaningful for factors
>>>
>>>     Is the error due to the lack of control reads? Thanks for your help
>>>     & suggestions.
>>>
>>>     Giuseppe
>>>
>>
>>-- 
>>Dr Giuseppe Gallone
>>MRC career development fellow
>>MRC Functional Genomics Unit - DPAG
>>University of Oxford, UK
>