[BioC] bsseq and BSmooth with RRBS

Fri Feb 22 10:19:27 CET 2013

On 21.02.2013 17:52, Kasper Daniel Hansen wrote:
> On Thu, Feb 21, 2013 at 11:31 AM, Alex Gutteridge
> <alexg at ruggedtextile.com> wrote:
>> I'm looking over the documentation for bsseq:
>>
>> http://www.bioconductor.org/packages/2.11/bioc/html/bsseq.html
>>
>> And the associated Genome Biology paper:
>>
>> http://genomebiology.com/content/pdf/gb-2012-13-10-r83.pdf
>>
>> From what I can see the method should work on reduced representation
>> bisulfite sequencing (RRBSeq) datasets as well as WGBSeq, though the
>> documentation only mentions WGBSeq. The only issue I can see being 
>> the
>> smoothing procedure which I thought might struggle in regions with 
>> scattered
>> CpGs in the RRBSeq.
>>
>> I will have data in hand shortly to test this myself, but was just 
>> curious
>> if anyone on the list had run bsseq (successfully or otherwise) on 
>> RRBS data
>> before?
>
> I don't have any hands on experience with RRBS, but I agree with your
> assessment that in principle it should work, depending on how large
> regions are actually captured in your experiment.  Mathematically, it
> should run out of the box.  Pay close attention to the maxGap
> parameter which sets a limit for how far away two neighboring CpGs 
> can
> be before not smoothing.
>
> I have had inquires like yours about using it for RRBS data, but I
> never heard back about their experience (and as I recall it, it was
> much earlier in the development phase, so the software ought to work
> much better now).
>
> I would be happy to hear your experience.
>
> Note that currently, (some of) the statistics are really meant for 
> the
> case where you have biological replicates, but that question is 
> really
> orthogonal to the choice of assay.
>
> Best,
> Kasper

Thanks Kasper,

I will give it a go and report back. I was also looking at modelling 
this on an individual locus basis with a binomial glm. I.e. With made up 
data:

> M
[1] 10 10 10 30 40 10
> U
[1] 10 20 15  3 10  0
> y = cbind(M,U)
> group = factor(c(0,0,0,1,1,1))
> model = glm(y~group,binomial)
> summary(model)

Call:
glm(formula = y ~ group, family = binomial)

Deviance Residuals:
       1        2        3        4        5        6
  0.9036  -0.7537   0.0000   0.8569  -1.1656   1.7354

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.4055     0.2357  -1.720   0.0854 .
group1        2.2225     0.3808   5.837 5.31e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

     Null deviance: 46.8213  on 5  degrees of freedom
Residual deviance:  6.4888  on 4  degrees of freedom
AIC: 28.198

Number of Fisher Scoring iterations: 4

Where M is the number of methylated reads at a single locus across 6 
samples and U is the number of unmethylated reads at the same locus in 
the same
samples. Group is a factor giving the design of the experiment (here 
two groups both with n=3).

Is the conclusion from your work that pre-smoothing the data gives 
better performance than this naive way for detecting DMRs? If we were 
interested in testing specific CpG loci (as opposed to broader regions) 
would this approach make sense?

-- 
Alex Gutteridge