[BioC] edgeR: mixing technical replicates from Illumina HiSeq and MiSeq

Mon Sep 1 17:49:17 CEST 2014

Dear Gordon, Ryan and Nicolas,

Thank you all for the detailed advice.

I have one more question regarding the blocking factor model. In my case I
actually have 2 external factors to consider: one is the platform, the
other is the subject.

My sample matrix is the following (I've attached the CSV in case you can't
view the image):

I am only interested in comparing treatments B through D with A (the
control). So far I have never had a model with more than one external
factor. I imagine it should be OK to have more - is this correct? If so,
could you check whether I am setting up the model matrix correctly?
(Apologies if this sounds too trivial.) I imagine it should be defined as:

> Platform <- factor(targets$Platform)
> Subject  <- factor(targets$Subject)
> Treatment <- factor(targets$Treatment)
> design <- model.matrix(~Platform+Subject+Treatment)

..
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to Treatment A

Is this correct?
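
A minimal sketch of one way to check the coefficient, assuming a
hypothetical targets table (3 subjects, 4 treatments, and one subject
re-sequenced on MiSeq; the real table is larger, so the column positions
will differ). Rather than relying on a hard-coded position such as 24, the
treatment coefficient can be looked up by name in colnames(design):

```r
# Hypothetical sample sheet, for illustration only (not the real targets).
hiseq <- expand.grid(Treatment = c("A", "B", "C", "D"),
                     Subject = c("S1", "S2", "S3"),
                     stringsAsFactors = FALSE)
hiseq$Platform <- "HiSeq"

# Assume subject S1's libraries were also sequenced on MiSeq.
miseq <- hiseq[hiseq$Subject == "S1", ]
miseq$Platform <- "MiSeq"
targets <- rbind(hiseq, miseq)

Platform  <- factor(targets$Platform)
Subject   <- factor(targets$Subject)
# Make A the reference level explicitly so B, C and D are compared to it.
Treatment <- factor(targets$Treatment, levels = c("A", "B", "C", "D"))
design <- model.matrix(~ Platform + Subject + Treatment)

# Inspect the column names and find the B-vs-A coefficient by name.
print(colnames(design))
coef_B <- which(colnames(design) == "TreatmentB")

# A design matrix that is not of full rank would signal confounded factors.
full_rank <- qr(design)$rank == ncol(design)

# With edgeR loaded and a fitted model, the test could then be requested as
#   lrt <- glmLRT(fit, coef = coef_B)
```

In this toy layout the B-vs-A coefficient is column 5; in the real design
the position depends on how many platform and subject columns precede it,
which is why looking it up by name is the safer option.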

On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:

> Dear Nick,
>
> If you go back to the 2010 post that you give the URL for, you will see
> that I was giving, very briefly, the same advice about checking Poisson
> variability that Ryan has explained in greater detail.
>
> You don't give any information about read lengths, sequence depths or
> alignment methods.  I would be surprised if MiSeq and HiSeq would generate
> perfect Poisson replicates of one another, especially if the read lengths
> from the two platforms are different or the alignment and counting software
> has been varied.  So you may well end up back at the blocking idea.
>
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.statsci.org/smyth
>
> On Sun, 31 Aug 2014, Ryan wrote:
>
>> Thanks to the underlying theory behind dispersion estimation, you can
>> easily test whether your "technical replicates" really do represent
>> technical replicates. Specifically, read counts in technical replicates
>> should follow a Poisson distribution, which is a special case of the
>> negative binomial with zero dispersion. So, simply fit a model using edgeR
>> or DESeq2 with a separate coefficient for each group of technical
>> replicates. Thus all the experimental variation will be absorbed into the
>> model coefficients and the only thing left will be the technical
>> variability of the replicates. For true technical replicates, the
>> dispersion should be zero for all genes. So if you estimate dispersions
>> using this model, and plotBCV/plotDispEsts shows the dispersion very near
>> to zero, then you can be confident that you really have technical
>> replicates. If the dispersion is nonzero, then there is some additional
>> source of unaccounted-for variation.
>>
>> I have used this method on a pilot dataset with several technical
>> replicates for each condition. edgeR said the dispersion was something like
>> 10^-3 or less for all genes except for the very low-expressed genes.
>>
>> -Ryan
>>
>> On 8/28/14, 9:23 AM, Nick N wrote:
>>
>>> Hi,
>>>
>>> I have a study where a fraction of the samples have been replicated on 2
>>> Illumina platforms (HiSeq and MiSeq). These are technical replicates - the
>>> library preparation is the same using the same biological replicates - it's
>>> only the sequencing which is different.
>>>
>>> My hunch was that I should introduce the platform as an additional
>>> (blocking) factor in the analysis. Then I stumbled upon this post:
>>>
>>> https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html
>>>
>>> It recommends pooling the replicates. The post seems to apply to a
>>> different case ("pure" technical replicates, i.e. no differences in the
>>> sequencing platform used) so I probably shall ignore it. But I still feel a
>>> bit uncertain of the best way to treat the technical replicates. Can you,
>>> please, advise me on this?
>>>
>>> many thanks!
>>> Nick
>>>
>>
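
The Poisson check that Ryan describes can be illustrated with a small
simulation. This is only a sketch under stated assumptions: the counts are
simulated, library sizes are taken to be equal, and a simple moment
estimate of the dispersion (from var = mu + phi * mu^2) stands in for the
edgeR dispersion estimate that would be inspected with plotBCV in a real
analysis. The function name moment_dispersion is hypothetical.

```r
set.seed(1)
n_genes <- 2000
n_reps  <- 4
# Assumed per-gene expected counts; low-count genes are left out on purpose.
mu <- runif(n_genes, min = 50, max = 1000)

# Per-gene moment estimate of the negative binomial dispersion phi,
# using var = mu + phi * mu^2, so phi = (var - mean) / mean^2.
moment_dispersion <- function(counts) {
  m <- rowMeans(counts)
  v <- apply(counts, 1, var)
  (v - m) / m^2
}

# True technical replicates: Poisson counts around each gene's mean.
pois_counts <- matrix(rpois(n_genes * n_reps, lambda = rep(mu, n_reps)),
                      nrow = n_genes)

# Replicates with extra variation: negative binomial, dispersion 0.1.
nb_counts <- matrix(rnbinom(n_genes * n_reps, size = 1 / 0.1,
                            mu = rep(mu, n_reps)),
                    nrow = n_genes)

disp_pois <- median(moment_dispersion(pois_counts))
disp_nb   <- median(moment_dispersion(nb_counts))
print(c(poisson = disp_pois, negative_binomial = disp_nb))
```

For the Poisson replicates the typical estimate sits close to zero, while
the overdispersed set gives a clearly positive value. A clearly nonzero
dispersion among HiSeq/MiSeq replicates would point to extra variation
between the platforms, which is the case where blocking on platform is
preferable to pooling.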