[BioC] edgeR: mixing technical replicates from Illumina HiSeq and MiSeq
Nikolay N.
nikolay12 at gmail.com
Fri Sep 5 23:31:14 CEST 2014
Thanks for all the advice so far. I tried that route but I am getting
> y <- estimateGLMCommonDisp(d, design)
>
> Error in glmFit.default(y, design = design, dispersion = dispersion,
>> offset = offset, :
>
> Design matrix not of full rank. The following coefficients not
>> estimable:
>
> TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF
>
>
For the sake of completeness here is my code:
targets <- readTargets()
>
> d <- readDGE(targets$FileName)
>
> keep <- rowSums(cpm(d)>1) >=3
>
> d <- d[keep,]
>
> d$samples$lib.size <- colSums(d$counts)
>
> d <- calcNormFactors(d)
>
> all_cpm=cpm(d, normalized.lib.size=TRUE)
>
> all_counts <- cbind(rownames(all_cpm), all_cpm)
>
> colnames(all_counts)[1] <- "Ensembl.Gene.ID"
>
> rownames(all_counts) <- NULL
>
> Subject <- factor(targets$Subject)
>
> Platform <- factor(targets$Platform)
>
> Treatment <- factor(targets$Treatment)
>
> design <- model.matrix(~Platform+Subject+Treatment)
>
> y <- estimateGLMCommonDisp(d, design)
>
>
The design matrix looks OK to me:
> design
>
> (Intercept) PlatformMiSeq Subjecta8 Subjectb6 Subjectb7 Subjectb8
>> Subjectc5 Subjectc6 Subjectc7 Subjectc8 Subjectd5 Subjectd6 Subjectd7
>> Subjectd8 Subjecte5 Subjecte6
>
> 1 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 2 1 1 1 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 3 1 1 0 1 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 4 1 1 0 0 1 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 5 1 1 0 0 0 1
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 6 1 1 0 0 0 0
>> 1 0 0 0 0 0 0 0
>> 0 0
>
> 7 1 1 0 0 0 0
>> 0 1 0 0 0 0 0 0
>> 0 0
>
> 8 1 1 0 0 0 0
>> 0 0 1 0 0 0 0 0
>> 0 0
>
> 9 1 1 0 0 0 0
>> 0 0 0 1 0 0 0 0
>> 0 0
>
> 10 1 1 0 0 0 0
>> 0 0 0 0 1 0 0 0
>> 0 0
>
> 11 1 1 0 0 0 0
>> 0 0 0 0 0 1 0 0
>> 0 0
>
> 12 1 1 0 0 0 0
>> 0 0 0 0 0 0 1 0
>> 0 0
>
> 13 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 1
>> 0 0
>
> 14 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 1 0
>
> 15 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 1
>
> 16 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 17 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 18 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 19 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 20 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 21 1 1 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0
>
> 22 1 0 0 0 0 0
>> 1 0 0 0 0 0 0 0
>> 0 0
>
> 23 1 0 0 0 0 0
>> 0 1 0 0 0 0 0 0
>> 0 0
>
> 24 1 0 0 0 0 0
>> 0 0 1 0 0 0 0 0
>> 0 0
>
> 25 1 0 0 0 0 0
>> 0 0 0 1 0 0 0 0
>> 0 0
>
> 26 1 0 0 0 0 0
>> 0 0 0 0 1 0 0 0
>> 0 0
>
> 27 1 0 0 0 0 0
>> 0 0 0 0 0 1 0 0
>> 0 0
>
> 28 1 0 0 0 0 0
>> 0 0 0 0 0 0 1 0
>> 0 0
>
> 29 1 0 0 0 0 0
>> 0 0 0 0 0 0 0 1
>> 0 0
>
> Subjecte7 Subjecte8 Subjectf5 Subjectf6 Subjectf7 Subjectf8 TreatmentB
>> TreatmentC TreatmentD TreatmentE TreatmentF
>
> 1 0 0 0 0 0 0 0
>> 0 0 0 0
>
> 2 0 0 0 0 0 0 0
>> 0 0 0 0
>
> 3 0 0 0 0 0 0 1
>> 0 0 0 0
>
> 4 0 0 0 0 0 0 1
>> 0 0 0 0
>
> 5 0 0 0 0 0 0 1
>> 0 0 0 0
>
> 6 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 7 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 8 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 9 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 10 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 11 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 12 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 13 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 14 0 0 0 0 0 0 0
>> 0 0 1 0
>
> 15 0 0 0 0 0 0 0
>> 0 0 1 0
>
> 16 1 0 0 0 0 0 0
>> 0 0 1 0
>
> 17 0 1 0 0 0 0 0
>> 0 0 1 0
>
> 18 0 0 1 0 0 0 0
>> 0 0 0 1
>
> 19 0 0 0 1 0 0 0
>> 0 0 0 1
>
> 20 0 0 0 0 1 0 0
>> 0 0 0 1
>
> 21 0 0 0 0 0 1 0
>> 0 0 0 1
>
> 22 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 23 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 24 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 25 0 0 0 0 0 0 0
>> 1 0 0 0
>
> 26 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 27 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 28 0 0 0 0 0 0 0
>> 0 1 0 0
>
> 29 0 0 0 0 0 0 0
>> 0 1 0 0
>
>
I'm also attaching the target file in case it adds any info.
Can you, perhaps, give any advice on this?
ps. I am not sure whether I should continue submitting to this thread or
open up a new one as this problem may not be related to the original
question I asked.
On Wed, Sep 3, 2014 at 12:12 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> Yes, that's all that is needed.
>
> In general, one corrects for a batch effect by fitting
>
> ~ batch + otherterms
>
> where "otherterms" is what the model would have been without the batch
> effect.
>
> Best wishes
> Gordon
>
>
> On Mon, 1 Sep 2014, Nick N wrote:
>
> Dea Gordon, Ryan and Nicolas,
>>
>> Than you all for the detailed advice.
>>
>> I have one more question regarding the blocking factor model. In my case I
>> have, actually, 2 external factors to consider - one is the platform, the
>> other one are the subjects.
>>
>> My sample matrix is the following (I've attached the CSV in case you can't
>> view the image):
>>
>>
>>
>>
>>
>> I am only interested in comparing treatments B:D to A (the latter are
>> controls). So far I've never had a model with more than one external
>> factor. I imagine it should be OK to have more - is this correct? If yes -
>> can you, perhaps, check whether I am setting the model matrix correctly?
>> (Apologies if this sounds too trivial) I imagine it shall be defined as:
>>
>> Platform <- factor(targets$Platform)
>>
>>> Subject <- factor(targets$Subject)
>>> Treatment <- factor(targets$Treatment)
>>> design <- model.matrix(~Platform+Subject+Treatment)
>>>
>>
>> ..
>>
>>> fit <- glmFit(y, design)
>>> lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to Treatment A
>>>
>>
>>
>> Is this correct?
>>
>>
>>
>>
>> On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> wrote:
>>
>> Dear Nick,
>>>
>>> If you go back to the post from 2010 that you give the URL for, you will
>>> see that I was giving very briefly the same advice about checking Poisson
>>> variability that Ryan has explained at greater detail.
>>>
>>> You don't give any information about read lengths, sequence depths or
>>> alignment methods. I would be surprised if MiSeq and HiSeq would
>>> generate
>>> perfect Poisson replicates of one another, especially if the read lengths
>>> from the two platform are different or the alignment and counting
>>> software
>>> has been varied. So you may well end up back at the blocking idea.
>>>
>>>
>>> Best wishes
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> http://www.statsci.org/smyth
>>>
>>> On Sun, 31 Aug 2014, Ryan wrote:
>>>
>>> Thanks to the underlying theory behind dispersion estimation, you can
>>>
>>>> easily test whether your "technical replicates" really do represent
>>>> technical replicates. Specifically, read counts in technical replicates
>>>> should follow a Poisson distribution, which is a special case of the
>>>> negative binomial with zero dispersion. So, simply fit a model using
>>>> edgeR
>>>> or DESeq2 with a separate coefficient for each group of technical
>>>> replicates. Thus all the experimental variation will be absorbed into
>>>> the
>>>> model coefficients and the only thing left will be the technical
>>>> variability of of the replicates. For true technical replicates, the
>>>> dispersion should be zero for all genes. So if you estimate dispersions
>>>> using this model, and plotBCV/plotDispEsts shows the dispersion very
>>>> near
>>>> to zero, then you can be confident that you really have technical
>>>> replicates. If the dispersion is nonzero, then there is some additional
>>>> source of unaccounted-for variation.
>>>>
>>>> I have used this method on a pilot dataset with several technical
>>>> replicates for each condition. edgeR said the dispersion was something
>>>> like
>>>> 10^-3 or less for all genes except for the very low-expressed genes.
>>>>
>>>> -Ryan
>>>>
>>>> On 8/28/14, 9:23 AM, Nick N wrote:
>>>>
>>>> Hi,
>>>>>
>>>>> I have a study where a fraction of the samples have been replicated on
>>>>> 2
>>>>> Illumina platforms (HiSeq and Miseq). These are technical replicates -
>>>>> the
>>>>> library preparation is the same using the same biological replicates -
>>>>> it's
>>>>> only the sequencing which is different.
>>>>>
>>>>> My hunch was that I shall introduce the platform as as an additional
>>>>> (blocking) factor in the analysis. Than I stumbled upon this post:
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html
>>>>>
>>>>> It recommends pooling the replicates. The post seems to apply to a
>>>>> different case ("pure" technical replicates, i.e. no differences in the
>>>>> sequencing platform used) so I probably shall ignore it. But I still
>>>>> feel a
>>>>> bit uncertain of the best way to treat the technical replicates. Can
>>>>> you,
>>>>> please, advise me on this?
>>>>>
>>>>> many thanks!
>>>>> Nick
>>>>>
>>>>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the
> addressee.
> You must not disclose, forward, print or use it without the permission of
> the sender.
> ______________________________________________________________________
>
-------------- next part --------------
FileName Platform Subject Treatment
../A7/accepted_hits_clean.count MiSeq a7 A
../A8/accepted_hits_clean.count MiSeq a8 A
../B6/accepted_hits_clean.count MiSeq b6 B
../B7/accepted_hits_clean.count MiSeq b7 B
../B8/accepted_hits_clean.count MiSeq b8 B
../C5/accepted_hits_clean.count MiSeq c5 C
../C6/accepted_hits_clean.count MiSeq c6 C
../C7/accepted_hits_clean.count MiSeq c7 C
../C8/accepted_hits_clean.count MiSeq c8 C
../D5/accepted_hits_clean.count MiSeq d5 D
../D6/accepted_hits_clean.count MiSeq d6 D
../D7/accepted_hits_clean.count MiSeq d7 D
../D8/accepted_hits_clean.count MiSeq d8 D
../E5/accepted_hits_clean.count MiSeq e5 E
../E6/accepted_hits_clean.count MiSeq e6 E
../E7/accepted_hits_clean.count MiSeq e7 E
../E8/accepted_hits_clean.count MiSeq e8 E
../F5/accepted_hits_clean.count MiSeq f5 F
../F6/accepted_hits_clean.count MiSeq f6 F
../F7/accepted_hits_clean.count MiSeq f7 F
../F8/accepted_hits_clean.count MiSeq f8 F
../SC5/accepted_hits_clean.count HiSeq c5 C
../SC6/accepted_hits_clean.count HiSeq c6 C
../SC7/accepted_hits_clean.count HiSeq c7 C
../SC8/accepted_hits_clean.count HiSeq c8 C
../SD5/accepted_hits_clean.count HiSeq d5 D
../SD6/accepted_hits_clean.count HiSeq d6 D
../SD7/accepted_hits_clean.count HiSeq d7 D
../SD8/accepted_hits_clean.count HiSeq d8 D
More information about the Bioconductor
mailing list