[BioC] DESEq | Batch effect | VST data and linear model adjustment

Simon Anders anders at embl.de
Wed Jul 23 14:03:56 CEST 2014


Hi Francesco

On 23/07/14 13:39, Pesce, Francesco wrote:
> We have two cohorts, cases and controls and a set of covariates for both of them ( center, library.prep.date, age, rna.rin.score, sex ).
> Center and library.prep.date are collinear with the status (all the cases were collected in London while the controls
> were collected in 4 different centers worldwide) so I used the first principal component of these two covariates and ran DESeq2 using this design:
> ~ PC1 + age + rna.rin.score + sex + status
>
> Unfortunately it looks like the batch effect is too strong and I have ~16K genes with adjP<0.05
> One question: is the fold-change still reliable  (So that I can use it as rank for GSEA analysie for example) ?
> Now, although the differential expression might be hampered by the study design and I don’t know if I can use these results (what do you think?)

That your status (control/disease) is aliased with the sequencing centre 
is a severe problem. If you find a gene to differ between control and 
disease samples, you have no way of knowing whether this difference is 
due to the status or due to specific peculiarities in the library 
handling of the centres. If there are any differences in the way how the 
London centre handles its samples, compared to the other centres, and 
these differences are substantial enough to cause genes to appear 
differentially expressed, then this will completely overwhelm any signal.

Unless you have taken special precautions, this is quite likely to 
happen, and the fact that you got so much stronger differences than 
expected, indicated that it likely did happen.

So, no, I don't think you can use any results from this study.

I am very puzzled about your regression on PC1. Could you explain? Maybe 
I misunderstood you, but I don't see how PCA analysis can cure an 
aliasing issue. I'm afraid your main problem is that you seem to be 
under the wrong impression that it were possible, at least in principle, 
to tease apart the effect of status and center.

> the main problem is the following:
>
> I have based all the analyses for my PhD thesis and the manuscript I am preparing using DESeq.
> The pipeline is based on co-expression clustering (WGCNA), diffco-ex between cases and controls and GWAS hits enrichment in these clusters.
>
> For the pre-processing of these analyses I’ve first obtained the VST data and then adjusted these for the covariates using a linear model. Then I used the residuals for the analyses:
>
>> vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
>> vstMat <- assay(vsd)
>> lm=lm(vstMat ~ as.factor(info$library.prep.date) + as.numeric(info$age) + as.factor(info$sex) + as.numeric(info$rna.rin.score) + as.factor(info$center))
>> data = residuals(lm)
>
> The main question is that we are not sure if this pre-processing is correct, does the linear model work here for this purpose on VST data ?
> (The doubt came by the fact that a fold-change for one gene of interest suggested it was strongly up-regulated in cases, but when boxplotting the residuals from the linear model adjusted for all the covariates you don’t see any difference at all…)

If you regress out the effect of center, you also regress out the effect 
of status, because it is aliased. Hence, your clustering on the 
residuals can pick up all kinds of effects, but they will be almost 
guaranteed to have nothing to do with the disease/control difference.

So, of course, you don't see any differences between case and control 
after you regress out center.

Apart from that: We now recommend the rlog transformation in DESeq2 
instead of the VST from DESeq. It is more reliable in case of varying 
library sizes.

I'm not familiar with WGCNA, so I cannot comment on how well it works 
with transformed RNA-Seq data. In principle, it should work, I suppose.

But, of course, it won't magically cure your aliasing issue either.

   Simon



More information about the Bioconductor mailing list