[BioC] deseq2 for pair-wise analysis

Michael Love michaelisaiahlove at gmail.com
Tue Aug 12 21:14:09 CEST 2014

hi Subbaiah,

On Tue, Aug 12, 2014 at 10:33 AM, Subbaiah Chalivendra [guest]
<guest at bioconductor.org> wrote:
> I have just started analyzing my transcriptome assembly data by deseq2. I am finding that my two control samples do not coincide very well. This may be affecting the number of differentially expressed genes if I use the average  of two controls to compare with the treated. Is there a way I can minimize the effects of biological variation on the analysis?
> The overall counts are almost the same in all four files (~59-65 million, with an average of 62 million and SD of 3 million). C1 and C2 (untreated) are two different batches of experimental material and one half of them are dosed (D1 and D2; treated). They are kind of paired that way, although they are distinct set of individuals between C1 & D1 or C2 & D2. We actually started with 3 reps but decided to pool the 2nd and 3rd reps to have sufficient RNA for follow up qPCR assays.
> For some reason (could be biological differences between batches), the counts in C2 are 3-5 fold high compared to C1 for only some transcripts. There is a faintly similar trend between D2 and D1, although the treatment has wiped off this difference in many of those transcripts.
> I am wondering if I can use the difference between the pairs (C1&D1 and C2&D2), at least some of the biological variability may be masked allowing us to see the effects of the treatment more clearly on the transcriptome.
> I find from the FAQs that there is a way to do pair-wise analysis in deseq2. I appreciate if you can elaborate the answer a little more.

Do you have a more specific question? We have in the vignette:

"FAQ: Can I use DESeq2 to analyze paired samples? Yes, you should use
a multi-factor design which includes the sample information as a term
in the design formula. This will account for differences between the
samples while estimating the effect due to the condition. The
condition of interest should go at the end of the design formula. See
Section 1.5."

so you would have a design of ~ sample + treatment

where the column data should look like

sample treatment
1 C
2 C
1 D
2 D

> I have also another question related to this study. We actually used another closely related species for the study. Since this is small in size, we had to pool all three reps into one.  Transcripts are >75% identical at the DNA sequence level and the gene expression changes are similar between the species in response to the treatment. I am thinking of combining this data as a third rep of the study for deseq2 analysis, instead of throwing out the data. I appreciate to have your input about this step as well.

I wouldn't recommend adding this sample from the different species,
because the differences in counts due to the different transcripts of
the other species would throw off the analysis (whether you align the
reads to the genome/transcriptome of that species or to the
genome/transcriptome of the species with the 4 samples you are already
using). Basically, the feature-level effects on counts would not be
expected to cancel out across samples.


> I am also attaching the code I used (thanks to Dave Wheeler).
> Thank you very much for your help.
> Subbaiah Chalivendra
>  -- output of sessionInfo():
> R version 3.1.1 (2014-07-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [5] LC_TIME=English_United States.1252
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> loaded via a namespace (and not attached):
> [1] BiocGenerics_0.10.0  GenomeInfoDb_1.0.2   GenomicRanges_1.16.3
> [4] IRanges_1.22.9       parallel_3.1.1       stats4_3.1.1
> [7] XVector_0.4.0
> --
> Sent via the guest posting facility at bioconductor.org.

More information about the Bioconductor mailing list