[BioC] EdgeR: replicated pools, yes or not?

Thu Apr 24 12:46:22 CEST 2014

Ryan <rct at ...> writes:

> 
> Don't pool. You are throwing away information. If you're going to do 24 
> animals, you may as well use 24 barcodes. To see that a separate 
> barcode for each animal provides strictly more information than 
> pooling, note that once you have used separate barcodes, you could add 
> the counts together to do in silico pooling and get the same result as 
> if you had done pooling in vitro. In other words, you can get from 
> separate barcodes to pooling by throwing away information.
> 
> For a literature reference, try "Efficient experimental design and 
> analysis strategies for the detection of differential expression using 
> RNA-Sequencing." http://www.ncbi.nlm.nih.gov/pubmed/22985019
> 
> That publication doesn't directly address the issue of pooling multiple 
> biological samples in the same barcode, but it does make clear that 
> more biological replication results in a drastic improvement in 
> results. You could simulate your described pooling scheme yourself: 
> simply simulate 24 libraries in 2 groups with some number of true 
> differentially expressed genes between them. Then pool them 3 at a time 
> (by adding their counts together) to get the pooled dataset of 8 pooled 
> libraries in 2 groups. Then perform the analysis on both datasets using 
> your preferred tool and compute the ROC curve. I think you will find 
> that pooling significantly diminishes your power to detect differential 
> expression.
> 
> -Ryan Thompson
> 
> On Wed Apr 23 09:42:15 2014, "Manuel J Gómez [guest]" wrote:
> >
> > Hello,
> >
> > I would like to ask for your opinion on whether using replicated pools
> > in the context of RNASeq experiments makes sense, or not.
> >
> > Let's say that we are interested in detecting genes that are
> > differentially expressed between two genetic backgrounds (a certain KO
> > mutant strain and the corresponding WT), in mouse liver.
> >
> > We could perform an RNASeq experiment using liver tissue from four KO
> > and four WT with the same sex, age, and diet.
> >
> > We would have eight samples: four biological replicates for each of the
> > two conditions to be compared.
> >
> > However, we decide to pool liver tissue from three animals, to prepare
> > each of the eight samples (we would use, therefore 24 animals: 12 KO
> > animals pooled to produce four KO samples, and 12 WT animals pooled to
> > produce four WT samples).
> >
> > We would do it following the argument that pooling samples to build
> > biological replicates reduces variation between replicates and increases
> > the statistical power of the analysis, resulting in a more sensitive
> > detection of genes that are differentially expressed between conditions.
> >
> > However, EdgeR relies, precisely, on measuring biological variability to
> > establish the statistical significance of differences in gene expression
> > across conditions. Therefore, pooling samples to build biological
> > replicates is not correct and we are, in fact, losing statistical power.
> > We are unable to determine whether the observed differences in gene
> > expression are significant or not.
> >
> > There are some publications dealing with this issue in the context of
> > microarrays (for example, Kendziorski et al, 2005, "On the utility of
> > pooling biological samples in microarray experiments", PNAS, 102:4252)
> > but I have not found anything similar in the context of RNASeq and, more
> > specifically, of the analysis of RNASeq data with EdgeR.
> >
> > Any comment will be more than welcome, as well as any relevant references.
> >
> > Thanks a lot in advance.
> >
> >   -- output of sessionInfo():
> >
> > NA
> >
> > --
> > Sent via the guest posting facility at bioconductor.org.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor <at> r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor

Dear Ryan,

Thanks a lot for your answer.

I fully understand that using 12 replicates for each condition is more
informative than using 4.
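
For reference, the in silico pooling step from the quoted reply (adding the
counts of three libraries at a time, turning 24 libraries into 8) might be
sketched as below. This is only an illustration under assumptions of my own:
the gamma-Poisson count model, the 50 genes, the dispersion of 0.1, the
doubled expression of the first 5 genes, and all function names are
hypothetical choices, not taken from EdgeR or from the cited paper. The
differential expression analysis and the ROC comparison are left out.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_library(rng, gene_means, dispersion):
    """One animal: gamma-Poisson (negative binomial) counts, one per gene."""
    shape = 1.0 / dispersion
    return [poisson(rng, rng.gammavariate(shape, m * dispersion))
            for m in gene_means]


def pool_in_silico(libraries, pool_size):
    """Sum the counts of consecutive groups of pool_size libraries, per gene."""
    return [
        [sum(gene_counts) for gene_counts in zip(*libraries[i:i + pool_size])]
        for i in range(0, len(libraries), pool_size)
    ]


rng = random.Random(1)
n_genes = 50  # assumed number of genes, kept small for speed
base_means = [rng.uniform(5, 30) for _ in range(n_genes)]
# Hypothetical effect: the first 5 genes are doubled in the KO group.
ko_means = [m * 2 if g < 5 else m for g, m in enumerate(base_means)]

wt = [simulate_library(rng, base_means, 0.1) for _ in range(12)]
ko = [simulate_library(rng, ko_means, 0.1) for _ in range(12)]

# Pool 3 animals at a time: 12 libraries per group become 4 pooled libraries.
wt_pooled = pool_in_silico(wt, 3)
ko_pooled = pool_in_silico(ko, 3)
print(len(wt + ko), "separate libraries ->",
      len(wt_pooled + ko_pooled), "pooled libraries")
```

The pooled tables can always be computed from the separate ones, but not the
other way round, which is the sense in which pooling discards information.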

However, assuming that my budget allows me to sequence only a limited number
of samples at a decent coverage (for example, 8 samples at 10 million reads
per sample), which of the following would be the preferred solution?

a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n"
animals (with the same genotype, obviously).

I am pretty sure that if the only difference between the two types of
animal (or condition) is a specific mutation, solution (a) would be THE
correct solution because it would imply using true, independent biological
replicates. Solution (b) would not just be less correct, but blatantly
incorrect, because it would eliminate biological variation between
replicates (especially if "n" is high), and an estimate of that variation
is necessary to establish the significance of the differences observed
between conditions.
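
To put a rough number on how much variation between replicates is lost in
(b), here is a minimal sketch. It rests on assumptions that are mine and not
established anywhere in this thread: the expression of a gene in a pool is
taken to be the arithmetic mean over the "n" animals in it, animals are
independent and normally distributed, and the mean of 100 and standard
deviation of 20 are arbitrary. It does not model RNASeq counts or anything
EdgeR does internally.

```python
import random
import statistics


def pool_variance(rng, n_animals, n_pools, mean=100.0, sd=20.0):
    """Variance between pools when each pool averages n_animals animals."""
    pools = [
        statistics.fmean(rng.gauss(mean, sd) for _ in range(n_animals))
        for _ in range(n_pools)
    ]
    return statistics.variance(pools)


rng = random.Random(42)
# Many pools per setting, so that the variance estimates are stable.
results = {n: pool_variance(rng, n, 20000) for n in (1, 3, 10)}
for n, var in results.items():
    print(f"n = {n:2d}: between-pool variance = {var:6.1f} "
          f"(sd^2 / n = {400.0 / n:6.1f})")
```

Under these assumptions the variance between pools shrinks roughly as 1/n,
so what is measured between pooled replicates is the variation between pool
averages, not the variation between individual animals.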

I realize that I am answering my own question, but I keep finding examples
in which pooling (in the sense that I describe above) is not completely
discouraged. For example, Churchill (in "Fundamentals of experimental design
for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a
two-sample comparison, we could consider making two large pools of all
available units and measuring each pool multiple times. This is a poor
design, as it does not allow estimation of the between-pool variance. By
pooling all the available samples together we have minimized the biological
variance, but we have also eliminated all independent replication. It is
better to use several pools and fewer technical replicates". Why does he
write that it is better to use several pools? Wouldn't it be better to use
no pools at all?

Similarly, a discussion in which pooling is not completely discouraged can
be found in:

http://seqanswers.com/forums/showthread.php?t=27905

Finally, pooling samples is often justified because of limited availability
of RNA. In those cases pooling is obviously unavoidable. But if replicates
have been constructed by pooling RNA from many tiny individual samples,
shouldn't we keep in mind that we have lost all information regarding
biological variance, and that we will not be able to assess the significance
of any differences observed between conditions?

- Manuel J Gómez