[BioC] EdgeR: replicated pools, yes or not?

Manuel José Gómez Rodríguez manueljose.gomez at cnic.es
Thu Apr 24 12:35:51 CEST 2014


Dear Ryan,

Thanks a lot for your answer.

I perfectly understand that using 12 replicates for each condition is more informative than using 4.

However, assuming that my budget allows me to sequence only a limited number of samples at a decent coverage (for example, 8 samples at 10 million reads per sample), which of the following would be the preferred solution?

a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n" animals (with the same genotype, obviously).

I am fairly sure that if the only difference between the two types of animal (or conditions) is a specific mutation, solution (a) would be THE correct solution, because it would use truly independent biological replicates. Solution (b) would be not just less correct but blatantly incorrect, because it would eliminate biological variation between replicates (especially if "n" is high), and an estimate of that variation is necessary to establish the significance of the differences observed between conditions.

I acknowledge that I am answering myself, but I keep finding examples in which pooling (in the sense that I am describing above) is not completely discouraged. For example, Churchill (in "Fundamentals of experimental design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates". Why does he write that it is better to use several pools? Wouldn't it be better to use no pools at all?

Similarly, a discussion in which pooling is not completely discouraged can be found in:

http://seqanswers.com/forums/showthread.php?t=27905

Finally, pooling samples is often justified by the limited availability of RNA. In those cases pooling is obviously mandatory. But if replicates have been constructed by pooling RNA from many tiny individual samples, shouldn't we keep in mind that we have lost all information regarding biological variance, and that we will not be able to assess the significance of any differences observed between conditions?

Manuel J Gómez
________________________________________
From: Ryan [rct at thompsonclan.org]
Sent: Wednesday, April 23, 2014 7:06 PM
To: "Manuel J Gómez [guest]"
Cc: bioconductor at r-project.org; Manuel José Gómez Rodríguez
Subject: Re: [BioC] EdgeR: replicated pools, yes or not?

Don't pool. You are throwing away information. If you're going to do 24
animals, you may as well use 24 barcodes. To see that a separate
barcode for each animal provides strictly more information than
pooling, note that once you have used separate barcodes, you could add
the counts together to do in silico pooling and get the same result as
if you had done pooling in vitro. In other words, you can get from
separate barcodes to pooling by throwing away information.
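
The in silico pooling described above can be sketched in a few lines. This is a minimal illustration, not part of the original message: the gene names and counts are made up, `pool_in_silico` is a hypothetical helper, and it assumes every animal contributes equally to its pool.

```python
# Hypothetical per-animal read counts for one genotype (all numbers invented).
# Each list holds one count per individually barcoded animal.
counts = {
    "geneA": [10, 14, 9, 30, 28, 35],
    "geneB": [200, 180, 220, 190, 210, 205],
}

def pool_in_silico(per_animal, pool_size):
    """Sum the counts of consecutive animals in groups of `pool_size`."""
    if len(per_animal) % pool_size != 0:
        raise ValueError("number of animals must be a multiple of pool_size")
    return [sum(per_animal[i:i + pool_size])
            for i in range(0, len(per_animal), pool_size)]

# Six individually barcoded animals become two pools of three.
pooled = {gene: pool_in_silico(c, 3) for gene, c in counts.items()}
print(pooled)
```

Going from `counts` to `pooled` is a one-way operation: the per-animal values cannot be recovered from the sums, which is the sense in which pooling discards information.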

For a literature reference, try "Efficient experimental design and
analysis strategies for the detection of differential expression using
RNA-Sequencing." http://www.ncbi.nlm.nih.gov/pubmed/22985019

That publication doesn't directly address the issue of pooling multiple
biological samples in the same barcode, but it does make clear that
more biological replication results in a drastic improvement in
results. You could simulate your described pooling scheme yourself:
simply simulate 24 libraries in 2 groups with some number of true
differentially expressed genes between them. Then pool them 3 at a time
(by adding their counts together) to get the pooled dataset of 8 pooled
libraries in 2 groups. Then perform the analysis on both datasets using
your preferred tool and compute the ROC curve. I think you will find
that pooling significantly diminishes your power to detect differential
expression.
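
A rough sketch of that simulation, using only the Python standard library, might look like the following. Everything numeric here is an assumption chosen for illustration (300 genes, 10% truly differentially expressed, a 2-fold change, negative binomial counts with mean 50 and dispersion 0.2). The scoring function is a plain Welch t-statistic on log counts, used as a stand-in for "your preferred tool"; it is not edgeR, and the ROC curve is summarized by its area (AUC).

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

N_GENES, N_DE = 300, 30        # assumed: 10% of genes are truly DE
N_PER_GROUP, POOL = 12, 3      # 12 animals per genotype, pooled 3 at a time
MEAN, DISPERSION, FOLD = 50.0, 0.2, 2.0  # assumed count-model parameters

def rpois(lam):
    """Poisson draw by Knuth's method (adequate for the small means used here)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def rnbinom(mu, disp):
    """Negative binomial draw as a gamma-Poisson mixture (var = mu + disp * mu^2)."""
    return rpois(random.gammavariate(1.0 / disp, mu * disp))

def simulate_gene(is_de):
    """Return (WT counts, KO counts) for one gene across individual animals."""
    wt = [rnbinom(MEAN, DISPERSION) for _ in range(N_PER_GROUP)]
    ko_mean = MEAN * FOLD if is_de else MEAN
    ko = [rnbinom(ko_mean, DISPERSION) for _ in range(N_PER_GROUP)]
    return wt, ko

def pool(counts, size):
    """In silico pooling: add counts of consecutive animals together."""
    return [sum(counts[i:i + size]) for i in range(0, len(counts), size)]

def score(wt, ko):
    """Absolute Welch t-statistic on log2(count + 1); a stand-in, not edgeR."""
    a = [math.log2(x + 1) for x in wt]
    b = [math.log2(x + 1) for x in ko]
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    # Degenerate case (no variation in either group): score the gene as 0.
    return abs(statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0

def auc(scores, labels):
    """Area under the ROC curve via pairwise (Mann-Whitney) comparison."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [g < N_DE for g in range(N_GENES)]  # first N_DE genes are truly DE
separate_scores, pooled_scores = [], []
for is_de in labels:
    wt, ko = simulate_gene(is_de)
    separate_scores.append(score(wt, ko))                          # 24 libraries
    pooled_scores.append(score(pool(wt, POOL), pool(ko, POOL)))    # 8 libraries

auc_separate = auc(separate_scores, labels)
auc_pooled = auc(pooled_scores, labels)
print(f"AUC, 24 separate libraries: {auc_separate:.3f}")
print(f"AUC, 8 pooled libraries:    {auc_pooled:.3f}")
```

The size of the gap between the two AUC values depends on the assumed dispersion, fold change and number of genes, so it is worth rerunning with several seeds and parameter settings, and ideally with the actual analysis tool, before drawing conclusions.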

-Ryan Thompson

On Wed Apr 23 09:42:15 2014, "Manuel J Gómez [guest]" wrote:
>
> Hello,
>
> I would like to ask for your opinion on whether using replicated pools in the context of RNASeq experiments makes sense, or not.
>
> Let's say that we are interested in detecting genes that are differentially expressed in two genetic backgrounds (a certain KO mutant strain and the corresponding WT), in mouse liver.
>
> We could perform an RNASeq experiment using liver tissue from four KO and four WT with the same sex, age, and diet.
>
> We would have eight samples: four biological replicates for each of the two conditions to be compared.
>
> However, we decide to pool liver tissue from three animals, to prepare each of the eight samples (we would use, therefore 24 animals: 12 KO animals pooled to produce four KO samples, and 12 WT animals pooled to produce four WT samples).
>
> We would do it following the argument that pooling samples to build biological replicates reduces variation between replicates and increases the statistical power of the analysis, resulting in a more sensitive detection of genes that are differentially expressed between conditions.
>
> However, EdgeR relies precisely on measuring biological variability to establish the statistical significance of differences in gene expression across conditions. Therefore, pooling samples to build biological replicates is not correct, and we are in fact losing statistical power: we are unable to determine whether the observed differences in gene expression are significant or not.
>
> There are some publications dealing with this issue in the context of microarrays (for example, Kendziorski et al, 2005, "On the utility of pooling biological samples in microarray experiments", PNAS, 102:4252) but I have not found anything similar in the context of RNASeq and, more specifically, of the analysis of RNASeq data with EdgeR.
>
> Any comment will be more than welcome, as well as any relevant references.
>
> Thanks a lot in advance.
