[BioC] EdgeR: replicated pools, yes or not?

Fri Apr 25 23:27:04 CEST 2014

Hi Cei,

Yes, that is a good point. If your dominant cost is per sample and not 
per lane of sequencing, then you are back to the same situation as in 
microarrays, where you want to minimize the number of samples required 
to achieve a given level of significance. My other email gives my best 
attempt to address the question of pools vs individuals with the same 
number of samples in each case.
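
To make the comparison at a fixed number of libraries concrete, here is a minimal simulation sketch, not an analysis of real data. It assumes that biological and technical variation are additive and normally distributed on the log scale, and that each individual contributes equally to its pool, so that pooling simply averages the biological component. The function names and the values of sd_bio and sd_tech are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_library_values(n_libraries, pool_size, sd_bio, sd_tech, rng):
    """Return one simulated log-expression value per library.

    Each library is built from `pool_size` individuals (pool_size=1 means
    an unpooled sample). Assumption: the pool's biological signal is the
    mean of its individuals, plus independent technical noise per library.
    """
    values = []
    for _ in range(n_libraries):
        individuals = [rng.gauss(0.0, sd_bio) for _ in range(pool_size)]
        pool_mean = sum(individuals) / pool_size
        values.append(pool_mean + rng.gauss(0.0, sd_tech))
    return values


def expected_variance(pool_size, sd_bio, sd_tech):
    """Between-library variance under the assumptions above."""
    return sd_bio ** 2 / pool_size + sd_tech ** 2


rng = random.Random(1)
sd_bio, sd_tech = 0.4, 0.1  # hypothetical standard deviations

for pool_size in (1, 5):
    # Many libraries are simulated only to get a stable variance estimate.
    values = simulate_library_values(20000, pool_size, sd_bio, sd_tech, rng)
    print(
        "pool size", pool_size,
        "simulated variance", round(statistics.variance(values), 4),
        "expected variance", round(expected_variance(pool_size, sd_bio, sd_tech), 4),
    )
```

Under these assumptions, pooling shrinks only the biological part of the between-library variance, by a factor of the pool size. The number of libraries is the same in both designs, so a 4 KO vs 4 WT comparison has 8 - 2 = 6 residual degrees of freedom either way.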

-Ryan

On 04/25/2014 08:18 AM, Cei Abreu-Goodger wrote:
> Hi Ryan,
>
> I would like to pop in just to emphasize something about the current 
> economics of sequencing that clearly depends on the lab or sequencing 
> facility you're using.
>
> In our institute, and it sounds to me like Manuel is in a similar 
> situation, the most expensive part of doing a proper RNA-seq 
> experiment is the cost of each (barcoded) library. When you reply "if 
> you have the capability to do b [8 pools], then you also probably have 
> the capability to do 8 * n unpooled samples" you are clearly 
> considering that the "per lane" cost of sequencing will be the same, 
> but are missing the reality that many labs pay quite heavily for each 
> library prep. For me, and surely for others, it is quite realistic to 
> only have enough money for a limited number of library preps (say 8 or 
> 12), even though we might have many more individuals (animals, plants, 
> cell cultures, what-not) at almost no extra cost. In these cases, 
> Manuel's question becomes quite relevant: should we pool many 
> individuals into the fixed number of samples to be made into 
> libraries, or should we try to make the libraries reflect as best as 
> possible unique "individuals"? Of course when the individual provides 
> too little RNA the question is moot, but what about cases like 
> Manuel's where a single animal or tissue is enough for a library?
>
> Best,
>
> Cei
>
>
> On 4/24/14 3:24 PM, Ryan Thompson wrote:
>>> However, assuming that my budget allows me to sequence only a limited
>>> number of samples at a decent coverage (for example, 8 samples at 10
>>> million reads per sample), which of the following would be the 
>>> preferred
>>> solution?
>>>
>>> a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
>>> b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n"
>>> animals (with the same genotype, obviously).
>>>
>>
>> The preferred solution would be to take your 8 * n animals and sequence
>> them all individually using the same total amount of sequencing as you
>> would have used for the 8 pools. Each individual sample will have n 
>> times
>> less coverage, but that doesn't matter because you have still done 
>> the same
>> total amount of sequencing per condition. I read a paper showing that
>> increasing the number of biological replicates for an RNA-seq experiment
>> while holding constant the total amount of sequencing (and therefore
>> reducing the sequencing per replicate) continued to give gains in
>> statistical power up to at least 192 biological replicates (which was 
>> the
>> largest number they tested). This was in simulations, of course.
>> Unfortunately, I can't find the citation in my ever-growing library of
>> articles, but maybe someone else can supply it.
>>
>> So, I'm not sure whether option a or b is better, but if you have the
>> capability to do b, then you also probably have the capability to do 
>> 8 * n
>> unpooled samples, which is unquestionably better than either a or b.
>>
>>
>>> I am pretty sure that if the unique difference between the two types of
>>> animal (or condition) is a specific mutation, solution (a) would be THE
>>> correct solution because it would imply using truly biological and
>>> independent replicates. Solution (b) would be not just less correct, 
>>> but
>>> blatantly incorrect, because it would eliminate biological variation
>>> between replicates (especially if "n" is high), and having an 
>>> estimation of
>>> that variation is necessary to establish the significance of the
>>> differences observed between conditions.
>>>
>>
>> This is not necessarily a problem, although it might be. With the pooled
>> samples, your estimate of biological variability will be smaller, but you
>> also have fewer degrees of freedom than you would if you did all the samples
>> separately instead of pooling. I don't know which of these effects would
>> dominate. So your significance estimates may not be any less accurate or
>> unbiased, but they will probably be less precise since you are 
>> working with
>> fewer observations.
>>
>>>
>>> I acknowledge that I am answering myself, but I keep finding 
>>> examples in
>>> which pooling (in the sense that I am describing above) is not 
>>> completely
>>> discouraged. For example, Churchill (in "Fundamentals of experimental
>>> design for cDNA microarrays", 2002, Nature Genetics 32) explains 
>>> that "in a
>>> two-sample comparison, we could consider making two large pools of all
>>> available units and measuring each pool multiple times. This is a poor
>>> design, as it does not allow estimation of the between-pool 
>>> variance. By
>>> pooling all the available samples together we have minimized the 
>>> biological
>>> variance, but we have also eliminated all independent replication. 
>>> It is
>>> better to use several pools and fewer technical replicates". Why 
>>> does he
>>> write that it is better to use several pools? Wouldn't it be better 
>>> to use
>>> no pools at all?
>>>
>>
>> The considerations are different for microarrays. In sequencing, you can
>> divide up your available sequencing space into as many individual
>> replicates as you like. In microarrays, if you only have money to do 10
>> arrays, then you can only do 10 samples, so are forced to choose 
>> between 10
>> individuals or 10 pools.
>>
>>
>>> Similarly, a discussion in which pooling is not completely 
>>> discouraged can
>>> be found in:
>>>
>>> http://seqanswers.com/forums/showthread.php?t=27905
>>>
>>
>> The only place I see pooling not discouraged in that thread is the part
>> talking about 5 pools of 10 individuals each for 3 conditions vs 5
>> individuals each for 3 conditions. In that case Simon says that 
>> pooling is
>> acceptable because the money or labor costs of individually prepping 150
>> samples may be prohibitive. He still notes that prepping every sample
>> individually is the preferred solution if possible, and he notes that
>> there is a trade-off that 
>> must be
>> considered for the few samples vs few pools question. This echoes my 
>> answer
>> above in this reply.
>>
>>> Finally, pooling samples is often justified because of limited 
>>> availability
>>> of RNA. In those cases pooling is mandatory, obviously. But if 
>>> replicates
>>> have been constructed by pooling RNA from many tiny individual samples,
>>> shouldn't we have in mind that we have lost all information regarding
>>> biological variance, and that we will not be able to assess the 
>>> significance
>>> of any differences observed between conditions?
>>>
>>
>> You haven't lost *all* information about biological variance. There are
>> still different individuals going into each pool. For a concrete 
>> example,
>> when doing RNA-seq on C. elegans, a single worm doesn't provide 
>> sufficient
>> RNA, so each "sample" is actually a whole tank of worms all receiving 
>> the
>> same treatment, i.e. literally a pool of individuals. I have 
>> analyzed such
>> an experiment, and the dispersions as estimated by edgeR were on par 
>> with
>> the general guide values one would expect for genetically identical
>> individuals. As I said above, there are the balancing factors of 
>> reducing
>> variability and reducing degrees of freedom, and I'm not exactly sure 
>> how
>> they balance out.
>>
>> -Ryan