[BioC] EdgeR: replicated pools, yes or not?

Fri Apr 25 23:16:03 CEST 2014

Thinking about it, it should theoretically be possible to model the 
dispersion term of a pool as being derived from a mixture of N 
individuals. For example, taking the model used by edgeR and DESeq, the 
biological variation is Gamma distributed and the technical variation is 
Poisson distributed (yielding the NB distribution for the counts). So, 
instead of modelling the biological variation as a single gamma 
distribution, we could model it as the mean of N independent and 
identically distributed Gamma variables. However, the mean of N iid 
Gamma(k,theta) random variables is (I think) a Gamma(k * N, theta / N) 
random variable (using the shape-scale parametrization from Wikipedia). 
So the NB distribution is equally valid (or equally invalid) for both 
individuals and pools. Based on this, I would think that if you have 
pools, it is perfectly reasonable to use edgeR or DESeq or any other NB 
method on the pools. You will have fewer degrees of freedom than if you 
sequenced all the samples without pooling, but your BCV will also be 
smaller (the gamma variance is k * theta^2, so the variance of a pool of 
N is k * theta^2 / N at the same mean). So, if you have already sequenced 
pools, I think NB-based methods will give you a valid answer (in terms 
of significance levels) based on your data, without you having to do 
anything special to account for the pooling. If you have pooled data and 
you want to estimate what the dispersions would be if you had individual 
samples, you could back-calculate the parameters by reversing the above 
(I forget exactly how the gamma distribution parameters relate to the 
mean and dispersion of the NB, but there is a formula for that). 
However, calculating this would only be for curiosity's sake, since this 
would be the distribution for observations that you don't have (i.e. 
counts for individual samples), so you can't do any statistics with it.
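
The relation alluded to above can be sketched numerically. This is a
minimal sketch, not anything taken from edgeR or DESeq: it assumes each
pool is an equal mixture of its N individuals, and it uses the standard
gamma-Poisson fact that the NB dispersion (BCV squared) equals 1 / shape.
The function names and the values of k, theta and n are hypothetical,
chosen only for illustration.

```python
import numpy as np

def pooled_gamma_params(k, theta, n):
    """Shape and scale of the mean of n iid Gamma(k, theta) variables."""
    return k * n, theta / n

def nb_dispersion(k):
    """NB dispersion (BCV squared) implied by a gamma shape k."""
    return 1.0 / k

def individual_dispersion_from_pool(pool_dispersion, n):
    """Back-calculate the per-individual dispersion from pools of size n."""
    return pool_dispersion * n

# Illustrative values only (assumptions, not estimates from real data).
k, theta, n = 4.0, 0.25, 5
k_pool, theta_pool = pooled_gamma_params(k, theta, n)

# Numerical check: average n gamma draws and compare the squared
# coefficient of variation with the predicted pool dispersion 1 / (k * n).
rng = np.random.default_rng(0)
draws = rng.gamma(k, theta, size=(200_000, n)).mean(axis=1)
empirical_cv2 = draws.var() / draws.mean() ** 2

print("predicted pool dispersion:", nb_dispersion(k_pool))
print("empirical pool dispersion:", empirical_cv2)
print("back-calculated individual dispersion:",
      individual_dispersion_from_pool(nb_dispersion(k_pool), n))
```

Under these assumptions the pool dispersion is the individual dispersion
divided by N, so the BCV shrinks by a factor of sqrt(N).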

As to whether pools are preferable, I still think the best way to figure 
this out would be to simulate an experiment with few samples vs few 
pools vs many samples and see what happens. My intuition based on the 
above is that analysis based on M pools would be more powerful than 
analysis based on M individuals, but of course would be less powerful 
than analysis based on all the M * N individuals. But I wouldn't trust 
my intuition, and even if I did, my intuition is based on the assumption 
of a gamma distribution for the biological variability, which is not 
necessarily a valid assumption in the first place, so again I stress the 
need for a simulation test to see which is better.
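
A rough starting point for such a simulation is sketched below. It is a
toy for a single gene, not an edgeR or DESeq analysis: the test is a
Welch-type t statistic on log2 counts whose 5% cutoff is calibrated by
simulating under the null, and the data are generated from the same
gamma-Poisson model discussed above, so it cannot by itself show whether
that model is right. Equal library sizes and equal contributions of each
individual to a pool are assumed. All names and parameter values (M, N,
fold, mean, k) are hypothetical.

```python
import numpy as np

def simulate_counts(rng, n_sim, n_libs, pool_size, mean, k):
    # Per-individual expression: gamma with shape k and the given mean.
    expr = rng.gamma(k, mean / k, size=(n_sim, n_libs, pool_size))
    # A pool's expression is the average of its members
    # (pool_size=1 means no pooling); counts add Poisson noise.
    return rng.poisson(expr.mean(axis=2))

def welch_t(a, b):
    # Welch t statistic per simulated experiment, on log2(count + 1).
    a, b = np.log2(a + 1.0), np.log2(b + 1.0)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / np.maximum(se, 1e-12)

def power(rng, n_libs, pool_size, fold=1.5, mean=100.0, k=4.0, n_sim=20000):
    def abs_stat(f):
        ctrl = simulate_counts(rng, n_sim, n_libs, pool_size, mean, k)
        trt = simulate_counts(rng, n_sim, n_libs, pool_size, mean * f, k)
        return np.abs(welch_t(trt, ctrl))
    # Calibrate the 5% cutoff under the null (fold change of 1),
    # then count how often the alternative exceeds it.
    cutoff = np.quantile(abs_stat(1.0), 0.95)
    return float(np.mean(abs_stat(fold) > cutoff))

rng = np.random.default_rng(1)
M, N = 3, 5  # libraries per group, individuals per pool
results = {
    "M individuals": power(rng, M, 1),
    "M pools of N": power(rng, M, N),
    "M*N individuals": power(rng, M * N, 1),
}
for design, p in results.items():
    print(f"{design}: estimated power {p:.2f}")
```

To probe the gamma assumption itself, the gamma draw in
simulate_counts could be swapped for another distribution (for example
log-normal) and the comparison repeated.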

-Ryan

On 04/25/2014 06:26 AM, Manuel José Gómez Rodríguez wrote:
> Dear Ryan,
>
> Thank you very much for your detailed answers.
>
> From your comments it seems that a key point in weighing the advantages of pooling is, as you say, the relative contribution of reducing biological variability versus reducing degrees of freedom.
>
> I guess that it may depend both on the number of pooled individuals per sample and the level of variability expressed between individuals.
>
> Since the level of variability can be expected to differ depending on the species, the tissue (if applicable) and the conditions, it may not be possible to derive a general rule.
>
> Best regards,
>
> Manuel J Gómez