[BioC] EdgeR: replicated pools, yes or not?
Ryan C. Thompson
rct at thompsonclan.org
Fri Apr 25 23:15:30 CEST 2014
Thinking about it, it should theoretically be possible to model the
dispersion term of a pool as being derived from a mixture of N
individuals. For example, taking the model used by edgeR and DESeq, the
biological variation is Gamma distributed and the technical variation is
Poisson distributed (yielding the NB distribution for the counts). So,
instead of modelling the biological variation as a single Gamma
distribution, we could model it as the mean of N independent and
identically distributed Gamma variables (one per individual in the
pool, assuming each contributes equally). The mean of N iid
Gamma(k, theta) random variables is itself a Gamma(k * N, theta / N)
random variable (using the shape-scale parametrization from Wikipedia).
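
A quick numerical check of the claim above, as a minimal sketch using only the Python standard library. The values of k, theta and N are illustrative assumptions, not estimates from any real data set. It compares pools built as the mean of N Gamma(k, theta) draws with direct draws from Gamma(k * N, theta / N):

```python
import random
import statistics

def pooled_gamma_means(k, theta, n_individuals, n_pools, rng):
    """Each pool value is the mean of n_individuals iid Gamma(k, theta) draws."""
    return [
        statistics.fmean([rng.gammavariate(k, theta) for _ in range(n_individuals)])
        for _ in range(n_pools)
    ]

def squared_cv(values):
    """Sample variance divided by squared mean (1 / shape for a Gamma)."""
    m = statistics.fmean(values)
    return statistics.variance(values) / m ** 2

rng = random.Random(1)
k, theta, N = 2.0, 50.0, 5  # illustrative values only

# Pools built by averaging individuals vs. direct draws from the claimed law.
pools = pooled_gamma_means(k, theta, N, 20000, rng)
direct = [rng.gammavariate(k * N, theta / N) for _ in range(20000)]

print("mean:      ", statistics.fmean(pools), statistics.fmean(direct), k * theta)
print("squared CV:", squared_cv(pools), squared_cv(direct), 1 / (k * N))
```

If the claim holds, both samples should have a mean near k * theta and a squared coefficient of variation near 1 / (k * N), up to sampling noise.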
So the NB distribution is equally valid (or equally invalid) for both
individuals and pools. Based on this, I would think that if you have
pools, it is perfectly reasonable to use edgeR or DESeq or any other NB
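method on the pools. You will have fewer degrees of freedom than if you
had sequenced all the samples without pooling, but your BCV will also be
smaller: the Gamma variance is k * theta^2, so for a pool it becomes
(k * N) * (theta / N)^2 = k * theta^2 / N, and the BCV shrinks by a
factor of sqrt(N). So, if you have already sequenced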
pools, I think NB-based methods will give you a valid answer (in terms
of significance levels) based on your data, without you having to do
anything special to account for the pooling.

If you have pooled data and
you want to estimate what the dispersions would be if you had individual
samples, you could back-calculate the parameters by reversing the above
(I forget exactly how the gamma distribution parameters relate to the
mean and dispersion of the NB, but there is a formula for that).
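
For reference, under the Gamma-Poisson mixture described above (Poisson rate drawn from a Gamma with shape k and scale theta), the NB mean is k * theta and the NB variance is mean + mean^2 / k, so the dispersion is 1 / k and the BCV is 1 / sqrt(k). A pool of N equally weighted individuals has shape k * N, hence a dispersion N times smaller. The sketch below back-calculates along those lines. The function names and the example numbers are hypothetical, and it assumes equal contribution from every individual in the pool:

```python
def gamma_to_nb(k, theta):
    """Gamma shape/scale of the Poisson rate -> (mean, dispersion) of the NB."""
    return k * theta, 1.0 / k

def nb_to_gamma(mean, dispersion):
    """NB (mean, dispersion) -> Gamma (shape, scale) of the underlying rate."""
    return 1.0 / dispersion, mean * dispersion

def individual_dispersion_from_pool(pool_dispersion, n_per_pool):
    """Pool shape is k * N, so pool dispersion = individual dispersion / N."""
    return pool_dispersion * n_per_pool

# Hypothetical pooled estimates for one gene (not from real data).
pool_mean, pool_disp, N = 100.0, 0.02, 5

ind_disp = individual_dispersion_from_pool(pool_disp, N)
k, theta = nb_to_gamma(pool_mean, ind_disp)

print("individual dispersion:", ind_disp)
print("individual BCV:       ", ind_disp ** 0.5)
print("Gamma shape, scale:   ", k, theta)
```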
However, calculating this would only be for curiosity's sake, since this
would be the distribution for observations that you don't have (i.e.
counts for individual samples), so you can't do any statistics with it.

As to whether pools are preferable, I still think the best way to figure
this out would be to simulate an experiment with few samples vs few
pools vs many samples and see what happens. My intuition based on the
above is that analysis based on M pools would be more powerful than
analysis based on M individuals, but of course would be less powerful
than analysis based on all the M * N individuals. But I wouldn't trust
my intuition, and even if I did, my intuition is based on the assumption
of a gamma distribution for the biological variability, which is not
necessarily a valid assumption in the first place, so again I stress the
need for a simulation test to see which is better.
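
A rough sketch of what such a simulation could look like, for a single gene and using only the Python standard library. Everything here is an assumption made for illustration: the Gamma-Poisson model itself, equal contribution of individuals to a pool, equal sequencing depth for every library, the parameter values, and above all the test, which is a Welch t statistic on log2 counts with a critical value calibrated by simulation under the null. It is a stand-in, not what edgeR or DESeq do.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the modest rates used here."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_count(k, theta, n_per_pool, rng):
    """One library: Poisson count whose rate is the mean of n_per_pool Gamma draws."""
    rate = statistics.fmean([rng.gammavariate(k, theta) for _ in range(n_per_pool)])
    return poisson(rate, rng)

def welch_t(x, y):
    """Absolute Welch t statistic on log2(count + 1)."""
    lx = [math.log2(v + 1) for v in x]
    ly = [math.log2(v + 1) for v in y]
    diff = abs(statistics.fmean(lx) - statistics.fmean(ly))
    se = math.sqrt(statistics.variance(lx) / len(lx) + statistics.variance(ly) / len(ly))
    if se == 0:
        return math.inf if diff > 0 else 0.0
    return diff / se

def simulate_stats(n_libs, n_per_pool, fold_change, n_sims, rng, k=2.0, mean=30.0):
    """Test statistics for n_sims two-group experiments with n_libs libraries per group."""
    theta = mean / k
    stats = []
    for _ in range(n_sims):
        a = [sample_count(k, theta, n_per_pool, rng) for _ in range(n_libs)]
        b = [sample_count(k, theta * fold_change, n_per_pool, rng) for _ in range(n_libs)]
        stats.append(welch_t(a, b))
    return stats

def estimated_power(n_libs, n_per_pool, fold_change=2.0, n_sims=1000, alpha=0.05, seed=0):
    """Power at level alpha, with the critical value taken from null simulations."""
    rng = random.Random(seed)
    null = sorted(simulate_stats(n_libs, n_per_pool, 1.0, n_sims, rng))
    critical = null[min(int((1 - alpha) * n_sims), n_sims - 1)]
    alt = simulate_stats(n_libs, n_per_pool, fold_change, n_sims, rng)
    return sum(s > critical for s in alt) / n_sims

M, N = 3, 5  # libraries per group, individuals per pool (illustrative)
designs = {
    "M individuals": (M, 1),
    "M pools of N": (M, N),
    "M*N individuals": (M * N, 1),
}
power = {name: estimated_power(n_libs, n_pool) for name, (n_libs, n_pool) in designs.items()}
for name, p in power.items():
    print(f"{name}: estimated power ~ {p:.2f}")
```

Changing k (the individual-level shape, i.e. 1 / dispersion), N, M and the fold change shows how the trade-off between reduced variability and reduced degrees of freedom shifts. A more faithful version would simulate many genes and run the actual NB-based method on each design.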
On 04/25/2014 06:26 AM, Manuel José Gómez Rodríguez wrote:
> Dear Ryan,
> Thank you very much for your detailed answers.
> From your comments it seems that a key point in weighing the advantages of pooling is, as you say, the relative contribution of reduced biological variability versus reduced degrees of freedom.
> I guess that it may depend both on the number of individuals pooled per sample and on the level of variability between individuals.
> Since the level of variability can be expected to differ depending on the species, the tissue (if applicable) and the conditions, it may not be possible to derive a general rule.
> Best regards,
> Manuel J Gómez