[BioC] EdgeR: replicated pools, yes or not?

Fri Apr 25 23:17:56 CEST 2014

Apologies for sending multiple copies of this email. My mailer was 
having issues.

On Fri 25 Apr 2014 02:16:03 PM PDT, Ryan C. Thompson wrote:
> Thinking about it, it should theoretically be possible to model the
> dispersion term of a pool as being derived from a mixture of N
> individuals. For example, taking the model used by edgeR and DESeq,
> the biological variation is Gamma distributed and the technical
> variation is Poisson distributed (yielding the NB distribution for the
> counts). So, instead of modelling the biological variation as a single
> gamma distribution, we could model it as the mean of N independent and
> identically distributed Gamma variables. However, the mean of N iid
> Gamma(k,theta) random variables is (I think) a Gamma(k * N, theta / N)
> random variable (using the shape-scale parametrization from
> Wikipedia). So the NB distribution is equally valid (or equally
> invalid) for both individuals and pools. Based on this, I would think
> that if you have pools, it is perfectly reasonable to use edgeR or
> DESeq or any other NB method on the pools. You will have fewer degrees
> of freedom than if you did all the samples without pooling, but your
> BCV will also be smaller (gamma variance is k * theta^2, so the pooled
> variance is k * theta^2 / N while the mean is unchanged). So, if
> you have already sequenced pools, I think NB-based methods will give
> you a valid answer (in terms of significance levels) based on your
> data, without you having to do anything special to account for the
> pooling. If you have pooled data and you want to estimate what the
> dispersions would be if you had individual samples, you could
> back-calculate the parameters by reversing the above (I forget exactly
> how the gamma distribution parameters relate to the mean and
> dispersion of the NB, but there is a formula for that). However,
> calculating this would only be for curiosity's sake, since this would
> be the distribution for observations that you don't have (i.e. counts
> for individual samples), so you can't do any statistics with it.
>
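
On the gamma/NB relationship mentioned above: in the NB parametrization
used by edgeR, var = mu + phi * mu^2, and the dispersion phi is the
squared coefficient of variation of the gamma term, i.e. 1/k (so
BCV = 1/sqrt(k)); theta only affects the mean. If a pool of N has shape
k * N, its dispersion would be phi / N, and back-calculating is just
phi_individual = N * phi_pool. A minimal Python sketch to check this
numerically, assuming every individual contributes equally to the pool.
The values of k, theta and N and the function names are illustrative
assumptions, not taken from any dataset or library.

```python
import random
import statistics


def pool_levels(k, theta, n_pooled, n_draws, rng):
    """Draw expression levels, each the mean of n_pooled iid Gamma(k, theta)."""
    return [
        sum(rng.gammavariate(k, theta) for _ in range(n_pooled)) / n_pooled
        for _ in range(n_draws)
    ]


def squared_cv(values):
    """Variance over squared mean; for the gamma term this is the NB dispersion."""
    return statistics.pvariance(values) / statistics.fmean(values) ** 2


rng = random.Random(0)        # fixed seed so the check is reproducible
k, theta, N = 4.0, 0.25, 5    # hypothetical values: BCV of individuals = 0.5

indiv = pool_levels(k, theta, 1, 50_000, rng)
pools = pool_levels(k, theta, N, 50_000, rng)

disp_indiv = squared_cv(indiv)   # theory: 1 / k
disp_pool = squared_cv(pools)    # theory: 1 / (k * N)

print("individual dispersion:", disp_indiv, "theory:", 1 / k)
print("pool dispersion:      ", disp_pool, "theory:", 1 / (k * N))
print("back-calculated individual dispersion:", N * disp_pool)
```

This only checks the gamma part of the argument; it says nothing about
whether real biological variation is gamma distributed.
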
> As to whether pools are preferable, I still think the best way to
> figure this out would be to simulate an experiment with few samples vs
> few pools vs many samples and see what happens. My intuition based on
> the above is that analysis based on M pools would be more powerful
> than analysis based on M individuals, but of course would be less
> powerful than analysis based on all the M * N individuals. But I
> wouldn't trust my intuition, and even if I did, my intuition is based
> on the assumption of a gamma distribution for the biological
> variability, which is not necessarily a valid assumption in the first
> place, so again I stress the need for a simulation test to see which
> is better.
>
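
A rough sketch of what such a simulation could look like, in Python with
only the standard library. It is not an edgeR or DESeq analysis: as a
stand-in test it uses a pooled-variance two-sample t-test on
log(count + 0.5), with the two-sided 5% critical values hardcoded for
df = 4 (2.776) and df = 22 (2.074). It assumes gamma biological
variation, Poisson technical variation, equal contribution of each
individual to a pool, and the same sequencing depth per library. All
parameter values (mu, k, fold, M, N, reps) are made up for illustration.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Count arrivals of a unit-rate Poisson process in the interval [0, lam]."""
    t, n = rng.expovariate(1.0), 0
    while t <= lam:
        n += 1
        t += rng.expovariate(1.0)
    return n


def library(mu, k, n_pooled, rng):
    """One library: mean of n_pooled Gamma(k, 1/k) levels, then Poisson counts."""
    level = sum(rng.gammavariate(k, 1.0 / k) for _ in range(n_pooled)) / n_pooled
    return poisson(mu * level, rng)


def significant(a, b, t_crit):
    """Stand-in test: equal-n pooled-variance t-test on log(count + 0.5)."""
    x = [math.log(c + 0.5) for c in a]
    y = [math.log(c + 0.5) for c in b]
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
    se = math.sqrt(2 * pooled_var / len(x))
    if se == 0:
        return False  # no variation at all: treat as not significant
    return abs(statistics.fmean(x) - statistics.fmean(y)) / se > t_crit


def power(n_libs, n_pooled, t_crit, rng, mu=50.0, k=4.0, fold=2.0, reps=1000):
    """Fraction of simulated two-group experiments in which the gene is called."""
    hits = 0
    for _ in range(reps):
        a = [library(mu, k, n_pooled, rng) for _ in range(n_libs)]
        b = [library(mu * fold, k, n_pooled, rng) for _ in range(n_libs)]
        hits += significant(a, b, t_crit)
    return hits / reps


rng = random.Random(1)
M, N = 3, 4  # hypothetical design: M libraries per group, N individuals per pool

p_indiv = power(M, 1, 2.776, rng)      # M individuals per group, df = 4
p_pools = power(M, N, 2.776, rng)      # M pools of N per group, df = 4
p_all = power(M * N, 1, 2.074, rng)    # all M * N individuals, df = 22

print("M individuals:", p_indiv)
print("M pools of N: ", p_pools)
print("M * N indiv.: ", p_all)
```

To get closer to a real answer one would simulate many genes with
realistic means and dispersions and run the actual NB pipeline on the
simulated counts instead of the t-test used here.
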
> -Ryan
>
> On 04/25/2014 06:26 AM, Manuel José Gómez Rodríguez wrote:
>> Dear Ryan,
>>
>> Thank you very much for your detailed answers.
>>
>> From your comments it seems that the key point in weighing the
>> advantages of pooling is, as you say, the relative contribution of
>> reduced biological variability versus reduced degrees of freedom.
>>
>> I guess that it may depend both on the number of pooled individuals
>> per sample and the level of variability expressed between individuals.
>>
>> Since the level of variability can be expected to differ depending
>> on the species, the tissue (if applicable) and the conditions, it
>> may not be possible to derive a general rule.
>>
>> Best regards,
>>
>> Manuel J Gómez