[BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes

Wed Mar 17 17:02:19 CET 2010

On Wed, Mar 17, 2010 at 11:16 AM, Wolfgang Huber <whuber at embl.de> wrote:
> Dear Karl
>
> [reposting to list]
>
> The bioinformatician was quicker, and provided a hack that "works", but a
> statistician might have pointed out that the simulation scheme you propose
> below is a needlessly poor and slow approximation of what the hypergeometric
> distribution or the Fisher test would do faster and more exactly.
>
> "Poor" because the distribution of count variables is (typically and in
> particular in your case) not symmetric and using a standard deviation to
> define a confidence interval and significance thresholds would ignore that -
> i.e. give suboptimal results.
>
> Don't get me wrong - I think it's great when people are able to reinvent
> the wheel, but to get stuff done, using existing wheel designs tends to be
> more productive.
>
> PS I am not sure what you mean by "non-independent gene lists". If you
> already know that the lists are dependent, what exactly do you gain by
> showing that their overlap is higher than if they were independent? Isn't
> that tautological?
>
>        Best wishes
>        Wolfgang
>
>
>
> Karl Brand scripsit 17/03/10 15:45:
>>
>> Cheers Wolfgang,
>>
>> Unfortunately waiting on my local statistician also takes longer than
>> using the calculator :(
>>
>> Discussion with a much more responsive bioinformatician yielded the plan
>> to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>>
>> By using the same numbers of the chip-background probes (c. 45,000) and my
>> short-list of probes of interest (c. 500), randomly selecting and checking
>> the overlap, performed say 10,000 times, an estimate of chance overlap could
>> be obtained, along with a standard deviation to which i could compare my
>> actual results for an estimate of significance, or p-value.

Just to add to Wolfgang's sentiments here:

Random permutation testing essentially assumes that the findings
(both within a sample and between samples) are "independent" of each
other.  It is still useful for accounting for some other biases in
the data (more than one probe per gene, for example).  This isn't a
bad way to go given that the dependencies and correlations are
generally unknown, but it is important to realize that such an
analysis carries this underlying independence assumption.
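
For what it's worth, here is a rough sketch in R of the two approaches
side by side: the exact hypergeometric tail via phyper (equivalently a
one-sided fisher.test) and the random-draw simulation Karl describes.
The background size and the first list size echo the numbers in the
thread; the second list size (n2) and the observed overlap (x) are
made up purely for illustration, so treat this as a sketch under those
assumptions rather than a recipe for the real data.

```r
# Illustrative numbers only: N and n1 echo the thread (c. 45,000 probes,
# c. 500 short-listed); n2 and the observed overlap x are ASSUMED values.
N  <- 45000   # probes on the chip (background)
n1 <- 500     # size of list 1
n2 <- 500     # size of list 2 (assumed)
x  <- 20      # observed overlap (assumed)

# Overlap expected if list 2 were a random draw from the background
expected <- n1 * n2 / N

# Exact one-sided p-value P(overlap >= x) from the hypergeometric distribution
p_exact <- phyper(x - 1, m = n1, n = N - n1, k = n2, lower.tail = FALSE)

# The same test expressed as Fisher's exact test on the 2x2 table
tab <- matrix(c(x, n1 - x, n2 - x, N - n1 - n2 + x), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Simulation: label list 1 as probes 1..n1, draw list 2 at random without
# replacement, and count how many drawn probes fall in list 1
set.seed(1)
sim <- replicate(10000, sum(sample.int(N, n2) <= n1))

# Empirical p-value taken from the simulated counts themselves, not from
# mean +/- k standard deviations (the count distribution is skewed)
p_sim <- (sum(sim >= x) + 1) / (length(sim) + 1)
```

With 10,000 draws the simulated p-value cannot be smaller than
1/10001, whereas phyper returns the exact tail probability directly.
Both calculations make the same assumption that list 2 is a random
draw from the background.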

Sean

>> Correct me if we're wrong but this seemed acceptable for Venns of
>> non-independent gene lists.
>>
>> Coding this was what i was appealing for help on since my experience here
>> is limited. But, i'm definitely up for a crack at it. I'll start by having
>> a look at the "stats" package phyper.
>>
>> Again with appreciation for your prompt, thoughtful response,
>>
>> Karl
>>
>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>>
>>> Dear Karl,
>>>
>>> I don't think what you need here is necessarily a package - the required
>>> computations, if possible, are one or a few lines of R using standard
>>> functions e.g. in the "stats" package such as phyper.
>>>
>>> Perhaps the more important thing to do is to precisely define the
>>> questions you want to be asking. For this, discussion with a local
>>> statistician might be helpful. Once you have that, the answer will
>>> probably be fairly obvious from a basic text book on combinatorics
>>> (probability theory on discrete variables).
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>> Karl Brand scripsit 17/03/10 12:26:
>>>>
>>>> Dear BioCers,
>>>>
>>>> I've got six lists of genes, and i'm focused on the overlaps between them.
>>>>
>>>> What i'm searching for is a package or code to quantify the
>>>> significance of the overlap between both a pair of gene lists, and
>>>> also between three gene-lists. Six might be interesting, but not
>>>> necessary.
>>>>
>>>> Specifically, what overlap would be expected by chance, and how
>>>> many standard deviations is my actual overlap from the estimated
>>>> chance overlap?
>>>>
>>>> Whilst some of my lists are independent, others are not, being
>>>> derived from tissues of the same origin. I understand this would
>>>> exclude tests such as Fisher's exact test, which assume independence.
>>>>
>>>> By using the same numbers of chip-background probes and short-listed
>>>> probes of interest, randomly selected and checking the overlap,
>>>> performed say 10,000 times, i think i could obtain the estimates i'm
>>>> looking for in a 'statistically acceptable' manner.
>>>>
>>>> Does anyone know of a package or code written for this purpose? I
>>>> failed to find anything in BioConductor or in the BioC lists. As
>>>> simple as coding it no doubt is, my lack of R knowledge would make
>>>> doing it with a calculator the faster option :)
>>>>
>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>
>>>> Karl
>>>>
>>>>
>>>
>>>
>>
>
> --
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber/contact