[BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes

Wolfgang Huber whuber at embl.de
Wed Mar 17 16:16:45 CET 2010


Dear Karl

[reposting to list]

The bioinformatician was quicker, and provided a hack that "works", but 
a statistician might have pointed out that the simulation scheme you 
propose below is a needlessly poor and slow approximation of what the 
hypergeometric distribution or Fisher's exact test would do faster and more 
exactly.
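
As a rough sketch of what I mean, for a pair of lists: the universe size
(about 45,000 probes) and the list size (about 500) are taken from your
mail, but the second list size and the observed overlap k below are made
up purely for illustration, and the variable names are my own.

```r
# Hypothetical numbers: only N and the ~500 list size come from the thread;
# the size of list B and the observed overlap k are assumptions.
N <- 45000   # probes on the chip (the "universe")
m <- 500     # size of list A
n <- 500     # size of list B (assumed)
k <- 20      # observed overlap between A and B (assumed)

# Expected overlap if list B were a random draw from the universe
expected <- m * n / N

# P(overlap >= k) under the hypergeometric null.
# phyper(q, m, n, k): q = white balls drawn, m = white balls in the urn,
# n = black balls in the urn, k = number of balls drawn.
p_hyper <- phyper(k - 1, m, N - m, n, lower.tail = FALSE)

# The same question posed as a one-sided Fisher exact test on the 2x2 table
# (rows: in B / not in B; columns: in A / not in A).
tab <- matrix(c(k, m - k, n - k, N - m - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Both p-values answer the same question, so they should agree up to
numerical precision; no resampling loop is needed.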

"Poor" because the distribution of count variables is (typically and in 
particular in your case) not symmetric and using a standard deviation to 
define a confidence interval and significance thresholds would ignore 
that - i.e. give suboptimal results.
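
To make the point concrete, here is a minimal sketch that compares a
"number of standard deviations" p-value with the exact hypergeometric tail.
The observed overlap k and the size of the second list are again invented
for illustration; only the approximate universe and list sizes come from
your mail.

```r
# Assumed setting: 45,000 probes, two lists of 500, observed overlap of 20.
N <- 45000
m <- 500
n <- 500
k <- 20      # hypothetical observed overlap

# Mean and standard deviation of the hypergeometric overlap count
mu    <- n * m / N
sigma <- sqrt(n * (m / N) * (1 - m / N) * (N - n) / (N - 1))

# "How many standard deviations away" approach, read off a normal curve
z        <- (k - mu) / sigma
p_normal <- pnorm(z, lower.tail = FALSE)

# Exact upper tail P(overlap >= k)
p_exact <- phyper(k - 1, m, N - m, n, lower.tail = FALSE)

# Ratio of the two answers; values far from 1 show the approximation failing
ratio <- p_exact / p_normal
```

With numbers of this kind the count distribution has a long right tail, so
the normal-based p-value comes out much smaller than the exact one, i.e. it
overstates the significance.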

Don't get me wrong - I think it's great when people are able to 
reinvent the wheel, but to get stuff done, using existing wheel designs 
tends to be more productive.

PS I am not sure what you mean by "non-independent gene lists". If you 
already know that the lists are dependent, what exactly do you gain by 
showing that their overlap is higher than if they were independent? 
Isn't that tautological?

	Best wishes
	Wolfgang



Karl Brand scripsit 17/03/10 15:45:
> Cheers Wolfgang,
> 
> Unfortunately, waiting on my local statistician also takes longer than 
> using the calculator :(
> 
> Discussion with a much more responsive bioinformatician yielded the plan 
> to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
> 
> By using the same numbers of the chip-background probes (c. 45,000) and 
> my short-list of probes of interest (c. 500), randomly selected, and 
> checking the overlap, performed say 10,000 times, an estimate of chance 
> overlap could be obtained, along with a standard deviation to which I 
> could compare my actual results for an estimate of significance, or 
> p-value.
> 
> Correct me if we're wrong but this seemed acceptable for Venns of 
> non-independent gene lists.
> 
> Coding this was what I was appealing for help on, since my experience 
> here is limited. But I'm definitely up for a crack at it. I'll start 
> by having a look at phyper in the "stats" package.
> 
> Again with appreciation for your prompt, thoughtful response,
> 
> Karl
> 
> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>> Dear Karl,
>>
>> I don't think what you need here is necessarily a package - the required
>> computations, if possible, are one or a few lines of R using standard
>> functions e.g. in the "stats" package such as phyper.
>>
>> Perhaps the more important thing to do is to precisely define the
>> questions you want to be asking. For this, discussion with a local
>> statistician might be helpful. Once you have that, the answer will
>> probably be fairly obvious from a basic text book on combinatorics
>> (probability theory on discrete variables).
>>
>> Best wishes
>> Wolfgang
>>
>>
>> Karl Brand scripsit 17/03/10 12:26:
>>> Dear BioCers,
>>>
>>> I've got six lists of genes, and I'm focused on the overlaps between them.
>>>
>>> What i'm searching for is a package or code to quantify the
>>> significance of the overlap between both a pair of gene lists, and
>>> also between three gene-lists. Six might be interesting, but not
>>> necessary.
>>>
>>> Specifically, what would the overlap be expected by chance, and how
>>> many standard deviations my actual overlap is from the estimated
>>> chance overlap?
>>>
>>> Whilst some of my lists are independent, others are not, being
>>> derived from tissues of the same origin. I understand this would
>>> exclude tests such as Fisher's exact test, which assume independence.
>>>
>>> By using the same numbers of chip-background probes and short-listed
>>> probes of interest, randomly selected and checking the overlap,
>>> performed say 10,000 times, i think i could obtain the estimates i'm
>>> looking for in a 'statistically acceptable' manner.
>>>
>>> Does anyone know of a package or code written for this purpose? I
>>> failed to find anything in BioConductor or in the BioC lists. As
>>> simple as coding it no doubt is, my lack of R knowledge would make
>>> doing it with a calculator the faster option :)
>>>
>>> Look forward to any recommendations or suggestions with appreciation,
>>>
>>> Karl
>>>
>>>
>>
>>
> 


--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact


