[BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes

Thu Mar 18 17:25:14 CET 2010

Dear List,

I tried the phyper function as follows:

# phyper(overlapAB - 1, genelistA, totalprobesonchip - genelistA,
#        genelistB, lower.tail = FALSE, log.p = FALSE)
# (each name stands for a count of probes, not the list itself)

The output seemed logical to me, but I'd really appreciate someone's
patience and experience to address some concerns:

- Is it 'safe' to employ this test where genelistA and genelistB were
obtained from AnimalX-tissue1 and AnimalX-tissue2 respectively? I.e., do I
violate any data independence assumptions of this test?

- The output value is described as a 'distribution function'. Can I
interpret this to be something like the 'likelihood that my observed
result is due to chance alone'?

- Do I need to subtract 1 from my 'overlap'? In the example I followed
at tinyurl.com/ygtmefa this appears to be the case, but the vignette
has nothing on this.
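
For illustration, a minimal sketch of the two-list case. All counts are
made up (they are not my real numbers), and the variable names are only
placeholders. It checks the "subtract 1" question numerically: with
lower.tail = FALSE, phyper(q, ...) returns P(X > q), so q = overlap - 1
gives P(X >= overlap). The same value should come out of summing dhyper
over the tail and out of a one-sided fisher.test on the 2x2 table.

```r
# Hypothetical counts, for illustration only
N  <- 45000   # total probes on the chip
nA <- 500     # probes in list A
nB <- 400     # probes in list B
k  <- 12      # probes shared by A and B

# phyper(q, m, n, k): m = probes in list A, n = probes not in list A,
# k = number drawn (size of list B). With lower.tail = FALSE this is
# P(X > q), so q = k - 1 gives P(X >= k).
p_phyper <- phyper(k - 1, nA, N - nA, nB, lower.tail = FALSE)

# Same tail probability by summing the point probabilities directly
p_dhyper <- sum(dhyper(k:min(nA, nB), nA, N - nA, nB))

# Same again from a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, nA - k, nB - k, N - nA - nB + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Mean overlap if list B were a random draw from the chip
expected_overlap <- nA * nB / N
```

Read this way, the number is the probability of an overlap at least as
large as the observed one if list B were a random draw from the chip,
given the sizes of both lists. Whether that null is appropriate for two
tissues from the same animal is a separate question.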

*Most of all*, how can I perform this test on three lists of overlapping
genes, not merely the two in this case? Maybe someone knows a
hack/method to combine the 3 outputs (of three pairwise comparisons) for
an estimate of the 3-way overlap? Even a conservative estimate would be
better than nothing!
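
One possible sketch for the three-list case, offered as an assumption
rather than an established method: a randomisation that draws three lists
of the observed sizes at random from the chip, counts the probes common
to all three, and repeats. The list sizes, the observed overlap and the
helper name sim_three_way are all hypothetical. The null assumed here is
that the three lists are independent random draws, which lists from two
tissues of the same animals may not satisfy.

```r
set.seed(1)  # reproducible draws

# Hypothetical inputs, for illustration only
N     <- 45000             # total probes on the chip
sizes <- c(500, 450, 400)  # sizes of the three gene lists
obs   <- 3                 # observed number of probes in all three lists

# One random draw: size of the three-way intersection
sim_three_way <- function(N, sizes) {
  a  <- sample.int(N, sizes[1])
  b  <- sample.int(N, sizes[2])
  c3 <- sample.int(N, sizes[3])
  length(intersect(intersect(a, b), c3))
}

n_sim <- 2000
null_overlap <- replicate(n_sim, sim_three_way(N, sizes))

# Empirical one-sided p-value; the +1 keeps it from being exactly zero
p_emp <- (sum(null_overlap >= obs) + 1) / (n_sim + 1)

# Expected three-way overlap under independence, for comparison
expected <- prod(sizes) / N^2
```

The empirical p-value uses the simulated counts directly rather than a
mean and standard deviation, since the count distribution is skewed. Its
resolution is limited by n_sim, so small p-values need many more draws.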

With thanks in advance for thoughts and suggestions, cheers,

Karl

On 3/17/2010 5:16 PM, Karl Brand wrote:
> Thank you Wolfgang, Madelaine,
>
> I'd rather not reinvent the wheel if I can help it.
>
> And if you'll humor me a little longer, perhaps you can ensure I do
> what you suggest correctly for my exact application.
>
> The overlaps I have are between 6 datasets. The experiment consisted of
> a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R &
> C). FYI, the targets file is below if it helps. Each of the 6 datasets
> contains 16 time points, which I interrogated for transcripts that fit a
> sine curve and several other criteria, thus defining a list of 'rhythmic
> genes' for each of the 6 datasets.
>
> So an obvious question is which rhythmic transcripts are common between
> various combinations of the 6 datasets. The combinations being:
>
> Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for
> tissue 'R'
> Venn 2: As above for tissue 'C'
> Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
> Venn 4: As for 3. for treatment level 2 only.
> Venn 5: As for 3. for treatment level 3 only.
>
> So what I meant by "non-independent gene lists" I think might apply to
> Venn 3, 4 and 5, given that tissues 'R' & 'C' are obtained from
> the same animals, albeit 16 of them, and as time courses. But still,
> they cannot, strictly speaking, be considered independent, right? Which I
> thought some tests, including Fisher's, depend on.
>
> Knowing this, would you think the phyper function is the right one for
> my purpose? If so, I'll plough on with at least the
> confidence that someone with a lot more experience of this than me
> recommends it!
>
> Again my thanks for engaging my query,
>
> Karl
>
>
> "RNA_Targets.txt"-
>
> FileName Tissue Pperiod Time Animal
> 01file.CEL R S 1 1
> 02file.CEL C S 1 1
> 03file.CEL R S 2 2
> 04file.CEL C S 2 2
> 05file.CEL R S 3 3
> 06file.CEL C S 3 3
> 07file.CEL R S 4 4
> 08file.CEL C S 4 4
> 09file.CEL R S 5 5
> 10file.CEL C S 5 5
> 11file.CEL R S 6 6
> 12file.CEL C S 6 6
> 13file.CEL R S 7 7
> 14file.CEL C S 7 7
> 15file.CEL R S 8 8
> 16file.CEL C S 8 8
> 17file.CEL R S 9 9
> 18file.CEL C S 9 9
> 19file.CEL R S 10 10
> 20file.CEL C S 10 10
> 21file.CEL R S 11 11
> 22file.CEL C S 11 11
> 23file.CEL R S 12 12
> 24file.CEL C S 12 12
> 25file.CEL R S 13 13
> 26file.CEL C S 13 13
> 27file.CEL R S 14 14
> 28file.CEL C S 14 14
> 29file.CEL R S 15 15
> 30file.CEL C S 15 15
> 31file.CEL R S 16 16
> 32file.CEL C S 16 16
> 33file.CEL R E 1 17
> 34file.CEL C E 1 17
> 35file.CEL R E 2 18
> 36file.CEL C E 2 18
> 37file.CEL R E 3 19
> 38file.CEL C E 3 19
> 39file.CEL R E 4 20
> 40file.CEL C E 4 20
> 41file.CEL R E 5 21
> 42file.CEL C E 5 21
> 43file.CEL R E 6 22
> 44file.CEL C E 6 22
> 45file.CEL R E 7 23
> 46file.CEL C E 7 23
> 47file.CEL R E 8 24
> 48file.CEL C E 8 24
> 49file.CEL R E 9 25
> 50file.CEL C E 9 25
> 51file.CEL R E 10 26
> 52file.CEL C E 10 26
> 53file.CEL R E 11 27
> 54file.CEL C E 11 27
> 55file.CEL R E 12 28
> 56file.CEL C E 12 28
> 57file.CEL R E 13 29
> 58file.CEL C E 13 29
> 59file.CEL R E 14 30
> 60file.CEL C E 14 30
> 61file.CEL R E 15 31
> 62file.CEL C E 15 31
> 63file.CEL R E 16 32
> 64file.CEL C E 16 32
> 65file.CEL R L 1 33
> 66file.CEL C L 1 33
> 67file.CEL R L 2 34
> 68file.CEL C L 2 34
> 69file.CEL R L 3 35
> 70file.CEL C L 3 35
> 71file.CEL R L 4 36
> 72file.CEL C L 4 36
> 73file.CEL R L 5 37
> 74file.CEL C L 5 37
> 75file.CEL R L 6 38
> 76file.CEL C L 6 38
> 77file.CEL R L 7 39
> 78file.CEL C L 7 39
> 79file.CEL R L 8 40
> 80file.CEL C L 8 40
> 81file.CEL R L 9 41
> 82file.CEL C L 9 41
> 83file.CEL R L 10 42
> 84file.CEL C L 10 42
> 85file.CEL R L 11 43
> 86file.CEL C L 11 43
> 87file.CEL R L 12 44
> 88file.CEL C L 12 44
> 89file.CEL R L 13 45
> 90file.CEL C L 13 45
> 91file.CEL R L 14 46
> 92file.CEL C L 14 46
> 93file.CEL R L 15 47
> 94file.CEL C L 15 47
> 95file.CEL R L 16 48
> 96file.CEL C L 16 48
>
>
>
>
>
> On 3/17/2010 4:16 PM, Wolfgang Huber wrote:
>> Dear Karl
>>
>> [reposting to list]
>>
>> The bioinformatician was quicker, and provided a hack that "works", but
>> a statistician might have pointed out that the simulation scheme you
>> propose below is a needlessly poor and slow approximation of what the
>> hypergeometric distribution or the Fisher test would do faster and more
>> exactly.
>>
>> "Poor" because the distribution of count variables is (typically and in
>> particular in your case) not symmetric and using a standard deviation to
>> define a confidence interval and significance thresholds would ignore
>> that - i.e. give suboptimal results.
>>
>> Don't get me wrong - I think it's great when people are able to
>> reinvent the wheel, but to get stuff done, using existing wheel designs
>> tends to be more productive.
>>
>> PS I am not sure what you mean by "non-independent gene lists". If you
>> already know that the lists are dependent, what exactly do you gain by
>> showing that their overlap is higher than if they were independent?
>> Isn't that tautological?
>>
>> Best wishes
>> Wolfgang
>>
>>
>>
>> Karl Brand scripsit 17/03/10 15:45:
>>> Cheers Wolfgang,
>>>
>>> Unfortunately, waiting on my local statistician also takes longer than
>>> using the calculator :(
>>>
>>> Discussion with a much more responsive bioinformatician yielded the
>>> plan to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>>>
>>> By using the same numbers of chip-background probes (c. 45,000)
>>> and my short-list of probes of interest (c. 500), randomly selecting
>>> them and checking the overlap, repeated say 10,000 times, an estimate of
>>> chance overlap could be obtained, along with a standard deviation
>>> against which I could compare my actual results for an estimate of
>>> significance, or p-value.
>>>
>>> Correct me if we're wrong but this seemed acceptable for Venns of
>>> non-independent gene lists.
>>>
>>> Coding this was what I was appealing for help on, since my experience
>>> here is limited. But I'm definitely up for a crack at it. I'll start
>>> by having a look at phyper in the "stats" package.
>>>
>>> Again with appreciation for your prompt, thoughtful response,
>>>
>>> Karl
>>>
>>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>>> Dear Karl,
>>>>
>>>> I don't think what you need here is necessarily a package - the
>>>> required computations, if possible, are one or a few lines of R using
>>>> standard functions, e.g. in the "stats" package, such as phyper.
>>>>
>>>> Perhaps the more important thing to do is to precisely define the
>>>> questions you want to be asking. For this, discussion with a local
>>>> statistician might be helpful. Once you have that, the answer will
>>>> probably be fairly obvious from a basic text book on combinatorics
>>>> (probability theory on discrete variables).
>>>>
>>>> Best wishes
>>>> Wolfgang
>>>>
>>>>
>>>> Karl Brand scripsit 17/03/10 12:26:
>>>>> Dear BioCers,
>>>>>
>>>>> I've got six lists of genes, and I'm focused on the overlaps
>>>>> between them.
>>>>>
>>>>> What I'm searching for is a package or code to quantify the
>>>>> significance of the overlap between both a pair of gene lists, and
>>>>> also between three gene lists. Six might be interesting, but not
>>>>> necessary.
>>>>>
>>>>> Specifically, what would the overlap be expected by chance, and how
>>>>> many standard deviations my actual overlap is from the estimated
>>>>> chance overlap?
>>>>>
>>>>> Whilst some of my lists are independent, others are not, being
>>>>> derived from tissues of the same origin. I understand this would
>>>>> exclude tests such as Fisher's exact test, which assume independence.
>>>>>
>>>>> By using the same numbers of chip-background probes and short-listed
>>>>> probes of interest, randomly selected and checking the overlap,
>>>>> performed say 10,000 times, I think I could obtain the estimates I'm
>>>>> looking for in a 'statistically acceptable' manner.
>>>>>
>>>>> Does anyone know of a package or code written for this purpose? I
>>>>> failed to find anything in BioConductor or in the BioC lists. As
>>>>> simple as coding it no doubt is, my lack of R knowledge would make
>>>>> doing it with a calculator the faster option :)
>>>>>
>>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
Karl Brand k.brand-asperand-erasmusmc.nl
Department of Genetics
Erasmus MC
Dr Molewaterplein 50
3015 GE Rotterdam
lab +31 (0)10 704 3409 fax +31 (0)10 704 4743 mob +31 (0)642 777 268