[R] Multivariate hypergeometric distribution version of phyper()

Charles C. Berry cberry at tajo.ucsd.edu
Tue Mar 30 17:55:37 CEST 2010


On Tue, 30 Mar 2010, Karl Brand wrote:

> Dear R Users,
>
> I employed the phyper() function to estimate the likelihood that the number 
> of genes overlapping between 2 different lists of genes is due to chance. 
> This appears to work appropriately.
>
> Now i want to try this with 3 lists of genes which phyper() does not appear 
> to support.
>
> Some googling suggests i can utilize the Multivariate hypergeometric 
> distribution to achieve this. eg.:
>
> http://en.wikipedia.org/wiki/Hypergeometric_distribution
>
> But when i try to do this manually using the choose() function (see attempt 
> below example with just two gene lists) i'm unable to perform the 
> calculations- the numbers hit infinity before getting an answer.
>
> Searching cran archives for "Multivariate hypergeometric" show this term in 
> the vignettes of package's ‘combinat’ and ‘forward’. But i'm unable to make 
> sense of the these pachakege functions in the context of my aforementioned 
> apllication.
>
> Can some one suggest a function, script or method to achieve my goal of 
> estimating the likelyhood of overlap between 3 lists of genes, ideally using 
> the multivariate hypergeometric, or anything else for that matter?


Two suggestions:

 	1) Don't! Likely the theory is unsuited for the application. In
            most applications that generate lists of genes, the genes are
            not iid realizations and the hypergeometric gives results that
            are astonishingly anticonservative. As an alternative , the
            block bootstrap may be suitable. See
 	        http://171.66.122.45/cgi/content/abstract/17/6/760

 	   and Google (scholar) 'genomic "block bootstrap"' for some
 	  starting points.


 	2) Take this thread to the bioconductor list. You are much
            more likely to get pointers to useful packages and functions
            for genomic statistical software there.

HTH,

Chuck


>
> cheers in advance,
>
> Karl
>
>
>
> #example attempt with two gene lists m & n
> N <- 45101 # total number balls in urn
> m <- 720   # number of 'white' or 'special' balls in urn, aka 'success'
> n <- 801   # number balls drawn or number of samples
> k <- 40    # number of 'white' or 'special' balls DRAWN
>
> a <- choose(m,k)
> b <- choose((N-m),(n-k))
> z <- choose(N,n)
> prK <- (a*b)/z #'the answer'
> print(prK)
> [1] NaN
>
>>  a
> [1] 7.985852e+65
>>  b
> [1] Inf
>>  z
> [1] Inf
>
>
> -- 
> Karl Brand
> Department of Genetics
> Erasmus MC
> Dr Molewaterplein 50
> 3015 GE Rotterdam
> T +31 (0)10 704 3457 | F +31 (0)10 704 4743 | M +31 (0)642 777 268
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the R-help mailing list