[R] Code find exact distribution for runs test?

Dale Steele dale.w.steele at gmail.com
Fri Feb 12 14:33:40 CET 2010


Wow!  Your reply amply demonstrates the power of understanding the
math (or finding someone who does) before turning on the computer.

One last R question...how could I efficiently enumerate all
distinguishable permutations of a vector of signs?

vect <- c( -1, -1, 1, 1, 1)

permn(vect) #length:  factorial(length(vect))
????           #length:  choose(n, n1)

Best Regards.  --Dale


On Thu, Feb 11, 2010 at 10:04 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> Try this:
>
> druns2 <- function(x, n1, n2) {
>
>        if( x %% 2 ) { # odd x
>                choose( n1 - 1, n1 - (x+1)/2 ) * choose( n2 - 1, n2 - (x-1)/2 ) / choose( n1+n2, n1 ) +
>                choose( n2 - 1, n2 - (x+1)/2 ) * choose( n1 - 1, n1 - (x-1)/2 ) / choose( n1+n2, n2 )
>        } else { # even x
>                choose( n1 - 1, n1 - x/2 ) * choose( n2 - 1, n2 - x/2 ) / choose( n1 + n2, n1 ) +
>                choose( n2 - 1, n2 - x/2 ) * choose( n1 - 1, n1 - x/2 ) / choose( n1 + n2, n2 )
>        }
> }
>
> exp.2 <- sapply( 2:7, druns2, n1=3, n2=4 )
> exp.2 - exp.r/factorial(7)
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: Dale Steele [mailto:dale.w.steele at gmail.com]
>> Sent: Thursday, February 11, 2010 5:28 PM
>> To: Greg Snow
>> Cc: R-help at r-project.org
>> Subject: Re: [R] Code find exact distribution for runs test?
>>
>> Thanks for this!  My original query was probably unclear.  I think you
>> have answered a different, possibly more interesting question.  My
>> goal was to find an exact distribution of a total number of runs R in
>> samples of two types of size (n1, n2) under the null hypothesis of
>> randomness.
>>
>> The horribly inefficient code below generates results which match
>> Table 10 in Wackerly, Mendehall and Scheaffer for the distribution of
>> the total number of runs R in samples of  size (n1, n2); P(R <= r),
>> under the null hypothesis of randomness.  Horribly inefficient because
>> couldn't figure out how to generate only the distinguishable
>> permutations of my sample vector. Hope this brute force approach
>> illustrates what I am after.
>>
>>
>> nruns <- function(x) {
>>     signs <- sign(x)
>>     runs <- rle(signs)
>>     r <- length(runs$lengths)
>>     return(r)
>> }
>>
>> library(combinat)
>> # for example (n1=3, n2=4)
>> n1 <- 3;  n2 <- 4; n <- n1+n2
>> vect <- rep( c(-1,1), c(3,4))
>> vect
>>
>> # ! 'nruns(vect)' generates factorial(7) permutations, only
>> choose(7,3) are distinguishable
>>
>> exp.r <- table(unlist(permn(vect, nruns )))
>> cumsum(dist/factorial(7))
>>
>> # Result agrees with Table 10, pg 814 (n1=3, n2=4)
>> #in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
>> # exact calculations.
>>
>> Thanks.  --Dale
>>
>> On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> > OK, I think I have it worked out for both cases:
>> >
>> > library(vcd)
>> >
>> > druns <- function(x, n) { # x runs in n data points, not vectorized
>> >                                                # based on sample
>> median
>> >
>> >        if( n%%2 ) stop('n must be even')
>> >
>> >        if( x %% 2 ) { # odd x
>> >                choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-
>> (x-1)/2 )/
>> >                                choose(n,n/2) * 2
>> >        } else {                # even x
>> >                choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2
>> )/
>> >                                choose(n,n/2) * 2
>> >        }
>> > }
>> >
>> > ## population median
>> > out1 <- replicate( 100000, {x <- rnorm(10);
>> length(rle(sign(x))$lengths) } )
>> >
>> > rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
>> > chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
>> >
>> >
>> > ## sample median
>> > out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x);
>> length(rle(sign(x))$lengths) } )
>> >
>> > fit <- sapply( 2:10, druns, n=10 )
>> >
>> > rootogram( table(out2), fit * 100000 )
>> > chisq.test( table(out2), p=fit )
>> >
>> >
>> > The tests only fail to prove me wrong, not a firm proof that I am
>> correct.  But given the simulation size I am at least pretty close.  I
>> can see why people worked out approximations before we had good
>> computers, I would not want to calculate the above by hand.
>> >
>> > Enjoy,
>> >
>> > --
>> > Gregory (Greg) L. Snow Ph.D.
>> > Statistical Data Center
>> > Intermountain Healthcare
>> > greg.snow at imail.org
>> > 801.408.8111
>> >
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> >> project.org] On Behalf Of Greg Snow
>> >> Sent: Thursday, February 11, 2010 12:19 PM
>> >> To: Dale Steele; R-help at r-project.org
>> >> Subject: Re: [R] Code find exact distribution for runs test?
>> >>
>> >> I am not an expert in this area, but here are some thoughts that may
>> >> get you started towards an answer.
>> >>
>> >> First, there are 2 ways (possibly more) that can lead to the data
>> for a
>> >> runs test that lead to different theoretical distributions:
>> >>
>> >> 1. We have a true or hypothesized value of the median that we
>> >> subtracted from the data, therefore each value has 50% probability
>> of
>> >> being positive/negative and all are independent of each other
>> (assuming
>> >> being exactly equal to the median is impossible or discarded).
>> >>
>> >> 2. We have subtracted the sample median from each sample value (and
>> >> discarded any 0's) leaving us with exactly half positive and half
>> >> negative and not having independence.
>> >>
>> >> In case 1, the 1st observation will always start a run.  The second
>> >> observation has a 50% chance of being the same sign (F) or different
>> >> sign (S), with the probability being 0.5 for each new observation
>> and
>> >> them all being independent (under assumption of random selections
>> from
>> >> population with known/hypothesized median) and the number of runs
>> >> equaling the number of S's, this looks like a binomial to me (with
>> some
>> >> '-1's inserted in appropriate places.
>> >>
>> >> In case 2, this looks like a hypergeometric distribution, there
>> would
>> >> be n!/((n/2)!(n/2)!) possible permutations, just need to compute how
>> >> many of those permutations result in x runs to get the probability.
>> >> There is probably a way to think about this in terms of balls and
>> urns,
>> >> but I have not worked that out yet.
>> >>
>> >> Hope this helps,
>> >>
>> >> --
>> >> Gregory (Greg) L. Snow Ph.D.
>> >> Statistical Data Center
>> >> Intermountain Healthcare
>> >> greg.snow at imail.org
>> >> 801.408.8111
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> >> > project.org] On Behalf Of Dale Steele
>> >> > Sent: Wednesday, February 10, 2010 6:16 PM
>> >> > To: R-help at r-project.org
>> >> > Subject: [R] Code find exact distribution for runs test?
>> >> >
>> >> > I've been attempting to understand the one-sample run test for
>> >> > randomness.  I've found run.test{tseries} and run.test{lawstat}.
>> >> Both
>> >> > use a large sample approximation for distribution of the total
>> number
>> >> > of runs in a sample of n1 observations of one type and n2
>> >> observations
>> >> > of another type.
>> >> >
>> >> > I've been unable to find R code to generate the exact distribution
>> >> and
>> >> > would like to see how this could be done (not homework).
>> >> >
>> >> > For example, given the data:
>> >> >
>> >> > dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12,
>> -
>> >> 9,
>> >> > 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
>> >> >
>> >> > The Monte Carlo permutation approach seems to get me part way.
>> >> >
>> >> >
>> >> > # calculate the number of runs in the data vector
>> >> > nruns <- function(x) {
>> >> >     signs <- sign(x)
>> >> >     runs <- rle(signs)
>> >> >     r <- length(runs$lengths)
>> >> >     return(r)
>> >> > }
>> >> >
>> >> > MC.runs <- function(x, nperm) {
>> >> > RUNS <- numeric(nperm)
>> >> > for (i in  1:nperm) {
>> >> >     RUNS[i] <- nruns(sample(x))
>> >> > }
>> >> >     cdf <- cumsum(table(RUNS))/nperm
>> >> >     return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
>> >> > }
>> >> >
>> >> > MC.runs(dtemp, 100000)
>> >> >
>> >> > Thanks.  --Dale
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide http://www.R-project.org/posting-
>> >> > guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-
>> >> guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>



More information about the R-help mailing list