[R] Code find exact distribution for runs test?

Fri Feb 12 01:27:36 CET 2010

Thanks for this!  My original query was probably unclear.  I think you
have answered a different, possibly more interesting question.  My
goal was to find an exact distribution of a total number of runs R in
samples of two types of size (n1, n2) under the null hypothesis of
randomness.

The horribly inefficient code below generates results which match
Table 10 in Wackerly, Mendehall and Scheaffer for the distribution of
the total number of runs R in samples of  size (n1, n2); P(R <= r),
under the null hypothesis of randomness.  Horribly inefficient because
couldn't figure out how to generate only the distinguishable
permutations of my sample vector. Hope this brute force approach
illustrates what I am after.

nruns <- function(x) {
    signs <- sign(x)
    runs <- rle(signs)
    r <- length(runs$lengths)
    return(r)
}

library(combinat)
# for example (n1=3, n2=4)
n1 <- 3;  n2 <- 4; n <- n1+n2
vect <- rep( c(-1,1), c(3,4))
vect

# ! 'nruns(vect)' generates factorial(7) permutations, only
choose(7,3) are distinguishable

exp.r <- table(unlist(permn(vect, nruns )))
cumsum(dist/factorial(7))

# Result agrees with Table 10, pg 814 (n1=3, n2=4)
#in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
# exact calculations.

Thanks.  --Dale

On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> OK, I think I have it worked out for both cases:
>
> library(vcd)
>
> druns <- function(x, n) { # x runs in n data points, not vectorized
>                                                # based on sample median
>
>        if( n%%2 ) stop('n must be even')
>
>        if( x %% 2 ) { # odd x
>                choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-(x-1)/2 )/
>                                choose(n,n/2) * 2
>        } else {                # even x
>                choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2 )/
>                                choose(n,n/2) * 2
>        }
> }
>
> ## population median
> out1 <- replicate( 100000, {x <- rnorm(10); length(rle(sign(x))$lengths) } )
>
> rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
> chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
>
>
> ## sample median
> out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x); length(rle(sign(x))$lengths) } )
>
> fit <- sapply( 2:10, druns, n=10 )
>
> rootogram( table(out2), fit * 100000 )
> chisq.test( table(out2), p=fit )
>
>
> The tests only fail to prove me wrong, not a firm proof that I am correct.  But given the simulation size I am at least pretty close.  I can see why people worked out approximations before we had good computers, I would not want to calculate the above by hand.
>
> Enjoy,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Greg Snow
>> Sent: Thursday, February 11, 2010 12:19 PM
>> To: Dale Steele; R-help at r-project.org
>> Subject: Re: [R] Code find exact distribution for runs test?
>>
>> I am not an expert in this area, but here are some thoughts that may
>> get you started towards an answer.
>>
>> First, there are 2 ways (possibly more) that can lead to the data for a
>> runs test that lead to different theoretical distributions:
>>
>> 1. We have a true or hypothesized value of the median that we
>> subtracted from the data, therefore each value has 50% probability of
>> being positive/negative and all are independent of each other (assuming
>> being exactly equal to the median is impossible or discarded).
>>
>> 2. We have subtracted the sample median from each sample value (and
>> discarded any 0's) leaving us with exactly half positive and half
>> negative and not having independence.
>>
>> In case 1, the 1st observation will always start a run.  The second
>> observation has a 50% chance of being the same sign (F) or different
>> sign (S), with the probability being 0.5 for each new observation and
>> them all being independent (under assumption of random selections from
>> population with known/hypothesized median) and the number of runs
>> equaling the number of S's, this looks like a binomial to me (with some
>> '-1's inserted in appropriate places.
>>
>> In case 2, this looks like a hypergeometric distribution, there would
>> be n!/((n/2)!(n/2)!) possible permutations, just need to compute how
>> many of those permutations result in x runs to get the probability.
>> There is probably a way to think about this in terms of balls and urns,
>> but I have not worked that out yet.
>>
>> Hope this helps,
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> Statistical Data Center
>> Intermountain Healthcare
>> greg.snow at imail.org
>> 801.408.8111
>>
>>
>> > -----Original Message-----
>> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> > project.org] On Behalf Of Dale Steele
>> > Sent: Wednesday, February 10, 2010 6:16 PM
>> > To: R-help at r-project.org
>> > Subject: [R] Code find exact distribution for runs test?
>> >
>> > I've been attempting to understand the one-sample run test for
>> > randomness.  I've found run.test{tseries} and run.test{lawstat}.
>> Both
>> > use a large sample approximation for distribution of the total number
>> > of runs in a sample of n1 observations of one type and n2
>> observations
>> > of another type.
>> >
>> > I've been unable to find R code to generate the exact distribution
>> and
>> > would like to see how this could be done (not homework).
>> >
>> > For example, given the data:
>> >
>> > dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12, -
>> 9,
>> > 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
>> >
>> > The Monte Carlo permutation approach seems to get me part way.
>> >
>> >
>> > # calculate the number of runs in the data vector
>> > nruns <- function(x) {
>> >     signs <- sign(x)
>> >     runs <- rle(signs)
>> >     r <- length(runs$lengths)
>> >     return(r)
>> > }
>> >
>> > MC.runs <- function(x, nperm) {
>> > RUNS <- numeric(nperm)
>> > for (i in  1:nperm) {
>> >     RUNS[i] <- nruns(sample(x))
>> > }
>> >     cdf <- cumsum(table(RUNS))/nperm
>> >     return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
>> > }
>> >
>> > MC.runs(dtemp, 100000)
>> >
>> > Thanks.  --Dale
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-
>> > guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>