[R] Non-parametric test for location with two unpaired sets of data measured on ordinal scale.

Fri Sep 25 20:29:15 CEST 2009

Thanks Marc,

The sampling is so easy that I often forget that we can do the exact permutation test for smaller samples (and I can never remember when small is small enough for this).  With the exact permutations we really don't need to do the prop.test or binom.test, I usually do that to get the confidence interval on the p-value due to sampling from the permutations rather than doing all possible (and this tells me if I need to increase the number of permutations to be sure my p-value is precise enough).  With all possible permutations, there is no sampling, and no need for an interval, the p-value is exact.

Thanks again, I need to remember combn.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at me.com]
> Sent: Friday, September 25, 2009 12:17 PM
> To: Greg Snow
> Cc: John Sorkin; r-help at r-project.org
> Subject: Re: [R] Non-parametric test for location with two unpaired
> sets of data measured on ordinal scale.
> 
> Greg and John,
> 
> Just to throw it out there, the data sets here are small enough that
> you co do a fully enumerable permutation test by replacing your
> replicate() call with:
> 
>    perms <- combn(17, 9, function(x) median(sets[x]) - median(sets[-
> x]))
> 
> This is based on an off-list communication that I had with Peter
> Dalgaard about 3 years ago for a different scenario and gives you:
> 
>  > choose(17, 9)
> [1] 24310
> 
> permutations.
> 
> 
> It does not take long:
> 
>  > system.time(perms <- combn(17, 9, function(x) median(sets[x]) -
> median(sets[-x])))
>     user  system elapsed
>    3.863   0.019   3.898
> 
> 
> Which yields:
> 
>  > str(perms)
>   num [1:24310(1d)] -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
> 
> 
>  > table(perms)
> perms
>    -2 -1.5   -1 -0.5    0  0.5    1  1.5    2
>   285  175 2595 7000 8050  875 3720 1425  185
> 
> 
> perms <- c(orig,perms)
> 
> prop.test( sum(perms>=orig), length(perms) )
> # or
> binom.test(sum(perms >= orig), length(perms))
> 
> 
> # Variation on the graphic...
> plot(table(perms), type = "h")
> abline(v = orig, col = "blue")
> 
> 
> See ?combn for more information.
> 
> HTH,
> 
> Marc Schwartz
> 
> 
> 
> On Sep 25, 2009, at 11:47 AM, Greg Snow wrote:
> 
> > Yes, I agree that the median makes the most sense here, but there
> > could be other measures of location that would be of interest
> > (quartiles, some version of the rank sum).
> >
> > Here is some sample code for a permutation test on the medians
> > (there are a couple of packages that will do this as well, but this
> > is pretty straight forward with straight R code):
> >
> > set1 <- c(1,3,2,2,4,3,3,2,2)
> > set2 <- c(4,4,4,3,3,5,4,4)
> >
> > sets <- c(set1,set2)
> > g1 <- seq_along(set1)
> >
> > orig <- median( sets[ -g1 ] ) - median( sets[ g1 ] )
> >
> > perms <- replicate( 1999, { tmp <- sample(sets)
> > 	median( tmp[ -g1 ] ) - median( tmp[ g1 ] ) } )
> >
> > #  or
> >
> > pb <- winProgressBar(max=1999)
> >
> > setWinProgressBar(pb, 0)
> >
> > perms <- replicate(1999, { setWinProgressBar( pb,
> > getWinProgressBar(pb) + 1 )
> > 	tmp <- sample(sets)
> > 	median( tmp[ -g1 ] ) - median( tmp[ g1 ] ) } )
> >
> > close(pb)
> >
> >
> > perms <- c(orig,perms)
> > sum( perms >= orig )
> > mean( perms >= orig )
> > prop.test( sum(perms>=orig), length(perms) )
> > hist(perms)
> > abline(v=orig, col='blue')
> >
> > (if you want the progress bar on an os other than windows, then use
> > the tcltk package and the tkProgressBar).
> >
> > Hope this helps,
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: John Sorkin [mailto:jsorkin at grecc.umaryland.edu]
> >> Sent: Thursday, September 24, 2009 2:52 PM
> >> To: Greg Snow; r-help at r-project.org
> >> Subject: Re: [R] Non-parametric test for location with two unpaired
> >> setsof data measured on ordinal scale.
> >>
> >> Greg,
> >> I used the term location because I did not want to use the terms
> mean
> >> or median for the exact reason that you gave; these to values can be
> >> different in a given distribution. I want to test the null
> hypothesis
> >> that the data come from a single distribution. This is often done by
> >> comparing a measure of location (e.g. mean for ANOVOA), but as you
> >> know
> >> the mean need not be the only measure of location that is tested.
> >> Giiven that my data are measured on an ordinal scale, the mean is
> >> without meaning, so I suspect that the best measure for me would be
> a
> >> comparison of medians, but I am open to other suggestions.
> >> John
> >>
> >> John David Sorkin M.D., Ph.D.
> >> Chief, Biostatistics and Informatics
> >> University of Maryland School of Medicine Division of Gerontology
> >> Baltimore VA Medical Center
> >> 10 North Greene Street
> >> GRECC (BT/18/GR)
> >> Baltimore, MD 21201-1524
> >> (Phone) 410-605-7119
> >> (Fax) 410-605-7913 (Please call phone number above prior to faxing)
> >>
> >>>>> Greg Snow <Greg.Snow at imail.org> 9/24/2009 4:30 PM >>>
> >> What do you mean by location?  I can think of examples where 2
> >> distributions have the same median but different means, or the same
> >> means but different medians.
> >>
> >> Are you willing to assume that the distributions are exactly the
> same
> >> under the null hypothesis? (not just the same 'center/location')
> >>
> >> I would probably do a permutation test on the difference between the
> >> means or medians (which ever you think is more meaningful), this
> >> assumes the exact same distribution under the null.
> >>
> >> You can also do a Mann-Whitney/Wilcoxin test (but I don't like
> >> explaining, or sometimes even thinking about, what it is actually
> >> testing), you could do a bootstrap confidence interval on the
> >> difference between means/medians (does not assume distributions are
> >> the
> >> same, just have same mean/median), or you could just replace all
> >> values
> >> by their ranks and do a t-test (essentially transforms the data to a
> >> uniform distribution, the CLT for the uniform kicks in around n=5,
> >> but
> >> I would simulate just to check).
> >>
> >> This is not the nice simple answer that you were probably looking
> >> for,
> >> but hopefully it gives you some things to think about that will
> help,
> >>
> >> --
> >> Gregory (Greg) L. Snow Ph.D.
> >> Statistical Data Center
> >> Intermountain Healthcare
> >> greg.snow at imail.org
> >> 801.408.8111
> >>
> >>
> >>> -----Original Message-----
> >>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >>> project.org] On Behalf Of John Sorkin
> >>> Sent: Thursday, September 24, 2009 1:08 PM
> >>> To: r-help at r-project.org
> >>> Subject: [R] Non-parametric test for location with two unpaired
> sets
> >> of
> >>> data measured on ordinal scale.
> >>>
> >>> Please forgive a stats question.
> >>>
> >>> I have to sets of data (unpaired) measured on an ordinal scale. I
> >> want
> >>> to test to see if the two sets are different (i.e. do they have the
> >>> same location):
> >>>
> >>> set1: 1,3,2,2,4,3,3,2,2
> >>> set:   4,4,4,3,3,5,4,4
> >>>
> >>> What is the most appropriate non-parametric test to test location?
> >>>
> >>> Thanks,
> >>> John
> >>>
> >>> Confidentiality Statement:
> >>> This email message, including any attachments, is for
> >>> th...{{dropped:6}}
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-
> >>> guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-
> >> guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >> Confidentiality Statement:
> >> This email message, including any attachments, is for ...{{dropped:
> >> 6}}
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> > and provide commented, minimal, self-contained, reproducible code.