[R] ks.test - continuous vs discrete

Torsten Hothorn Torsten.Hothorn at rzmail.uni-erlangen.de
Wed Mar 27 16:15:11 CET 2002


> 
> I frequently want to test for differences between animal size frequency
> distributions.  The obvious test (I think) to use is the Kolmogorov-Smirnov
> two sample test (provided in R as the function ks.test in package ctest).

"obvious" depends on the problem you want to test: KS tests the hypothesis

H_0: F(z) = G(z) for all z vs. H_1: F(z) != G(z) for at least one z 

ks.test assumes that both F and G are continuous distribution functions.
However, if you want to test

H_0: F(z) = G(z)  vs. H_1: F(z) = G(z - delta); delta != 0

as "test for differences" indicates, the Wilcoxon rank sum test is
"obvious". Or, more generally, if your hypothesis is "exchangeability", a
permutation test can be used.
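
To make the KS hypothesis concrete: the two-sample statistic D is the
largest absolute gap between the two empirical distribution functions. A
minimal sketch in base R, using the sample data from the original question;
the helper name ks_stat is hypothetical and not part of any package:

```r
# Sample data as given in the original question (Sokal & Rohlf example)
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Hypothetical helper: largest absolute gap between the two empirical CDFs
ks_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))   # the ECDFs only jump at observed values
  max(abs(ecdf(x)(z) - ecdf(y)(z)))
}

D <- ks_stat(A, B)
print(D)
```

This should reproduce the D = 0.475 that ks.test reports for these data.
It only illustrates the statistic; it says nothing about the p-value,
which is where the ties cause trouble.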

> The KS test is for continuous variables and this obviously includes length,
> weight etc.  However, limitations in measuring (e.g length to the nearest
> cm/mm, weight to the nearest g/mg etc) has the obvious effect of
> "discretising" real data.

or maybe the underlying distribution is discrete? 

Anyway: ks.test and wilcox.test in ctest assume data from continuous
distributions, and an approximation (the normal approximation for
wilcox.test) is used if ties occur.

For the Wilcoxon and permutation test, the conditional distribution (that
is: conditional on the ties) can be computed using the exactRankTests
package.
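
A rough way to check for ties and to ask wilcox.test for the normal
approximation explicitly rather than by fallback. This is only a sketch
with the data from the original question; the variable names are made up
for illustration:

```r
# Sample data as given in the original question
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Ties are checked on the pooled sample, since ranks are assigned jointly
pooled   <- c(A, B)
has_ties <- anyDuplicated(pooled) > 0
print(has_ties)

# exact = FALSE requests the normal approximation explicitly
res <- wilcox.test(B, A, exact = FALSE)
print(res$statistic)
print(res$p.value)
```

The statistic W does not depend on how the p-value is computed; only the
p-value differs between the approximation and the conditional (exact)
distribution.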

> 
> The ks.test function checks for the presence of ties noting in the help page
> that "continuous distributions do not generate them".  Given the problem of
> "measuring to the nearest..." noted above I frequently find that my data has
> ties and ks.test generates a warning.
> I was interested to note that the example of a two-sample KS test given in
> Sokal & Rohlf's "Biometry" (I have the 2nd edition where the example is on
> p.441) has exactly the same problem:
> > A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
> > B <- c(100,105,107,107,108,111,116,120,121,123)

For your example: 

R> library(exactRankTests)
R> wilcox.exact(B, A)

        Exact Wilcoxon rank sum test

data:  B and A 
W = 36.5, p-value = 0.02039
alternative hypothesis: true mu is not equal to 0 


R> perm.test(B, A)

        2-sample Permutation Test

data:  B and A 
T = 1118, p-value = 0.01864
alternative hypothesis: true mu is not equal to 0 
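
If exactRankTests is not at hand, the permutation p-value can be
approximated by random relabelling in base R. The sketch below is a Monte
Carlo approximation, not the exact computation done by perm.test. It
assumes the test statistic is the sum of the first sample (sum(B) is 1118,
which coincides with T above) and uses one common two-sided convention;
the seed and the number of resamples are arbitrary choices:

```r
# Sample data as given in the original question
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

set.seed(1)                    # arbitrary seed, only for reproducibility
pooled <- c(A, B)
n_B    <- length(B)
T_obs  <- sum(B)               # assumed statistic: sum of the first sample

# Under exchangeability any n_B of the pooled values could have been "B"
n_perm <- 9999
T_perm <- replicate(n_perm, sum(sample(pooled, n_B)))

# Two-sided: count resampled sums at least as far from the permutation
# mean as the observed sum (the observed sample counts as one permutation)
T_mean <- n_B * mean(pooled)
p_mc   <- (1 + sum(abs(T_perm - T_mean) >= abs(T_obs - T_mean))) / (n_perm + 1)

print(T_obs)
print(p_mc)
```

The result varies with the seed and should only be close to, not equal to,
the exact conditional p-value.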

Torsten

> > ks.test(A,B)
> 
>         Two-sample Kolmogorov-Smirnov test
> 
> data:  A and B 
> D = 0.475, p-value = 0.1244
> alternative hypothesis: two.sided 
> 
> Warning message: 
> cannot compute correct p-values with ties in: ks.test(A, B)
> In their chapter 2, "Data in Biology", Sokal & Rohlf note "any given reading
> of a continuous variable ... is therefore an approximation to the exact
> reading, which is in practice unknowable.  However, for the purposes of
> computation these approximations are usually sufficient..."
> I am interested to know whether this can be made more exact.  Are there
> methods to test that data are measured at an appropriate scale so as to be
> regarded as sufficiently continuous for a KS test, or is common sense choice
> of measurement precision widely regarded as sufficient?
> Any comments/references would be appreciated!
> David Middleton
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> 
