[Rd] ks.test doesn't compute correct empirical distribution if there are ties in the data (PR#1007)

mcdowella@mcdowella.demon.co.uk mcdowella@mcdowella.demon.co.uk
Sun, 1 Jul 2001 07:53:54 +0200 (MET DST)


Full_Name: Andrew Grant McDowell
Version: R 1.1.1 (but source in 1.3.0 looks fishy as well)
OS: Windows 2K Professional (Consumer)
Submission from: (NULL) (194.222.243.209)


In article <xeQ_6.1949$xd.353840@typhoon.snet.net>,
johnt@tman.dnsalias.com writes
>Can someone help?  In R, I am generating a vector of 1000 samples from 
>Bin (100, 0.25).  I then do a Kolmogorov Smirnov test to test if the 
>vector has been drawn from a population of Bin (100, 0.25).  I would
>expect a reasonably high p-value.....
>
>Either I am doing something wrong in R, or I am misunderstanding how this
>test should work (both quite possible)...
>
>
>Thanks,
>JT..
>
>
>
>> #### 1000 random samples from binomial dist with p =.25, size=100...
>> o<-rbinom (1000, 100, .25)
>> mean (o);
>[1] 25.178
>> var (o);
>[1] 19.61193
>> ks.test (o, "pbinom", 100, .25);
>
>        One-sample Kolmogorov-Smirnov test 
>
>data:  o 
>D = 0.0967, p-value = 1.487e-08 
>alternative hypothesis: two.sided
>
>
>
>p-value is mighty small, leading me to reject the null hypothesis that
>the sample has been drawn from the Bin(100, 0.25) distribution!!!
>
>
>

Some more oddities:

> o<-rbinom(10000, 1, 0.25)
> ks.test(o, "pbinom", 1, 0.25)

         One-sample Kolmogorov-Smirnov test 

data:  o 
D = 0.75, p-value = < 2.2e-16 
alternative hypothesis: two.sided 

> length(o[o==0])
[1] 7491
> length(o[o==1])
[1] 2509
> o<-rep(0,10000)
> ks.test(o, "pbinom", 1, 0.25)

         One-sample Kolmogorov-Smirnov test 

data:  o 
D = 0.75, p-value = < 2.2e-16 
alternative hypothesis: two.sided 

> length(o[o==0])
[1] 10000
> length(o[o==1])
[1] 0

Here, replacing every observation with zero does not change the reported D value.
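
To make the expected value concrete, here is a minimal sketch of the distance
I would expect to see, sup |F_n(t) - F(t)|, evaluated directly. The helper
name ks_binom_distance is my own invention, and it assumes every observation
is an integer in 0..size, so that both step functions can only jump on that
support. The two vectors are rebuilt from the counts shown above rather than
redrawn. It says nothing about the p-value, only about D.

```r
# Hypothetical helper (not part of R): sup |F_n(t) - F(t)| for a
# Binomial(size, prob) null. Both CDFs are right-continuous step functions
# that jump only on 0..size, so checking those points is sufficient.
# Assumption: every observation in x is an integer in 0..size.
ks_binom_distance <- function(x, size, prob) {
  support <- 0:size
  # Empirical CDF at each support point; ties are counted together here.
  Fn <- vapply(support, function(t) mean(x <= t), numeric(1))
  max(abs(Fn - pbinom(support, size, prob)))
}

all_zero <- rep(0, 10000)
mixed    <- c(rep(0, 7491), rep(1, 2509))   # counts reported above

ks_binom_distance(all_zero, 1, 0.25)   # 1 - 0.75, i.e. 0.25
ks_binom_distance(mixed, 1, 0.25)      # |0.7491 - 0.75|, about 0.0009
```

With all zeros the empirical CDF is already 1 at 0 while pbinom(0, 1, 0.25)
is 0.75, so I would expect D = 0.25 there, not 0.75.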

After playing about with
ks.test(c(rep(0, x), rep(1, 1000-x)), "pbinom", 1, p)
for a bit, I conjecture that ks.test() takes no account
whatsoever of ties, but merely sorts the input values
and looks for max(position/N - pbinom(value, 1, p)).
Anybody got the source handy?
-- 
A. G. McDowell

After a 30-minute download, the relevant part of ks.test.R appears to be:

        METHOD <- "One-sample Kolmogorov-Smirnov test"
        n <- length(x)
        x <- y(sort(x), ...) - (0 : (n-1)) / n
        STATISTIC <- switch(alternative,
                            "two.sided" = max(c(x, 1/n - x)),
                            "greater" = max(1/n - x),
                            "less" = max(x))
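
To check that this snippet accounts for the numbers above, the sketch below
wraps the same expressions in a function and feeds it the two data sets,
rebuilt from the reported counts rather than redrawn. The name naive_ks_stat
and the intermediate variable d are hypothetical; only the arithmetic is
taken from the quoted source.

```r
# Hypothetical wrapper (not part of R) around the expressions quoted above.
# y is the null CDF, and ... carries its parameters, as in ks.test().
naive_ks_stat <- function(x, y, ..., alternative = "two.sided") {
  n <- length(x)
  # F(x_(i)) - (i - 1)/n: each sorted value is compared with its own rank,
  # so a run of tied values is treated as n separate steps of height 1/n.
  d <- y(sort(x), ...) - (0:(n - 1)) / n
  switch(alternative,
         "two.sided" = max(c(d, 1/n - d)),
         "greater"   = max(1/n - d),
         "less"      = max(d))
}

all_zero <- rep(0, 10000)
mixed    <- c(rep(0, 7491), rep(1, 2509))   # counts reported above

naive_ks_stat(all_zero, pbinom, 1, 0.25)    # 0.75
naive_ks_stat(mixed, pbinom, 1, 0.25)       # 0.75
```

In both cases the maximum comes from the very first sorted zero, which is
compared with (1 - 1)/n = 0 instead of with the full proportion of zeros:
pbinom(0, 1, 0.25) - 0 = 0.75. That is why D does not move when the data
are zeroed out.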
