r.turner at auckland.ac.nz
Sat Feb 22 00:13:29 CET 2014
On 22/02/14 11:04, Rui Barradas wrote:
> This does not answer your question directly, but if the sample size is a
> documented limitation of shapiro.test and you want a normality test, why
> not use ks.test (see ?ks.test)?
> m <- mean(HP_TrinityK25$V2)
> s <- sd(HP_TrinityK25$V2)
> ks.test(HP_TrinityK25$V2, "pnorm", m, s)
Strictly speaking this is not a valid test. The KS test is used for
testing against a *completely specified* distribution. If parameters
are estimated from the same data, the standard null distribution of the
test statistic is no longer applicable. This may not be a "real" problem
if the parameters are *well* estimated, as they would be in this instance
(given that the sample size is so large). I'm not sure about this.
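One way to probe that doubt is a small simulation: draw samples that
really are normal, run ks.test with the mean and sd estimated from each
sample, and see how often the test rejects at the 5% level. This is only
a sketch; "n_rep" and "n" are arbitrary illustrative values, not anything
from the original data.

```r
set.seed(1)
n_rep <- 2000   # number of simulated samples (illustrative choice)
n     <- 200    # size of each sample (illustrative choice)

# p-value from ks.test when mean and sd are estimated from the same sample
pvals <- replicate(n_rep, {
  x <- rnorm(n)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# For a correctly calibrated test about 5% of the p-values fall below 0.05
reject_rate <- mean(pvals < 0.05)
reject_rate
```

In simulations of this kind the rejection rate typically comes out far
below 5%, i.e. the naive test is conservative, and this does not go away
as n grows.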
The "Lilliefors" test is theoretically available in this context when
mu and sigma are estimated, but according to the Wikipedia article, the
Lilliefors distribution is not known analytically and the critical
values must be determined by Monte Carlo methods. There is a
"LillieTest" function in the "DescTools" package which makes use of some
approximations to get p-values.
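The Monte Carlo idea can be sketched in base R as follows. This is not
how "LillieTest" works internally; "lillie_mc" and "n_sim" are
hypothetical names made up for the illustration. It relies on the fact
that the null distribution of the statistic, with mean and sd estimated,
does not depend on the true mu and sigma, so standard normal samples can
be used for the simulation.

```r
# Monte Carlo p-value for the KS statistic with estimated mean and sd.
# lillie_mc is a hypothetical helper, not a function from any package.
lillie_mc <- function(x, n_sim = 1000) {
  ks_stat <- function(y) {
    unname(ks.test(y, "pnorm", mean(y), sd(y))$statistic)
  }
  d_obs <- ks_stat(x)
  # Simulate the statistic under the null from standard normal samples
  # of the same size as the data.
  d_sim <- replicate(n_sim, ks_stat(rnorm(length(x))))
  # Add-one correction keeps the estimated p-value away from exactly 0
  (1 + sum(d_sim >= d_obs)) / (n_sim + 1)
}

set.seed(42)
p_norm <- lillie_mc(rnorm(500))  # sample that really is normal
p_exp  <- lillie_mc(rexp(500))   # clearly non-normal sample
c(normal = p_norm, exponential = p_exp)
```

For a sample of many thousands of observations each simulated replicate
is as large as the data, so this can get slow; reduce "n_sim" if needed.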
However I think that a better approach would be to use a chi-squared
goodness of fit test whereby you can adjust for estimated parameters
simply by reducing the degrees of freedom. I believe that the
chi-squared test is somewhat low in power, but with a very large sample
this should not be a problem.
The difficulty with the chi-squared test is that the choice of "bins" is
somewhat arbitrary. I believe the best approach is to take the bin
boundaries to be the quantiles of the normal distribution (with
parameters "m" and "s") corresponding to equispaced probabilities on
[0,1], with the number of such probabilities being k+1 (giving k bins)
where k = floor(n/5), n being the sample size. This makes the expected
counts all equal to n/k >= 5 so that the chi-squared test is "valid". The
degrees of freedom are then k-3 (k - 1 - 2, since the two parameters mu
and sigma are estimated).
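A minimal sketch of this recipe in base R is given here.
"chisq_normal_gof" is a hypothetical helper name, and the simulated data
merely stand in for the real sample.

```r
# Chi-squared goodness-of-fit test for normality with equiprobable bins.
# chisq_normal_gof is a hypothetical helper, not a function from any package.
chisq_normal_gof <- function(x) {
  n <- length(x)
  stopifnot(n >= 20)              # need k >= 4 so that df = k - 3 >= 1
  m <- mean(x)
  s <- sd(x)
  k <- floor(n / 5)               # number of bins
  probs  <- (0:k) / k             # k + 1 equispaced probabilities on [0,1]
  breaks <- qnorm(probs, m, s)    # bin boundaries; first is -Inf, last is Inf
  # Observed count in each of the k bins
  observed <- tabulate(cut(x, breaks, labels = FALSE), nbins = k)
  expected <- n / k               # the same for every bin, and >= 5
  stat <- sum((observed - expected)^2 / expected)
  df   <- k - 3                   # k - 1 - 2 estimated parameters
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(1)
res_norm <- chisq_normal_gof(rnorm(2000))  # sample that really is normal
res_exp  <- chisq_normal_gof(rexp(2000))   # clearly non-normal sample
res_norm$p.value
res_exp$p.value
```

With the data from the original question the call would be
chisq_normal_gof(HP_TrinityK25$V2).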
One last comment: I believe that it is generally considered that
testing for normality is a waste of time and a pseudo-intellectual
exercise of academic interest at best.
> Hope this helps,
> Rui Barradas
> Em 21-02-2014 15:59, Gonzalo Villarino Pizarro escreveu:
>> Dear R users,
>> Please help with this maybe basic question. I am trying to see if my
>> data is normal, but it is a large file and the test does not work.
>> I keep getting the message: "Error in shapiro.test(x = HP_TrinityK25$V2)
>> : sample size must be between 3 and 5000"
>> HP_TrinityK25= my file
>> HP_TrinityK25$V2= data in my file
>> R-help at r-project.org mailing list
>> PLEASE do read the posting guide
>> and provide commented, minimal, self-contained, reproducible code.