ruipbarradas at sapo.pt
Sat Feb 22 11:44:40 CET 2014
On 21-02-2014 23:13, Rolf Turner wrote:
> On 22/02/14 11:04, Rui Barradas wrote:
>> This does not answer your question directly, but if the sample size
>> limit is a documented restriction of shapiro.test and you want a
>> normality test, why not use ks.test (see ?ks.test)?
>> m <- mean(HP_TrinityK25$V2)
>> s <- sd(HP_TrinityK25$V2)
>> ks.test(HP_TrinityK25$V2, "pnorm", m, s)
> Strictly speaking this is not a valid test. The KS test is used for
> testing against a *completely specified* distribution. If there are
> parameters to be estimated, the null distribution is no longer
> applicable. This may not be a "real" problem if the parameters are
> *well* estimated, as they would be in this instance (given that the
> sample size is over-large). I'm not sure about this.
Yes, you're right. I hesitated before posting my answer precisely
because of this: the parameters must be pre-determined constants, not
computed from the data. As Greg pointed out in his reply, the help
page ?ks.test also says so explicitly (which I had missed).
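To see the effect for oneself, here is a small simulation sketch. It draws
data that really are normal, plugs the estimated mean and sd into ks.test,
and records how often the test rejects at the 5% level. The sample size n,
the number of replicates reps and the seed are arbitrary choices of mine,
not anything from the original data. If the usual null distribution still
applied, the rejection rate would be close to 0.05; with estimated
parameters I would expect it to be well below that.

```r
# Simulation sketch: KS test with parameters estimated from the same data.
# n, reps and the seed are arbitrary illustrative choices.
set.seed(42)
n    <- 100
reps <- 500

p_vals <- replicate(reps, {
  x <- rnorm(n)                                  # data are truly normal
  ks.test(x, "pnorm", mean(x), sd(x))$p.value    # parameters estimated from x
})

# Proportion of (false) rejections at the nominal 5% level
reject_rate <- mean(p_vals < 0.05)
reject_rate
```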
The chi-squared gof test seems to be a good choice, given the sample size.
> The "Lilliefors" test is theoretically available in this context when
> mu and sigma are estimated, but according to the Wikipedia article, the
> Lilliefors distribution is not known analytically and the critical
> values must be determined by Monte Carlo methods. There is a
> "LillieTest" function in the "DescTools" package which makes use of some
> approximations to get p-values.
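For what it's worth, the Monte Carlo idea can be sketched in base R. This
is only a rough illustration and not the DescTools implementation: the
function name lillie_mc and the number of replicates B are hypothetical
choices of mine, and the example data are simulated. It relies on the fact
that the statistic is computed from standardised data, so the null
distribution can be simulated from standard normal samples of the same size.

```r
# Monte Carlo p-value for the KS statistic when mean and sd are estimated
# from the data (the Lilliefors setting). lillie_mc and B are hypothetical
# names chosen for this sketch.
lillie_mc <- function(x, B = 1000) {
  n <- length(x)

  # KS distance between the empirical CDF of the standardised data
  # and the standard normal CDF
  ks_stat <- function(y) {
    z <- sort((y - mean(y)) / sd(y))
    p <- pnorm(z)
    i <- seq_along(z)
    max(i / length(z) - p, p - (i - 1) / length(z))
  }

  d_obs <- ks_stat(x)
  # Null distribution: same statistic on simulated normal samples of size n
  d_sim <- replicate(B, ks_stat(rnorm(n)))

  list(statistic = d_obs,
       p.value   = (1 + sum(d_sim >= d_obs)) / (B + 1))
}

set.seed(1)
res_norm <- lillie_mc(rnorm(200), B = 500)   # normal data
res_exp  <- lillie_mc(rexp(200),  B = 500)   # clearly non-normal data
res_norm$p.value
res_exp$p.value
```

With real data one would call lillie_mc(HP_TrinityK25$V2); for a very large
sample each replicate costs a sort of n values, so B may need to be kept
moderate.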
> However I think that a better approach would be to use a chi-squared
> goodness of fit test whereby you can adjust for estimated parameters
> simply by reducing the degrees of freedom. I believe that the
> chi-squared test is somewhat low in power, but with a very large sample
> this should not be a problem.
> The difficulty with the chi-squared test is that the choice of "bins" is
> somewhat arbitrary. I believe the best approach is to take the bin
> boundaries to be the quantiles of the normal distribution (with
> parameters "m" and "s") corresponding to equispaced probabilities on
> [0,1], with the number of such probabilities being k+1 where
> k = floor(n/5), n being the sample size. This makes the expected counts
> all equal to n/k >= 5 so that the chi-squared test is "valid". The
> degrees of freedom are then k-3 (k - 1 - #estimated parameters).
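The recipe above can be sketched as follows. This is a minimal
illustration of the steps as described, under my own assumptions: the
function name chisq_normality is hypothetical, the data are simulated, and
the degrees of freedom simply follow the k - 3 rule stated above.

```r
# Chi-squared goodness-of-fit test for normality with estimated mean and sd,
# using equal-probability bins. chisq_normality is a hypothetical name.
chisq_normality <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)

  k <- floor(n / 5)          # number of bins, so expected count n/k >= 5
  stopifnot(k >= 4)          # need at least one degree of freedom

  # Bin boundaries: normal quantiles at k + 1 equispaced probabilities
  # (the first and last boundaries are -Inf and Inf)
  breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = m, sd = s)

  # Observed counts per bin, including empty bins
  observed <- tabulate(findInterval(x, breaks), nbins = k)
  expected <- n / k

  stat <- sum((observed - expected)^2 / expected)
  df   <- k - 3              # k - 1 - number of estimated parameters

  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(123)
fit_norm <- chisq_normality(rnorm(10000))   # normal data
fit_exp  <- chisq_normality(rexp(10000))    # clearly non-normal data
fit_norm$p.value
fit_exp$p.value
```

For the data in question the call would be
chisq_normality(HP_TrinityK25$V2).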
> One last comment: I believe that it is generally considered that
> testing for normality is a waste of time and a pseudo-intellectual
> exercise of academic interest at best.
> Rolf Turner
>> Hope this helps,
>> Rui Barradas
>> On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:
>>> Dear R users,
>>> Please help with this maybe basic question. I am trying to see if my
>>> data are normal, but it is a large file and the test does not work.
>>> I keep getting the message:
>>> Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be
>>> between 3 and 5000
>>> HP_TrinityK25= my file
>>> HP_TrinityK25$V2= data in my file
>>> R-help at r-project.org mailing list
>>> PLEASE do read the posting guide
>>> and provide commented, minimal, self-contained, reproducible code.