[R] shapiro.test

Sat Feb 22 00:13:29 CET 2014

On 22/02/14 11:04, Rui Barradas wrote:
> Hello,
>
> Not answering your question directly, but if the sample size limit is a
> documented restriction of shapiro.test and you want a normality test, why
> don't you use ks.test (see ?ks.test)?
>
> m <- mean(HP_TrinityK25$V2)
> s <- sd(HP_TrinityK25$V2)
>
> ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Strictly speaking this is not a valid test.  The KS test is used for 
testing against a *completely specified* distribution.  If parameters 
are estimated from the same data, the usual KS null distribution no 
longer applies.  This may not be a "real" problem if the parameters are 
*well* estimated, as they would be in this instance (given that the 
sample size is so large).  I'm not sure about this.

The "Lilliefors" test is theoretically available in this context when
mu and sigma are estimated, but according to the Wikipedia article, the 
Lilliefors distribution is not known analytically and the critical 
values must be determined by Monte Carlo methods.  There is a 
"LillieTest" function in the "DescTools" package which makes use of some 
approximations to get p-values.
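
The Monte Carlo idea can be sketched in a few lines of base R.  This is
only an illustration, not the method used by "LillieTest": the function
name "lilliefors_mc" and the argument "n_sim" are made up for this example.
It relies on the fact that the KS statistic computed on standardised data
does not depend on the true mean and sd, so the null distribution can be
simulated from standard normal samples of the same size.

```r
# Monte Carlo p-value for the KS statistic when mean and sd are estimated.
# lilliefors_mc and n_sim are hypothetical names used for illustration only.
lilliefors_mc <- function(x, n_sim = 1000) {
  x <- x[!is.na(x)]
  n <- length(x)

  # KS distance between the empirical cdf of the standardised sample
  # and the standard normal cdf.
  ks_stat <- function(y) {
    z <- sort((y - mean(y)) / sd(y))
    p <- pnorm(z)
    i <- seq_along(z)
    max(i / length(z) - p, p - (i - 1) / length(z))
  }

  d_obs <- ks_stat(x)
  # Null distribution: same statistic on simulated normal samples,
  # re-estimating mean and sd each time.
  d_sim <- replicate(n_sim, ks_stat(rnorm(n)))

  # The +1 terms keep the estimated p-value away from exactly zero.
  p_value <- (1 + sum(d_sim >= d_obs)) / (n_sim + 1)
  list(statistic = d_obs, p.value = p_value)
}
```

Calling lilliefors_mc(HP_TrinityK25$V2) would then give a simulated
p-value; with a very large sample each replicate sorts n values, so a
modest n_sim may be preferable.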

However I think that a better approach would be to use a chi-squared 
goodness of fit test whereby you can adjust for estimated parameters 
simply by reducing the degrees of freedom.  I believe that the 
chi-squared test is somewhat low in power, but with a very large sample 
this should not be a problem.

The difficulty with the chi-squared test is that the choice of "bins" is 
somewhat arbitrary.  I believe the best approach is to take the bin 
boundaries to be the quantiles of the normal distribution (with 
parameters "m" and "s") corresponding to equispaced probabilities on 
[0,1], with the number of such probabilities being k+1 where
k = floor(n/5), n being the sample size.  This makes the expected counts 
all equal to n/k >= 5 so that the chi-squared test is "valid".  The 
degrees of freedom are then k-3 (k - 1 - #estimated parameters).
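
A minimal sketch of the recipe just described, in base R.  The function
name "chisq_normal_gof" is made up for this example, and it assumes the
sample is large enough that k > 3.

```r
# Chi-squared goodness-of-fit test for normality with estimated mean and sd,
# using k = floor(n/5) equiprobable bins.
# chisq_normal_gof is a hypothetical name used for illustration only.
chisq_normal_gof <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  k <- floor(n / 5)          # number of bins
  stopifnot(k > 3)           # need positive degrees of freedom

  m <- mean(x)
  s <- sd(x)

  # k + 1 equispaced probabilities on [0, 1]; the outer quantiles
  # are -Inf and Inf, so every observation falls in some bin.
  breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = m, sd = s)
  observed <- as.vector(table(cut(x, breaks = breaks)))

  expected <- n / k          # the same for every bin, and >= 5
  statistic <- sum((observed - expected)^2 / expected)

  df <- k - 3                # k - 1 - 2 estimated parameters
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p.value = p_value, bins = k)
}
```

For the data in question this would be called as
chisq_normal_gof(HP_TrinityK25$V2).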

One last comment:  I believe that it is generally considered that 
testing for normality is a waste of time and a pseudo-intellectual 
exercise of academic interest at best.

cheers,

Rolf Turner

>
>
> Hope this helps,
>
> Rui Barradas
>
> On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:
>> Dear R users,
>> Please help with this perhaps basic question. I am trying to check whether
>> my data are normal, but the file is large and the test does not work.
>> I keep getting the message: "Error in shapiro.test(x = HP_TrinityK25$V2) :
>> sample size must be between 3 and 5000"
>> thanks!
>>
>>   shapiro.test(x=HP_TrinityK25$V2)
>> Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be between 3 and 5000
>>
>> ##Note:
>> HP_TrinityK25= my file
>> HP_TrinityK25$V2= data in my file
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>