[R] shapiro.test

Sat Feb 22 11:44:40 CET 2014

Hello,

My comments are inline.

On 21-02-2014 23:13, Rolf Turner wrote:
> On 22/02/14 11:04, Rui Barradas wrote:
>> Hello,
>>
>> This does not answer your question directly, but if the sample size
>> limit is a documented restriction of shapiro.test and you want a
>> normality test, why don't you use ks.test (see ?ks.test)?
>>
>> m <- mean(HP_TrinityK25$V2)
>> s <- sd(HP_TrinityK25$V2)
>>
>> ks.test(HP_TrinityK25$V2, "pnorm", m, s)
>
> Strictly speaking this is not a valid test.  The KS test is used for
> testing against a *completely specified* distribution.  If there are
> parameters to be estimated, the null distribution is no longer
> applicable.  This may not be a "real" problem if the parameters are
> *well* estimated, as they would be in this instance (given that the
> sample size is over-large).  I'm not sure about this.

Yes, you're right. I hesitated before posting my answer precisely
because of this: the parameters must be pre-determined constants, not
values computed from the data. As Greg pointed out in his reply, the
help page ?ks.test also states this explicitly (which I had missed).
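
As a rough illustration of the problem, here is a small simulation
sketch. It uses simulated normal samples (the sample size n = 50 and the
number of replications are arbitrary choices of mine, not taken from the
original data) and compares the rejection rate of ks.test at the 5%
level when the mean and sd are estimated from the sample with the rate
when the true values are supplied.

```r
# Simulation sketch: KS test with estimated vs. known parameters.
# n and reps are arbitrary illustrative values.
set.seed(123)
n <- 50
reps <- 2000

# p-values when mean and sd are estimated from the same sample
p_est <- replicate(reps, {
  x <- rnorm(n)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# p-values when the true parameters (0 and 1) are supplied
p_known <- replicate(reps, {
  x <- rnorm(n)
  ks.test(x, "pnorm", 0, 1)$p.value
})

# Proportion of rejections at the 5% level under a true null
reject_est <- mean(p_est < 0.05)
reject_known <- mean(p_known < 0.05)
print(c(estimated = reject_est, known = reject_known))
```

With estimated parameters the test should reject much less often than
the nominal 5%, which is the sense in which the usual null distribution
no longer applies.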

The chi-squared gof test seems to be a good choice, given the sample size.

Rui Barradas
>
> The "Lilliefors" test is theoretically available in this context when
> mu and sigma are estimated, but according to the Wikipedia article, the
> Lilliefors distribution is not known analytically and the critical
> values must be determined by Monte Carlo methods.  There is a
> "LillieTest" function in the "DescTools" package which makes use of some
> approximations to get p-values.
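
The Monte Carlo idea can be sketched in base R without any extra
package. This is only an illustration under my own assumptions: the
function names lillie_stat and mc_lilliefors are made up for this
example, x is simulated stand-in data, and B = 2000 replications is an
arbitrary choice. It relies on the fact that the statistic is computed
from standardized data, so its null distribution does not depend on the
true mean and sd and can be simulated from N(0, 1).

```r
# Monte Carlo sketch of a Lilliefors-type test (hypothetical helper names).
set.seed(42)
x <- rnorm(200, mean = 5, sd = 3)  # stand-in for the real data

# KS distance between the sample and a normal fitted to that same sample
lillie_stat <- function(x) {
  z <- sort((x - mean(x)) / sd(x))
  n <- length(z)
  p <- pnorm(z)
  i <- seq_len(n)
  max(pmax(i / n - p, p - (i - 1) / n))
}

mc_lilliefors <- function(x, B = 2000) {
  d_obs <- lillie_stat(x)
  # Simulate the statistic under the null with parameters re-estimated
  # in every replicate
  d_null <- replicate(B, lillie_stat(rnorm(length(x))))
  # Monte Carlo p-value with the usual +1 correction
  p <- (1 + sum(d_null >= d_obs)) / (B + 1)
  list(statistic = d_obs, p.value = p)
}

res_mc <- mc_lilliefors(x)
print(res_mc)
```

For a very large sample each replicate sorts n values, so B may need to
be reduced to keep the run time reasonable.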
>
> However I think that a better approach would be to use a chi-squared
> goodness of fit test whereby you can adjust for estimated parameters
> simply by reducing the degrees of freedom.  I believe that the
> chi-squared test is somewhat low in power, but with a very large sample
> this should not be a problem.
>
> The difficulty with the chi-squared test is that the choice of "bins" is
> somewhat arbitrary.  I believe the best approach is to take the bin
> boundaries to be the quantiles of the normal distribution (with
> parameters "m" and "s") corresponding to equispaced probabilities on
> [0,1], with the number of such probabilities being k+1 where
> k = floor(n/5), n being the sample size.  This makes the expected counts
> all equal to n/k >= 5 so that the chi-squared test is "valid".  The
> degrees of freedom are then k-3 (k - 1 - #estimated parameters).
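
A minimal sketch of this recipe, as I read it, follows. The function
name chisq_normality is made up for the example, and x is simulated
data standing in for HP_TrinityK25$V2, which I do not have.

```r
# Chi-squared goodness-of-fit sketch for normality with estimated mean and sd.
set.seed(1)
x <- rnorm(20000, mean = 10, sd = 2)  # stand-in for HP_TrinityK25$V2

chisq_normality <- function(x, k = floor(length(x) / 5)) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  # k + 1 equispaced probabilities on [0, 1] give k equiprobable bins;
  # the outer boundaries are -Inf and Inf
  breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = m, sd = s)
  # Observed count in each bin
  observed <- tabulate(findInterval(x, breaks, all.inside = TRUE), nbins = k)
  # Every bin has the same expected count, n / k >= 5
  expected <- n / k
  stat <- sum((observed - expected)^2 / expected)
  # k - 1, minus 2 for the estimated mean and sd
  df <- k - 3
  p <- pchisq(stat, df = df, lower.tail = FALSE)
  list(statistic = stat, df = df, p.value = p)
}

res <- chisq_normality(x)
print(res)
```

With the real data the call would be
chisq_normality(HP_TrinityK25$V2), assuming that column is numeric and
has no missing values.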
>
> One last comment:  I believe that it is generally considered that
> testing for normality is a waste of time and a pseudo-intellectual
> exercise of academic interest at best.
>
> cheers,
>
> Rolf Turner
>
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>> On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:
>>> Dear R users,
>>> Please help with this possibly basic question. I am trying to check
>>> whether my data are normally distributed, but the file is large and
>>> the test does not work.
>>> I keep getting the message : "Error in shapiro.test(x =
>>> HP_TrinityK25$V2)
>>> :  sample size must be between 3 and 5000"
>>> thanks!
>>>
>>>   shapiro.test(x=HP_TrinityK25$V2)
>>> Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be
>>> between 3
>>> and 5000
>>>
>>> ##Note:
>>> HP_TrinityK25= my file
>>> HP_TrinityK25$V2= data in my file
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>