[R] About normality tests...

Bert Gunter gunter.berton at gene.com
Wed Jun 23 20:55:58 CEST 2010


Ralf:

Don't bother testing. You will reject normality.

But don't bother paying attention to the results of the normality testing
anyway -- normality testing is generally useless. (IMO -- others disagree).
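
To illustrate the point about large samples, here is a minimal sketch on
simulated data (not Ralf's). It assumes a t distribution with 10 degrees of
freedom as a stand-in for data that are close to, but not exactly, normal;
the seed and sample sizes are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated stand-in for "nearly normal" data: t with 10 df (an assumption)
x <- rt(5000, df = 10)

# Same test on a small subsample and on all 5000 points
p_small <- shapiro.test(x[1:50])$p.value
p_large <- shapiro.test(x)$p.value

# The departure from normality is the same in both cases;
# only the amount of data differs
print(c(n50 = p_small, n5000 = p_large))
```

Comparing the two p-values shows how much the verdict depends on sample size
rather than on how far the data are from normal.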

DO pay attention to the plots; I would place a modest bet that you will find
that your data are not homogeneous with a strong central peak -- i.e. that
they may look more uniform-ish or even exhibit 2 or more modes, indicating
that you have a mixture of distributions. If true, this will have a
(possibly large) effect on statistical inference... and what this would mean
and what you should do depend very much on the substantive context in which
you are working (about which I know zip of course).
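
A minimal sketch of the kind of plots meant here, using base graphics. The
data are a hypothetical two-component mixture made up for illustration; with
real data, replace x by the variable of interest (e.g. mydata$myfield).

```r
set.seed(2)  # arbitrary seed

# Hypothetical mixture of two normals, standing in for non-homogeneous data
x <- c(rnorm(6000, mean = 0, sd = 1), rnorm(4000, mean = 4, sd = 1))

op <- par(mfrow = c(1, 2))   # two plots side by side
hist(x, breaks = 50, main = "Histogram", xlab = "x")
qqnorm(x, main = "Normal Q-Q plot")
qqline(x)                    # reference line through the quartiles
par(op)                      # restore previous graphics settings
```

Two or more modes in the histogram, or a systematic bend away from the
reference line in the Q-Q plot, would point to the kind of mixture described
above.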

If I'm wrong in my guesses, please reply to the list so that everyone knows
(including me). Hubris begs comeuppance.

Finally, FWIW, 10000 is not considered "very large" these days; maybe
10,000,000,000 might be...

Cheers,

Bert Gunter
Genentech Nonclinical Biostatistics

 
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Peter Ehlers
Sent: Wednesday, June 23, 2010 11:35 AM
To: Ralf B
Cc: r-help at r-project.org
Subject: Re: [R] About normality tests...

On 2010-06-23 12:05, Ralf B wrote:
> Hi all,
>
> I have two very large samples of data (10000+ data points) and would
> like to perform normality tests on them. I know that p < .05 means that
> a data set is considered as not normal with any of the two tests. I am
> also aware that large samples tend to lead more likely to normal
> results (Andy Field, 2005).

I think that depends on what you mean by 'tend to lead ...'

>
> I have a few questions to ensure that I am using them right.
>
> 1) The Shapiro-Wilk test requires to provide mean and sd. Is it
> correct to add here the mean and sd of the data itself (since I am
> comparing to a normal distribution with the same parameters) ?
>
> mySD <- sd(mydata$myfield)
> myMean <- mean(mydata$myfield)
> shapiro.test(rnorm(100, mean = myMean, sd = mySD))

I don't think that your understanding of the S-W test is correct.
You would just do:

  shapiro.test(mydata$myfield)

to test for Normality. However, shapiro.test() won't accept
sample sizes greater than 5000. So use ks.test. Or use a
graphical method: I like qq.plot in the 'car' package.
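
The steps above can be sketched as follows, on simulated data standing in for
mydata$myfield (the mean, sd and seed are made up). Base qqnorm() is used
instead of the 'car' function only to avoid a package dependency. One caveat,
which I believe is also noted in ?ks.test: when the mean and sd are estimated
from the same data, the reported p-value is only approximate.

```r
set.seed(3)  # arbitrary seed

# Simulated stand-in for mydata$myfield (10000 points)
x <- rnorm(10000, mean = 50, sd = 5)

# shapiro.test() stops with an error for n > 5000;
# try() captures the error so the script can continue
sw_full <- try(shapiro.test(x), silent = TRUE)
print(inherits(sw_full, "try-error"))

# One-sample Kolmogorov-Smirnov test against a normal cdf.
# Caveat: parameters are estimated from x, so the p-value is approximate.
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
print(ks$p.value)

# Graphical check with base R
qqnorm(x)
qqline(x)
```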

>
> 2) If I just want to test each distribution individually, I assume
> that I am doing a one-sample Kolmogorov-Smirnov test. Is that correct?

I don't understand this. What do you mean by 'test ... individually'?

>
> 3) If I simply want to know if normality exists or not, what should I
> put for the parameter 'alternative' ? Does it actually matter?
>
> alternative = c("two.sided", "less", "greater")

Leave it at the default 'two.sided' unless you have good
reason to suspect that the cdf lies above or below the Normal cdf.
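
For illustration only, a small sketch that runs ks.test() with each of the
three 'alternative' values on simulated, standardized data. The sample size
and seed are arbitrary, and standardizing with the sample mean and sd is a
simplification that makes the p-values approximate.

```r
set.seed(4)  # arbitrary seed

x <- rnorm(200)               # simulated data
z <- (x - mean(x)) / sd(x)    # standardize before comparing to N(0, 1)

alts <- c("two.sided", "less", "greater")

# p-value of the one-sample KS test under each alternative
p <- sapply(alts, function(a) ks.test(z, "pnorm", alternative = a)$p.value)
print(p)
```

The one-sided versions only look for the empirical cdf lying on one side of
the Normal cdf, which is why the default is the usual choice for a general
check.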

   -Peter Ehlers

>
> Thank you,
> Ralf
>

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


