[R] normality test for large sample sizes

Greg Snow Greg.Snow at intermountainmail.org
Fri Apr 13 17:46:37 CEST 2007


The thing that you should be aware of is that the results of normality
tests on very large datasets are often meaningless.  Why do you want to
know whether the data are normal?

There are a couple of issues.  With very large datasets you have the
power to detect very minor deviations from normality, and since no real
dataset that will be analyzed is truly normal, a formal test will
generally give you a significant result.
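
As a rough illustration of that point, here is a small simulation
sketch.  The t distribution with 10 degrees of freedom is only my
stand-in for "nearly normal" data (it is not the poster's data), and
the sample sizes and number of replications are arbitrary choices.  The
rejection rate should come out much higher at n = 5000 than at n = 50,
even though the distribution is the same in both cases.

```r
set.seed(1)

# p-value of the Shapiro-Wilk test on n draws from a t distribution.
# df = 10 is an assumed example of a mild departure from normality.
p_value_at <- function(n, df = 10) {
  shapiro.test(rt(n, df = df))$p.value
}

# Fraction of simulated datasets for which normality is rejected at 5%.
reject_rate <- function(n, reps = 200) {
  mean(replicate(reps, p_value_at(n)) < 0.05)
}

small <- reject_rate(50)     # small sample: little power
large <- reject_rate(5000)   # largest n that shapiro.test accepts

print(c(n50 = small, n5000 = large))
```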

On the other hand, with really large datasets the central limit theorem
comes into play: for common analyses (regression, t-tests, ...) the
estimates are approximately normal even when the population is not, so
you really don't care whether the population is normally distributed.
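
A quick way to see the central limit theorem at work is sketched below.
The exponential population, the sample size of 1000 and the helper
function skewness() are all assumptions made for the illustration.  The
raw data are strongly skewed, while the means of large samples should
show skewness close to zero.

```r
set.seed(2)

# Simple moment-based sample skewness (illustrative helper, not from a package).
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

raw   <- rexp(10000)                          # skewed population values
means <- replicate(2000, mean(rexp(1000)))    # means of samples of size 1000

skew_raw   <- skewness(raw)
skew_means <- skewness(means)

print(c(raw = skew_raw, means = skew_means))
```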

A good rule of thumb is to do a qq-plot and ask "is this normal
enough?" rather than depending on formal tests of hypothesis.
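
Something along these lines would do it.  The vector res is a simulated
stand-in for the 160,000 residuals (I don't have the real ones), and the
correlation between theoretical and sample quantiles is just one
possible numeric companion to the plot, not a formal test.

```r
set.seed(3)

# Hypothetical stand-in for the residuals; replace with the real vector.
res <- rt(160000, df = 10)

# Compute the qq-plot coordinates without drawing, so they can be reused.
qq <- qqnorm(res, plot.it = FALSE)

# Correlation of theoretical vs. sample quantiles; near 1 means near-linear.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

# Draw the plot with small points (there are many) and a reference line.
plot(qq$x, qq$y, pch = ".",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot")
qqline(res, col = "red")
```

Departures in the far tails are what to look for; whether they matter
depends on the analysis you plan to do afterwards.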

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 
 

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Morgan Hough
> Sent: Friday, April 13, 2007 3:54 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] normality test for large sample sizes
> 
> I was wondering if it was possible to do a normality test on 
> a very large sample dataset. It is approx. 160,000 residual 
> estimates from meshes modelling the brain surfaces of 50 
> subjects (25 patients). 
> shapiro.test only works with at most 5000 points. Are there 
> issues with very large sample sizes that I should be aware of?
> 
> Cheers,
> 
> -Morgan


