[R] Simple qqplot question

Bert Gunter gunter.berton at gene.com
Fri Jun 25 18:02:41 CEST 2010


To add to/modify what Joris (and I) previously said:

1. qqplots are not cumulative distribution plots. Hence, as Joris said, the
S-shape indicates short tails/bimodality compared to the normal. Why you
continue to insist on carrying out normality tests that, with so many points,
will obviously reject is beyond me! The bimodality is what's important. Why
is it there? What is it telling you about your data (perhaps some sort of
measurement shift...)?
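
For illustration only, here is a small simulated sketch of point 1. The
mixture is made up (it is not the poster's data) and the variable names are
arbitrary. It checks that, for a bimodal sample, both ends of the normal Q-Q
plot fall inside the qqline()-style reference line (short tails), and that a
formal normality test rejects decisively at this sample size:

```r
set.seed(1)

# Hypothetical bimodal sample: equal mixture of two well-separated normals
n <- 5000                      # shapiro.test() accepts at most 5000 values
x <- c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2))

# Normal Q-Q coordinates, computed without drawing anything
qq <- qqnorm(x, plot.it = FALSE)

# Reference line through the quartile pairs, as qqline() draws it
qy <- quantile(x, c(0.25, 0.75), names = FALSE)
qx <- qnorm(c(0.25, 0.75))
slope <- diff(qy) / diff(qx)
intercept <- qy[1] - slope * qx[1]

# Compare the most extreme points with what the line predicts there
lo <- which.min(qq$x)
hi <- which.max(qq$x)
upper_gap <- (intercept + slope * qq$x[hi]) - qq$y[hi]  # > 0: short upper tail
lower_gap <- qq$y[lo] - (intercept + slope * qq$x[lo])  # > 0: short lower tail

# With thousands of points the test rejects; it says nothing about *why*
p <- shapiro.test(x)$p.value
```

Calling qqnorm(x); qqline(x) on the same sample draws the picture itself.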

2. My prior suggestion for plotting a reference line -- and Joris's
confidence interval recommendations -- are in some sense wrong. The reason
is that they give the conditional expectation and confidence intervals
thereof of the quantiles of the "y" distribution conditioned on those of the
"x". What you probably want is the "correlation" line. One simple "robust"
estimate of this -- and quick to calculate -- is just to mimic qqline() and
calculate the 1st and 3rd quartiles of both distributions and use the line
joining the corresponding quartile pairs ((1st,1st) and (3rd,3rd)). I leave
the trivial algebra to you -- quantile() gets the quartiles.
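
The algebra can be sketched as follows. This is a minimal sketch, not an
authoritative method: qq_quartile_line() is a hypothetical helper name (it is
not a base R function), and the data are simulated for illustration:

```r
# Hypothetical helper: intercept and slope of the line joining the
# (1st, 1st) and (3rd, 3rd) quartile pairs of two samples, mimicking qqline()
qq_quartile_line <- function(x, y, probs = c(0.25, 0.75)) {
  qx <- quantile(x, probs, names = FALSE, na.rm = TRUE)
  qy <- quantile(y, probs, names = FALSE, na.rm = TRUE)
  slope <- diff(qy) / diff(qx)
  intercept <- qy[1] - slope * qx[1]
  c(intercept = intercept, slope = slope)
}

set.seed(42)
x <- rnorm(10000)
y <- 3 + 2 * rnorm(10000)      # same shape as x, shifted and rescaled

# Should come out close to intercept 3, slope 2 for these simulated data
line_xy <- qq_quartile_line(x, y)
```

To overlay it on the two-sample plot, call qqplot(x, y) and then
abline(line_xy["intercept"], line_xy["slope"]). Using quartiles rather than
lm() on the plotted quantiles keeps the line from being dragged around by the
extreme tails.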

Of course, there's a literature on this if you want to do something
authoritative -- and perhaps R functions somewhere based on it. Perhaps some
kind (and wiser than I) soul will provide references. 

(However, I doubt that the line so obtained will differ appreciably from my
earlier "incorrect" recommendation, which was probably good enough for
eyeballing in most cases.)

Finally, risking hubris again, I would suggest that if the two distributions
with so many points really are essentially identical, then this is
scientifically "uninteresting" -- that is, the identity is a logical (and
trivial) consequence of the systematic way in which the data were obtained,
some sort of software (data collection?) issue, or the like -- i.e. not
indicative of a scientifically interesting phenomenon. It might even
indicate a problem with the data/measurements. My reasoning: real
variability prohibits such identity. The identical bimodality may be a clue
here. Again, note that I know nothing about what you are doing, and you are
therefore justified in publicly chastising me for such ignorant speculation
if I am wrong. 

I would welcome comments and criticisms from others on such speculation
also.

HTH,

-- Bert


Bert Gunter
Genentech Nonclinical Biostatistics
 
 

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Joris Meys
Sent: Friday, June 25, 2010 2:15 AM
To: Ralf B
Cc: R mailing list
Subject: Re: [R] Simple qqplot question

Sorry, missed the two variable thing. Go with the lm solution then,
and you can tweak the plot yourself (the confidence intervals are
easily obtained via predict(lm.object, interval="prediction") ). The
function qq.plot uses robust regression, but in your case normal
regression will do.

Regarding the shapes: this just indicates both tails are shorter than
expected, so you have a kurtosis less than 3 (or negative, if you use
excess kurtosis, i.e. subtract 3).

Cheers
Joris

On Fri, Jun 25, 2010 at 4:10 AM, Ralf B <ralf.bierig at gmail.com> wrote:
> Short rep: I have two distributions, data and data2; each built from
> about 3 million data points; they appear similar when looking at
> densities and histograms. I plotted qqplots for further eye-balling:
>
> qqplot(data, data2, xlab = "1", ylab = "2")
>
> and get an almost perfect diagonal line which means they are in fact
> very alike. Now I tried to check normality using qqnorm -- and I think
> I am doing something wrong here:
>
> qqnorm(data, main = "Q-Q normality plot for 1")
> qqnorm(data2, main = "Q-Q normality plot for 2")
>
> I am getting perfect S-shaped curves (??) for both distributions. Am I
> missing something here?
>
> |
> |                               *  *   *  *
> |                           *
> |                        *
> |                    *
> |               *
> |            *
> |         *
> | * * *
> |---------------------------------------------
>
> Thanks, Ralf
>



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
