[R] Difficulty with qqline in logarithmic context

François Pinard pinard at iro.umontreal.ca
Fri Feb 3 20:08:32 CET 2006


[Brian Ripley]
>Is there a good reason to use qqnorm in a single-log context?

Yes.  Googling around reveals this is not so uncommon.

> Should one not rather use

>>qqnorm(log(freq))
>>qqline(log(freq))

In the display produced by "qqnorm", the y-axis would then show 
"log(value)" labels, while the user (me!) expects "value" labels.

>since you are (I guess) looking at log-normality of freq?

Once again, I was merely toying with "qqplot".  I found it intriguing 
that, while shuffling messages around between folders for a good while, 
the distribution of log(number of messages) per folder appears vaguely 
normal, as I do not quickly see a reasonable justification for this.
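
On the labelling concern above, here is a minimal sketch (not from the 
original exchange) of plotting the log10 values while labelling the 
y-axis in original units.  The simulated data stands in for "freq", and 
the decade tick positions are an assumption chosen for illustration.

```r
set.seed(1)
y <- rlnorm(200, meanlog = 4, sdlog = 2)   # simulated stand-in for freq

# Plot on the log10 scale, suppressing the default y-axis
qqnorm(log10(y), yaxt = "n", ylab = "Sample Quantiles")
qqline(log10(y))                           # line and points share the log10 scale

# Put ticks at whole decades, labelled in original units
ticks <- 10^seq(floor(min(log10(y))), ceiling(max(log10(y))))
axis(2, at = log10(ticks), labels = ticks)
```

Because both the points and the line live on the same log10 scale, 
"qqline" needs no special treatment here; only the labels are changed.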

>Another way to look at that is

>>qqplot(qlnorm(ppoints(length(freq))), freq, log="xy")

>the same plot, different scales.

Interesting, thanks for teaching me about "ppoints".  Yet, I remain 
happier with the abscissa scale produced by "qqnorm".  Besides, how 
would one use "qqline" with the above?
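
One possible answer, as a sketch only: compute the line through the 
quartile pairs in log10 space and draw it with "lines" rather than 
"qqline".  The simulated data and the variable names below are 
assumptions for illustration, not part of the thread.

```r
set.seed(1)
y <- rlnorm(200, meanlog = 4, sdlog = 2)   # simulated stand-in for freq

qqplot(qlnorm(ppoints(length(y))), y, log = "xy")

# Quartile pairs, taken on the log10 scale of both axes
probs <- c(0.25, 0.75)
lx <- log10(qlnorm(probs))
ly <- log10(quantile(y, probs, names = FALSE))
slope <- diff(ly) / diff(lx)
intercept <- ly[1] - slope * lx[1]

# With log = "xy", par("usr") holds log10 limits; convert back to data units
xs <- 10^par("usr")[1:2]
lines(xs, 10^(intercept + slope * log10(xs)))
```

The line is straight on the log-log display because it is linear in 
log10(x) and log10(y), which is where the quartiles were matched.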

>(I believe a QQ plot should always have comparable scales on the two 
>axes.)

While comparable scales are somewhat simpler to compare, this is not 
necessarily what is most adequate for the user.  Proof is that while 
quantiles are being compared here, scales do not show quantiles, but 
units as meaningful to the user.  One might want to compare variables 
scaled very differently, maybe because of different units from the same 
distribution, or from different but similar distributions using 
different scales and shifted to different means.  Or even, why not, if 
this is what is meaningful for users, a log scale.

>The point is that qqline is tied to normality, not to log-normality.

As it stands, yes.  As a convenience, it could be extended (probably 
easily) to log-normality.  "qqnorm" already does something sensible in 
log-context, so a user might expect "qqline" to do equally well.
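
A minimal sketch of what such an extension might look like, assuming 
a log-scaled y-axis and a linear x-axis.  The name "qqline_logy" is 
hypothetical and not part of R; the quartile logic mirrors what "qqline" 
is described as doing, applied to log10(y), and the result is drawn with 
"lines" so that "abline" is not involved.

```r
# Hypothetical helper, not part of R: reference line for qqnorm(y, log = "y")
qqline_logy <- function(y, probs = c(0.25, 0.75), ...) {
    ly <- log10(y[!is.na(y)])
    qy <- quantile(ly, probs, names = FALSE)   # sample quantiles of log10(y)
    qx <- qnorm(probs)                         # matching normal quantiles
    slope <- diff(qy) / diff(qx)
    intercept <- qy[1] - slope * qx[1]
    x <- par("usr")[1:2]                       # x-axis is linear here
    lines(x, 10^(intercept + slope * x), ...)  # y given in original units
    invisible(c(intercept = intercept, slope = slope))
}

set.seed(1)
y <- rlnorm(200, meanlog = 4, sdlog = 2)       # simulated stand-in for freq
qqnorm(y, log = "y")
co <- qqline_logy(y)
```

The returned intercept and slope are in units of log10(y), which is the 
only scale on which a straight reference line is meaningful for this 
display.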

The real point might be that "qqline" is tied to "abline" a bit too 
blindly.  What is the meaning of intercept and slope of a straight line 
on a graphic in log context?  First, the intercept might not even exist.  
Second, "abline" interpretation depends on the clipping, and possibly 
on the extrema of the pretty breakpoints chosen for scales, making it 
hard to predict in ordinary use.  There ought to be some reason for the 
log-aware code in "abline", yet I did not find documentation for it.
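
For what it is worth, the "abline" help page in current R releases 
describes an "untf" argument: by default the line is drawn in the 
transformed coordinate system, while untf = TRUE draws the curve that 
corresponds to a straight line in original coordinates.  The sketch 
below illustrates that documented behaviour on made-up data; whether 
R 2.2.1, as used in this thread, behaved identically is not something 
the thread establishes.

```r
x <- 1:10
y <- 10^(0.5 + 0.2 * x)              # exactly linear in log10(y)
plot(x, y, log = "y")

# Default (untf = FALSE): coefficients refer to the transformed scale,
# that is, log10(y) = 0.5 + 0.2 * x
abline(0.5, 0.2)

# untf = TRUE: coefficients refer to original units, y = 30 * x,
# which shows up as a curve on a log-scaled axis
abline(0, 30, untf = TRUE, lty = 2)
```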

The wisest for "abline", in my very humble opinion, would be for it to 
complain if ever called in log context.  Then, "qqline" would indirectly 
complain through "abline", if "qqline" is not modified to do something 
more proper.  Moreover, if it is definitely out of the question that 
"qqline" ever be meaningfully called in log context, then the same goes 
for "qqnorm", which should then complain as well.

Currently, "qqline" misbehaves, in that it silently produces 
a meaningless result, while it could either diagnose that the result is 
meaningless, or produce a meaningful result.


[Remainder of the reply top-quoted, as usual on r-help.]

>On Wed, 1 Feb 2006, François Pinard wrote:

>>Hi, R friends.  I had some difficulty with the following code:

>>   qqnorm(freq, log='y')
>>   qqline(freq)

>>as the line drawn was seemingly random.  The exact data I used appears
>>below.  After wandering a bit within the source code for "abline",
>>I figured out I should rather write:

>>   qqnorm(freq, log='y')
>>   par(ylog=FALSE)
>>   qqline(log10(freq))
>>   par(ylog=TRUE)

>>I'm proposing that this little stunt rather be hidden and
>>automatically effected within "qqline" proper, whenever par('ylog') is
>>TRUE.  I thought about providing a patch, as "qqline" is so small.  Yet
>>it would be more noise than useful, as I'm not familiar with the "datax"
>>argument usage, which should probably be addressed as well.



>>Here is the data, in case useful:

>>freq <-
>>as.integer(c(33, 79, 21, 436, 58, 18, 1106, 498, 1567, 393, 2,
>>104, 50, 67, 113, 76, 327, 331, 196, 145, 86, 59, 12, 215, 293,
>>154, 500, 314, 246, 587, 85, 23, 323, 3, 13, 576, 29, 37, 24,
>>21, 1230, 137, 13, 93, 3, 101, 72, 218, 59, 17, 2, 8, 86, 143,
>>150, 22, 19, 234, 119, 157, 4, 255, 146, 126, 76, 15, 271, 170,
>>4, 6, 16, 3048, 2175, 3350, 5017, 5706, 1610, 665, 322, 1, 16,
>>47, 51, 168, 94, 66, 154, 99, 11, 547, 953, 1, 1071, 80, 184,
>>168, 52, 187, 103, 187, 361, 46, 85, 135, 597, 121, 283, 26,
>>12, 20, 169, 9, 79, 15, 114, 75, 30, 111, 556, 173, 32, 99, 438,
>>2, 2, 1, 117, 5, 3, 51, 8, 41, 12, 23, 2, 13, 5, 1, 9, 4, 1,
>>7, 15, 5, 48, 16, 112, 6, 1, 39, 60, 5, 23, 5, 19, 1, 8, 32,
>>4, 13, 1, 14, 71, 5, 1, 35, 30, 100, 389, 22, 8, 1, 192, 40,
>>6, 3, 17, 2, 14, 71, 14, 1, 5, 4, 32, 21, 18, 13, 2, 2, 45, 342,
>>46, 144, 18, 131, 188, 112, 37, 85, 90, 8, 195, 173, 5, 53, 96,
>>37, 16, 16, 281, 64, 50, 92, 336, 31, 744, 4, 134, 74, 1, 227,
>>6, 48, 418, 64, 66, 59, 20, 45, 20, 370, 148, 22, 7, 30, 601,
>>29, 82, 113, 938, 252, 65, 137, 72, 22, 98, 12, 152, 212, 13,
>>8, 35, 3, 77))

>>Yet this really is the value of "courriel$freq" after "data(courriel)",
>>with a file ".../R/data/courriel.R" here, holding:

>>courriel <- read.table(pipe('grep -c \'^From \' ../courriel/*'),
>>                       sep=':', as.is=TRUE, row.names=1,
>>                       col.names=c('fichier', 'freq'))

>>My goal, which is nothing serious, was merely to toy with the number of
>>messages per folder, for folders massaged out of R archives.



>>Version:
>>platform = i686-pc-linux-gnu
>>arch = i686
>>os = linux-gnu
>>system = i686, linux-gnu
>>status =
>>major = 2
>>minor = 2.1
>>year = 2005
>>month = 12
>>day = 20
>>svn rev = 36812
>>language = R

>>Locale:
>>LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=fr_CA.UTF-8;LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C

>>Search Path:
>>.GlobalEnv, package:methods, package:stats, package:graphics, 
>>package:grDevices, package:utils, package:datasets, fp.etc, Autoloads, 
>>package:base


>>-- 
>>François Pinard   http://pinard.progiciels-bpi.ca

>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! 
>>http://www.R-project.org/posting-guide.html


>-- 
>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>University of Oxford,             Tel:  +44 1865 272861 (self)
>1 South Parks Road,                     +44 1865 272866 (PA)
>Oxford OX1 3TG, UK                Fax:  +44 1865 272595


-- 
François Pinard   http://pinard.progiciels-bpi.ca



