[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
GlenB
Wed May 31 06:13:31 CEST 2017
Martin Maechler says in reply to Sergueï Sokol
> Note the 'Subject' you've chosen for this thread,
> "... does not produce the correct Tukey line",
The choice of title was mine, not Serguei's; I posted the original message
where the error was pointed out.
I agree with Martin's assessment that the correct split for 19 points
(both by Tukey's lights and by general practice) would be 6, 7, 6. I also
agree that it's better to "fix more" in this instance, where possible
(e.g. Johnstone & Velleman's standard errors would be a nice thing to add,
if feasible). But if any blame is attached to the choice of title, it
really should be aimed at me.
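
For concreteness, here is a minimal sketch of the thirds rule behind the
6, 7, 6 split, assuming all x values are distinct (ties need extra care and
are ignored here). It is written in Python purely for illustration; the
names tukey_group_sizes and resistant_line are hypothetical, and this is
not the code of line(), MMline() or the "ABC of EDA" Fortran. The fit is
the plain, non-iterated three-group line, with the intercept taken as the
median residual, which is only one of several choices in use.

```python
from statistics import median


def tukey_group_sizes(n):
    """Sizes (left, middle, right) of the three groups for n distinct x values.

    Assumed rule: equal thirds when n is divisible by 3; one leftover point
    goes to the middle group; two leftover points go to the outer groups.
    """
    k, r = divmod(n, 3)
    if r == 0:
        return (k, k, k)
    if r == 1:
        return (k, k + 1, k)
    return (k + 1, k, k + 1)


def resistant_line(x, y):
    """Non-iterated three-group resistant line; returns (intercept, slope).

    Illustrative only: no handling of ties in x, and no iteration on the
    slope as recommended in the literature cited in this thread.
    """
    pts = sorted(zip(x, y))                  # order the points by x
    n_left, n_mid, _ = tukey_group_sizes(len(pts))
    left = pts[:n_left]
    right = pts[n_left + n_mid:]
    # Summary point of each outer group: (median of x, median of y).
    x_l, y_l = median(p[0] for p in left), median(p[1] for p in left)
    x_r, y_r = median(p[0] for p in right), median(p[1] for p in right)
    slope = (y_r - y_l) / (x_r - x_l)
    # Intercept as the median of y - slope * x over all points.
    intercept = median(yi - slope * xi for xi, yi in pts)
    return intercept, slope


print(tukey_group_sizes(19))
```

With n = 19 = 3 * 6 + 1, the single leftover point goes to the middle
group, which gives the 6, 7, 6 split discussed above.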
Glen
On Wed, May 31, 2017 at 2:51 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
> >>>>> Serguei Sokol <sokol at insa-toulouse.fr>
> >>>>> on Tue, 30 May 2017 16:01:17 +0200 writes:
>
> > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ...
> >> However, even after the patch, the example from the SO
> >> post differs from the result of Richie Cotton's
> >> function...
> > The explanation is quite simple. In SO function, the first
> > 1/3 quantile of used example counts 6 points (of 19 in
> > total), while line()'s definition of quantile leads to 8
> > points. The same numbers (6 and 8) are on the other end of
> > sample.
>
> so the number of obs. for the three thirds for line() are
> {8, 3, 8} in line() [also, after your patch, right?]
>
> whereas in MMline() they are as they should be, namely
>
> {6, 7, 6}
>
> But the {8, 3, 8} split is not at all what all "the literature",
> including Tukey himself says that "should" be done.
> (Other literature on the topic suggests that the optimal sizes
> of the split in three groups depends on the distribution of x ..)
>
> OTOH, MMline() does exactly what "the literature" and also the
> reference on the ?line help pages says.
>
> > In the x sample, there are a few repeated values; this
> > is certainly the reason for the different quantiles.
>
> > I am not sure that one quantile definition is better or
> > more correct than the other.
>
> > So I would leave line()'s definition as is.
>
> you mean _after_ applying your patch, I assume.
>
> I currently tend to disagree. If we change line() we should
> rather fix more ..
> Note the 'Subject' you've chosen for this thread,
> "... does not produce the correct Tukey line",
> so I think we should get better.
>
> Apart from Richie / my MMline() function, I've also noticed
> that ACSWR :: resistant_line()
> exists.
>
> However "the literature" (see references below), notably the two
> with Hoaglin, strongly recommends smarter iterations, and
> -- lo and behold! -- when this topic came up last (for me) in
> Dec. 2014, I did spend about 2 days work (or more?) to get the
> FORTRAN code from the 1981 - book (which is abbreviated the
> "ABC of EDA") from a somewhat useful OCR scan into compilable
> Fortran code and then f2c'ed, wrote an R interface function,
> found problems, i.e., bugs, including infinite loops, fixed most
> AFAICS, but somehow did not finish making the result available.
>
> Yes, and I have too many other things on my desk... this will
> have to wait!
>
> References:
>
> Tukey, J. W. (1977). _Exploratory Data Analysis_, Reading
> Massachusetts: Addison-Wesley.
>
> Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics
> and Computing of Exploratory Data Analysis_ Duxbury Press.
>
> Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y
> versus x. Chapter 5 of _Understanding Robust and Exploratory Data
> Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W.
> Tukey. Wiley.
>
> Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line
> and Related Regression Methods. _Journal of the American
> Statistical Association_ *80*, 1041-1054. <URL:
> https://dx.doi.org/10.1080/01621459.1985.10478222>
>
>
> > Best, Sergueï.
>
> Martin Maechler, ETH Zurich (and R core team)
>
>
