[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

peter dalgaard pdalgd at gmail.com
Mon May 29 10:02:03 CEST 2017


A usually trustworthy R correspondent posted a pure R implementation on Stack Overflow (SO) at some point in his lost youth:

https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line

This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().
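
For reference, here is a minimal sketch of a three-group line in plain R. It is not the SO code and not what line() does. The name tukey3, the rule for group sizes when n is not a multiple of 3, and the choice of the median residual as intercept are all my assumptions, and ties in x are not handled.

```r
# Hypothetical helper: three-group resistant line, one pass, no iteration.
# Assumptions: points are grouped by rank of x, ties in x are ignored,
# and the intercept is taken as the median residual.
tukey3 <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 3)
  o <- order(x)
  x <- x[o]
  y <- y[o]
  n <- length(x)
  k <- n %/% 3
  r <- n %% 3
  # Outer groups get k points each; when two points are left over,
  # each outer group takes one of them (one leftover goes to the middle).
  n_outer <- k + (r == 2)
  left  <- seq_len(n_outer)
  right <- seq(n - n_outer + 1, n)
  # Summary point of each outer group: median x and median y of the SAME points.
  xL <- median(x[left]);  yL <- median(y[left])
  xR <- median(x[right]); yR <- median(y[right])
  slope <- (yR - yL) / (xR - xL)
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

tukey3(1:9, 1:9)   # intercept 0, slope 1
```

The point of the sketch is only that both coordinates of each summary point come from the same group of points, which is what GlenB describes below.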

Notice, incidentally, that

> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))

Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))

Coefficients:
[1]  -0.9407   1.1948

I.e., with noisy data the coefficients still come out near -1 and 1.2 rather than 0 and 1, so it is not likely an issue with exact integers or perfect fit.
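
The mismatch Duncan describes in the quoted text below can be sketched in plain R as follows. This is one reading of his description, not a transcription of line.c: the names q6 and mismatch_line are hypothetical, and the quantile rule (average of the order statistics at the floor and ceiling of (n - 1) * p) is an assumption on my part.

```r
# Hypothetical sketch: x summaries from the 1/6 and 5/6 quantiles,
# y summaries from medians of y over the outer thirds.
q6 <- function(z, num) {
  # Quantile at p = num/6 of sorted z (assumed rule): average the order
  # statistics at floor and ceiling of (n - 1) * p, zero-based positions.
  h <- (length(z) - 1) * num / 6
  (z[floor(h) + 1] + z[ceiling(h) + 1]) / 2
}

mismatch_line <- function(x, y) {
  z <- sort(x)
  xL <- q6(z, 1)                  # x summary: 1/6 quantile of x
  xR <- q6(z, 5)                  # x summary: 5/6 quantile of x
  yL <- median(y[x <= q6(z, 2)])  # y summary: median of y where x <= 1/3 quantile
  yR <- median(y[x >= q6(z, 4)])  # y summary: median of y where x >= 2/3 quantile
  slope <- (yR - yL) / (xR - xL)
  c(intercept = median(y - slope * x), slope = slope)
}

mismatch_line(1:9, 1:9)   # intercept -1, slope 1.2 (up to rounding)

# For points on the identity line, which n in 6..24 miss slope 1?
bad <- Filter(function(n) {
  !isTRUE(all.equal(unname(mismatch_line(1:n, 1:n)["slope"]), 1))
}, 6:24)
unique(bad %% 6)          # 2 and 3
```

Under these assumptions the sketch reproduces both the reported coefficients for (1:9, 1:9) and the cycle of length 6, because for n mod 6 equal to 2 or 3 the x summary is not the median x of the points whose y values are used.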

-pd



> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote:
> 
>> Tukey divides the points into three groups, not the x and y values
>> separately.
>> 
>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
> 
> Ah well, I can't get it for a week. But the fact that it's often called
> Tukey's three group line (try a search on *tukey three group line* and
> you'll get plenty of hits) is pretty much a giveaway.
> 
> 
> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote:
> 
>> Tukey divides the points into three groups, not the x and y values
>> separately.
>> 
>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
>> 
>> 
>> 
>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
>> wrote:
>> 
>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>> 
>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>> 
>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>> intercept -1 and slope 1.2
>>>> 
>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
>>>> length 6, with four of every six correct.
>>>> 
>>>> Bug has been present across many versions.
>>>> 
>>>> The machine I tried it on just now has R 3.2.3:
>>>> 
>>> 
>>> If you look at the source (in src/library/stats/src/line.c), the
>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>> to a particular definition of quantile), and the y value is chosen as the
>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>> Those are different definitions (though I think they would be
>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>> line ends up "wrong".
>>> 
>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>> is wrong.
>>> 
>>> Duncan Murdoch
>>> 
>>> 
>> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
