[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

peter dalgaard pdalgd at gmail.com
Wed May 31 16:57:59 CEST 2017


> On 31 May 2017, at 16:40 , Joris Meys <jorismeys at gmail.com> wrote:
> 
> And with "equally spaced" I obviously meant "of equal size". It's getting
> too hot in the office here...

We have a fair amount of cool westerly wind up here that I could transfer to you via WWTP (Wind and Weather Transport Protocol). If you open up a sufficiently large pipe, that is.

Anyway, in the past we have tried to follow Tukey's instructions on details like the definition of the "hinges" in boxplots, so presumably we should try to do likewise in this case.

I suspect that Tukey would say "divide the data into three roughly equal-sized groups" or some such. The obvious thing to do would be to allocate N %/% 3 points to each group and then distribute the N %% 3 remaining points symmetrically and as evenly as possible, which in my book would be (1, 0, 1) rather than (0, 2, 0) for the case N %% 3 == 2. If N %% 3 == 1, there is no alternative to (0, 1, 0) by this logic.
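
To make that allocation concrete, here is a minimal sketch in R. The helper name tukey_group_sizes is made up for illustration; it is not part of stats and is not what line() currently does, just the (left, middle, right) split described above.

```r
# Hypothetical helper: sizes of the left, middle and right groups for n
# points, giving each group n %/% 3 points and spreading the n %% 3
# leftover points symmetrically.
tukey_group_sizes <- function(n) {
  base <- n %/% 3
  extra <- switch(n %% 3 + 1,
                  c(0, 0, 0),   # n %% 3 == 0: nothing left over
                  c(0, 1, 0),   # n %% 3 == 1: extra point goes to the middle
                  c(1, 0, 1))   # n %% 3 == 2: one extra point in each outer group
  base + extra
}

tukey_group_sizes(18)  # 6 6 6
tukey_group_sizes(19)  # 6 7 6
tukey_group_sizes(20)  # 7 6 7
```

The three sizes always sum to n, and the left and right groups always have the same size, which is the symmetry argued for above.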

> 
> On Wed, May 31, 2017 at 4:39 PM, Joris Meys <jorismeys at gmail.com> wrote:
> 
>> Seriously, if a method gives a wrong result, it's wrong. line() does NOT
>> implement the algorithm of Tukey, not even after the patch. We're not
>> discussing Excel here, are we?
>> 
>> The method of Tukey is rather clear, and it is NOT using the default
>> quantile definition from the quantile function. Actually, it doesn't even
>> use quantiles to define the groups. It just says that the groups should be
>> more or less equally spaced. As the method of Tukey relies on the medians
>> of the subgroups, it would make sense to pick a method that is
>> approximately unbiased with regard to the median. That would be type 8
>> imho.
>> 
>> To get the size of the outer groups, Tukey would've been more than happy
>> enough with a:
>> 
>>> floor(length(dfr$time) / 3)
>> [1] 6
>> 
>> There you have the size of your left and right group, and now we can
>> discuss which median type should be used for the robust fitting.
>> 
>> But I can honestly not understand why anyone in his right mind would
>> defend a method that is clearly wrong while not working at Microsoft's
>> spreadsheet department.
>> 
>> Cheers
>> Joris
>> 
>> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at insa-toulouse.fr>
>> wrote:
>> 
>>> Le 31/05/2017 à 15:40, Joris Meys a écrit :
>>> 
>>>> OTOH,
>>>> 
>>>>> sapply(1:9, function(i){
>>>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>>>> + })
>>>> [1] 8 8 6 6 6 6 8 6 6
>>>> 
>>>> Only the default (type = 7) and the first two types give the result
>>>> line() gives now. I think there are plenty of reasons why any of
>>>> the other 6 types might be better suited to Tukey's method.
>>>> 
>>>> So to my mind, changing the definition of line() to give sensible output
>>>> that is in accordance with the theory does not imply any inconsistency
>>>> with the quantile definition in R. At least not with 6 out of the 9
>>>> different ones ;-)
>>>> 
>>> Nice shot.
>>> But OTOE (on the other end ;)
>>>> sapply(1:9, function(i){
>>> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>>> + })
>>> [1] 8 8 8 8 6 6 8 6 6
>>> 
>>> Here "8" gains 5 votes against 4 for "6". Two defector methods changed
>>> their point count between the two tails and should be discarded, which
>>> leaves us with a score of 3:4, still in favor of "6". But the default
>>> method should prevail, in my opinion.
>>> 
>>> Serguei.
>>> 
>> 
>> 
>> 
>> --
>> Joris Meys
>> Statistical consultant
>> 
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Mathematical Modelling, Statistics and Bio-Informatics
>> 
>> tel :  +32 (0)9 264 61 79
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>> 
> 
> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


