[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol sokol at insa-toulouse.fr
Thu Jun 1 11:40:29 CEST 2017


On 31/05/2017 at 22:00, Martin Maechler wrote:
>>>>>> Serguei Sokol <sokol at insa-toulouse.fr>
>>>>>>      on Wed, 31 May 2017 18:46:34 +0200 writes:
>      > On 31/05/2017 at 17:30, Serguei Sokol wrote:
>      >>
>      >> More thorough reading revealed that I have overlooked this phrase in the
>      >> line's doc: "left and right /thirds/ of the data" (emphasis is mine).
>      > Oops. I read the first hit returned by Google, and it happened to be
>      > TIBCO's doc, not R's. The layout is very similar, hence my mistake.
>      > The latter does not mention "thirds" but ...
>      > Anyway, here is a new patch for line() which still gives a result slightly different
>      > from MMline(). The slope is the same but not the intercept.
>      > What are the exact terms for the intercept calculation that should be implemented?
>
>      > Serguei.
>
> Sorry Serguei, I have had a new version of line.c since yesterday,
> and will not be disturbed anymore.
>
> Note that I *did* give the literature, and it seems most
> discussants don't have paper books in physical libraries anymore;
> in this case, interestingly, you need one of those I think -
> almost everything I found online did not have the exact details.
Fortunately, you keep good old habits regarding paper books ;)

> Peter Dalgaard definitely was right that Tukey did not use
> quantiles at all, and notably did *not* define the three groups
> via   {i;  x_i <= x_L}  and {i; x_i >= x_R}  which (as I think
> you noticed) may make the groups quite unbalanced in case of duplicated x's.
>
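To make the imbalance concrete, here is a minimal sketch in Python (not R, and not the code in line.c). The cut points x_L and x_R are taken as the last value of the lowest positional third and the first value of the highest positional third; that choice, the equal-thirds split, the function names and the data are all assumptions made only for illustration.

```python
def sizes_by_threshold(x):
    """Group sizes when left = {i : x_i <= x_L} and right = {i : x_i >= x_R}.

    Assumed cut points: x_L and x_R are read off the sorted data at the
    boundaries of the positional thirds. Assumes x_L < x_R.
    """
    xs = sorted(x)
    n = len(xs)
    k = n // 3
    x_l, x_r = xs[k - 1], xs[n - k]
    left = sum(1 for v in x if v <= x_l)
    right = sum(1 for v in x if v >= x_r)
    return left, n - left - right, right


def sizes_by_position(x):
    """Group sizes when the sorted data are simply cut by position."""
    n = len(x)
    k = n // 3
    return k, n - 2 * k, k


# Hypothetical data with many duplicated x values at the low end.
x = [1, 1, 1, 1, 1, 2, 3, 4, 5]
print(sizes_by_threshold(x))  # (5, 1, 3)
print(sizes_by_position(x))   # (3, 3, 3)
```

With the threshold rule every tied value equal to x_L lands in the left group, so the groups come out as 5, 1 and 3 points instead of 3, 3 and 3.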
> But then, for now I had decided to fix the bug (namely computing
> the x-medians wrongly as you diagnosed correctly(!) -- but your
> first 2 patches only fixed it partly) *and* go at least one step in
> the direction of Tukey's original, namely by allowing iteration via a new 'iter' argument.
Hm, I did not use iterations. A newly introduced indx is used to keep
the index permutation when x is sorted.
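A rough sketch of the idea in Python (not the patch itself; how the patch actually uses indx is not shown in this message). The name indx comes from the sentence above, while the helper name, the data and the equal-thirds split are assumptions for illustration: sorting x through a permutation lets the matching y values be picked out for each group without reordering y.

```python
import statistics


def sort_with_permutation(x):
    """Return (indx, xs): indx[j] is the original position of the j-th smallest x."""
    indx = sorted(range(len(x)), key=lambda i: x[i])
    return indx, [x[i] for i in indx]


# Hypothetical data; y is left in its original order.
x = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]
y = [30.0, 10.0, 20.0, 50.0, 40.0, 60.0]

indx, xs = sort_with_permutation(x)
k = len(x) // 3  # assumed group size: equal thirds

# y values belonging to the left and right groups of the sorted x.
left_y = [y[i] for i in indx[:k]]
right_y = [y[i] for i in indx[-k:]]

print(indx)                                                   # [1, 2, 0, 4, 3, 5]
print(statistics.median(left_y), statistics.median(right_y))  # 15.0 55.0
```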

> I have also updated the help page to document what  line()  has
> been computing all these years {apart from the bug which
> typically shows for non-equidistant x[]}.
You mean "non equally sized"? (again ;) )

> We could also consider to eventually add a new   'method = <string>'
> argument to line()  one version of which would continue to
> compute the current solution,
If the current solution is considered plainly wrong, why continue
to implement it? Unless by "current solution" you mean your implementation
equivalent to my patch2, which fixes the group sizes.

>   another would compute the one
> corresponding to Velleman & Hoaglin (1981)'s  FORTRAN
> implementation (which had to be corrected for some infinite-loop
> cases!)... not in the near future though
What would be the benefit of this Fortran version? Faster? More accurate?

> Given all these discussions here, I think I should commit what I
> currently have  ASAP.
+1.

Serguei.


