[R] boxplot question

Mon Nov 23 13:48:28 CET 2009

Antje wrote:
> Peter Ehlers wrote:
>> If there's been an answer to this, I've missed it.
>> Here's my take.
>>
>> Antje wrote:
>>> Hi there,
>>>
>>> I was wondering if anybody can explain to me why boxplot gives
>>> different results in the following case:
>>>
>>> I have some integer data as a vector, and I compare the boxplot stats
>>> of that vector with those of the same data divided by a factor.
>>>
>>> I've attached a csv file containing both data sets (d1, d2). The
>>> factor is 34.16667.
>>>
>>> If I run the boxplot function on d1 I get the following stats:
>>>
>>> 0.848...
>>> 0.907...
>>> 0.936...
>>> 0.965...
>>> 1.024...
>>>
>>> For d2 I get these stats:
>>>
>>> 29
>>> 31
>>> 32
>>> 33
>>> 36
>>>
>>>
>>> If I multiply the stats of d1 by the factor, I get
>>>
>>> 29
>>> 31
>>> 32
>>> 33
>>> 35
>>>
>>> Obviously different for the upper whisker. But why???
>>>
>>> Antje
>>
>> Antje:
>>
>> Three comments:
>> 1. I think your 'factor' is actually 205/6, not 34.16667.
>>
>> 2. This looks like another case of R FAQ 7.31 ("Why doesn't R think
>> these numbers are equal?"):
>>
>> # Let's take your d2 and create d1; I'll call them x and y:
>> x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))
>> y <- x * 6 / 205
>>
>> # x is your d2, sorted
>> # y is your d1, sorted
>>
>> # The critical values are x[202:203] and y[202:203];
>> x[201:204]
>> #[1] 35 35 36 36
>>
>> # The boxplot stats are:
>> sx <- boxplot.stats(x)$stats
>> sy <- boxplot.stats(y)$stats
>>
>> # Calculate potential extent of upper whisker:
>> ux <- sx[4] + (sx[4] - sx[2]) * 1.5  #36
>> uy <- sy[4] + (sy[4] - sy[2]) * 1.5  #1.053658536585366
>>
>> # Is y[203] <= uy?
>> y[203] <= uy
>> #[1] FALSE  #!!!
>>
>> y[202] <= uy
>> #[1] TRUE
>>
>> # For x:
>> x[203] <= ux
>> #[1] TRUE
>>
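>> And there's your answer: for y the whisker
>> goes to y[202], not y[203]. Mathematically, y[203] and uy
>> are equal (both are 216/205), but in floating-point
>> arithmetic uy comes out slightly smaller than y[203],
>> so y[203] is treated as lying beyond the whisker.
>>
>> 3. Last comment: I would not use boxplots for data like this.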
>>
>>  -Peter Ehlers
>>
>>
> Hi Peter,
> 
> Thanks a lot for your explanation! Now I understand the difference. I
> was using the boxplot statistics to filter outliers from my data. Do you
> have any suggestion as to what I could use instead? (I was trying to
> improve the estimates of mean and sd by iteratively removing outliers
> identified by the boxplot stats...)
> 
> Antje
> 
> 
Removing outliers to 'improve ...' is always problematic.
Perhaps you should not use mean or sd? Consider robust alternatives,
e.g. median/IQR. Which one is appropriate very much depends on the
purpose of the analysis. See the CRAN Task View on Robust Statistical
Methods.
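
As a minimal sketch of what "robust alternatives" could look like, using
only base R (package stats) and the x vector from the quoted message.
The variable names are illustrative, and whether median, IQR or MAD is
the right summary depends on the analysis:

```r
# Data from the quoted message (the integer data d2, sorted)
x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))

# Classical estimates: both are pulled around by extreme values
classical <- c(mean = mean(x), sd = sd(x))

# Robust location and scale, no outlier removal needed
center     <- median(x)
spread_iqr <- IQR(x)   # interquartile range
spread_mad <- mad(x)   # median absolute deviation; by default scaled by
                       # 1.4826 so it is comparable to sd for normal data

robust <- c(median = center, IQR = spread_iqr, MAD = spread_mad)
print(classical)
print(robust)
```

These are computed on all observations, so there is no iterative
trimming step whose result could depend on rounding.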

For outlier detection, there's the 'outliers' package on CRAN. I haven't
used it. There seems to be quite a bit more: I got 277 hits from:

library(sos)
???"outlier"
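
If boxplot-based filtering is kept anyway, one way to sidestep the
rounding issue from the quoted message might be to do the comparisons
on the exact integer scale and rescale afterwards. This is only a
sketch under that assumption; scale_factor, is_out and y_clean are
illustrative names, not anything from the original data file:

```r
# Integer data from the quoted message (d2, sorted)
x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))
scale_factor <- 6 / 205   # the scaling identified in the quoted message

# Compute the stats where the arithmetic is exact (whole numbers),
# then convert them to the scaled units
bs_x         <- boxplot.stats(x)
stats_x      <- bs_x$stats
stats_scaled <- stats_x * scale_factor

# Flag outliers on the integer scale, then apply the flag to scaled data
is_out  <- x %in% bs_x$out
y_clean <- (x * scale_factor)[!is_out]

print(stats_x)
print(stats_scaled)
print(length(y_clean))
```

Because the hinges and the 1.5 * IQR fence are whole numbers here, the
comparison at the fence does not depend on floating-point rounding.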

  -Peter Ehlers