[R] Boxplot philosophy {was "Boxplot in R"}

Berton Gunter gunter.berton at gene.com
Tue Jul 12 00:18:23 CEST 2005


I have been an enthusiastic user of boxplots for decades. Of course, the
issue of how to handle the whiskers ("outliers") is a valid one, and indeed
sample size related. Dogma is always dangerous. I got to know John Tukey
somewhat (I used to chauffeur him to and from meetings with a group of Merck
statisticians), and I, too, think he would have been the first to agree that
some flexibility here is wise. 
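
A minimal sketch of the sample-size point, assuming exactly normal data. The
helper name count_flagged and the sample sizes are invented for illustration;
boxplot.stats() and its coef argument are the base R pieces that boxplot()
uses for its range argument. Under the default 1.5 * IQR rule about 0.7% of
normal data falls beyond the whiskers, so the number of flagged points grows
with n.

```r
# Count points beyond the whiskers, using the same rule as boxplot().
# 'count_flagged' is a hypothetical helper, not part of base R.
count_flagged <- function(x, range = 1.5) {
  length(boxplot.stats(x, coef = range)$out)
}

set.seed(1)
sizes <- c(50, 500, 5000, 50000)   # illustrative sample sizes
samples <- lapply(sizes, rnorm)    # perfectly normal data

# Default rule: the flagged count grows roughly in proportion to n
flagged_default <- sapply(samples, count_flagged)

# A wider multiplier flags far fewer points in the same data
flagged_wide <- sapply(samples, count_flagged, range = 3)

print(data.frame(n = sizes, flagged_default, flagged_wide))

# Theoretical fraction beyond the whiskers for normal data, 1.5 * IQR rule
frac_normal <- 2 * pnorm(qnorm(0.25) - 1.5 * (qnorm(0.75) - qnorm(0.25)))
print(frac_normal)
```

The same multiplier can be passed straight to the plot, e.g.
boxplot(x, range = 3), which is one way to exercise the flexibility discussed
here without changing what the box itself means.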

HOWEVER, the chief advantage of boxplots is their simplicity at displaying
simultaneously and easily **several** important aspects of the data, of
which outliers are probably the most problematic (as they often result in
severe distortion of the plots without careful scaling). Even with dozens of
boxplots, center, scale, and skewness are easy to discern and compare. I
think this would NOT be true of "violin" plots and other more complex
versions -- simplicity can be a virtue.
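
The scaling problem can be sketched as follows, with made-up data: 30 groups
whose center and spread drift slowly, plus one gross outlier planted in group
5. The group count, the drift, and the value 200 are all assumptions chosen
for illustration. The outline argument of boxplot() suppresses the drawing of
points beyond the whiskers, which lets the axis follow the boxes again.

```r
set.seed(2)
# Hypothetical data: 30 groups of 40 points; center and spread drift slowly
groups <- lapply(1:30, function(i) rnorm(40, mean = i / 10, sd = 1 + i / 30))
# One gross outlier in group 5, enough to squash every box on a shared axis
groups[[5]][1] <- 200

op <- par(mfrow = c(1, 2))
# Left: outliers drawn, so the axis stretches to 200 and the boxes collapse
b <- boxplot(groups, main = "outliers drawn")
# Right: outliers not drawn, so the axis follows the whiskers
boxplot(groups, outline = FALSE, main = "outline = FALSE")
par(op)

# The outlier is still recorded in the returned summary
print(max(b$out))
```

In the right-hand panel the drift in center and scale across the 30 groups
should be readable at a glance, at the price of hiding the flagged points.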

Finally, a tidbit for boxplot aficionados: how does one detect bimodality
from a boxplot?
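
One way to explore the question, offered as a sketch rather than an answer
from the thread: compare a two-clump sample with a flat one whose quartiles
are nearly the same. Both distributions and all parameter values below are
invented for illustration.

```r
set.seed(3)
n <- 2000
# Two well-separated clumps versus one flat distribution (both invented here)
bimodal <- c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 2, sd = 0.5))
flat    <- runif(2 * n, min = -4, max = 4)

# Five-number summaries: the hinges of the two samples nearly coincide
five <- rbind(bimodal = fivenum(bimodal), flat = fivenum(flat))
colnames(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(round(five, 2))

op <- par(mfrow = c(1, 3))
boxplot(list(bimodal = bimodal, flat = flat), main = "boxplots")
hist(bimodal, breaks = 40, main = "bimodal")
hist(flat, breaks = 40, main = "flat")
par(op)

# Share of each sample near zero: the gap that the box itself does not draw
gap_share <- c(bimodal = mean(abs(bimodal) < 0.5), flat = mean(abs(flat) < 0.5))
print(gap_share)
```

The two boxes come out about equally long, while the histograms differ
sharply in the middle; whether any feature of the boxplot hints at the gap is
left as the exercise.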

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Ted Harding
> Sent: Monday, July 11, 2005 2:52 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Boxplot philosophy {was "Boxplot in R"}
>
> On 11-Jul-05 Martin Maechler wrote:
> >>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
> >>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:
> > 
> >     AdaiR> Just an addendum on the philosophical aspect of doing
> >     AdaiR> this.  By selecting the 5% and 95% quantiles, you are
> >     AdaiR> always going to get 10% of the data as "extreme" and
> >     AdaiR> these points may not necessarily be outliers.  So when
> >     AdaiR> you are comparing information from multiple columns
> >     AdaiR> (i.e.  boxplots), it is harder to say which column
> >     AdaiR> contains more extreme values compared to others etc.
> > 
> > Yes, indeed!
> > 
> > People {and software implementations} have several times provided
> > differing definitions of how the boxplot whiskers should be defined.
> > 
> > I strongly believe that this is very often a very bad idea!!
> > 
> > A boxplot should be a universal means of communication and so one
> > should be *VERY* reluctant to redefine the outliers.
> > 
> > I just find that Matlab (in their statistics toolbox)
> > does *NOT* use such a silly 5% / 95% definition of the whiskers,
> > at least not according to their documentation.
> > That's very good (and I wonder where you, Larry, got the idea of
> > the 5 / 95 %).
> > Using such a fixed percentage is really a very inferior idea to
> > John Tukey's definition {the one in use in all implementations
> > of S (including R) probably for close to 20 years now}.
> > 
> > I see one flaw in Tukey's definition {which is shared of course
> > by any silly "percentage" based ``outlier'' definition}:
> > 
> >    The non-dependency on the sample size.
> > 
> > If you have 1000 (or even many more) points,
> > you'll get more and more `outliers' even for perfectly normal data.
> > 
> > But then, I assume John Tukey would have told us to do more
> > sophisticated things {maybe things like the "violin plots"} than
> > boxplot  if you have really very many data points, you may want
> > to see more features -- or he would have agreed to use 
> >    boxplot(*,  range = monotone_slowly_growing(n) )
> > for largish sample sizes n.
> > 
> > Martin Maechler, ETH Zurich
>
> I happily agree with Martin's essay on Boxplot philosophy.
> It would certainly confuse boxplot watchers if the interpretation
> of what they saw had to vary from case to case. The fact that
> careful (and necessarily detailed) explanations of what was
> different this time would be necessary in the text would not
> help much, and would defeat the primary objective of the boxplot
> which is to present a summary of features of the data in a form
> which can be grasped visually very quickly indeed.
>
> I'm sure many of us have at times felt some frustration at the
> rigidly precise numerical interpretations which Tukey imposed
> on the elements of his many EDA techniques; but this did ensure
> that the viewer really knew, at a glance, what he was looking at.
>
> EDA brilliantly combined several aspects of "looking at data":
> selection of features of the data; highly efficient encoding of
> these, and of their inter-relationships, into a medium directly
> adapted to visual perception; robustness (so that the perceptions
> were not unstable with respect to wondering just what the underlying
> distribution might be); accessibility (in the sense of being truly
> understood) to non-theoreticians; and capacity to be implemented on
> primitive information technology.
>
> Indeed, one might say that the "core team" of EDA consists of the
> techniques for which you need only pencil and paper.
>
> Nevertheless, Tukey was no rigid dogmatist. His objective was
> always to give a good representation of the data, and he would
> happily shift his ground, or adapt a technique (albeit probably
> giving it a different name), or devise a new one, if that would
> be useful for the case in hand.
>
> Best wishes to all,
> Ted.
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 11-Jul-05                                       Time: 22:19:47
> ------------------------------ XFMail ------------------------------
