[R] histograms

Tue Jun 8 13:43:54 CEST 1999

On 08-Jun-99 D.A.Wooff at durham.ac.uk wrote:
> 
> I hope there are many of us that agree 100% with Bill. Bad practice,
> as enshrined in the default behaviour of histogram, should be
> discouraged.  We should aim to introduce density-based histograms from
> the outset, and the default behaviour of histograms in many packages
> acts against this principle. The current default behaviour conveys a
> misleading and arguably useless summary and I don't go with the
> argument that we should persist with it because it is simple to
> understand where the numbers come from.

What's going on? There's NOTHING wrong with a histogram as such.
"Bad practice, as enshrined in the default behaviour of histogram";
"The current default behaviour conveys a misleading and arguably useless
summary"; -- I respectfully disagree. Aka b****cks.

If the histogram bin size matches the discretization of the data,
then the histogram is equivalent to the data but simply represents
it differently. What's wrong with that?

If the bin size is coarser, then some information is lost of course.
But the nature of the loss (no discrimination within bins) is well
defined and unambiguous, and there is no interference between
different bins. What (apart from the loss of this specific info)
is wrong with that?

I recently had some data of which I did histos with bin-size equal to
data resolution. The following leapt to the eye (summarised in tabular
form):

X: 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 etc

N: 856   0 730   0   0 723   0 584   0   0 425   1 319   0   0 220 etc

Misleading and useless? Highly informative, according to me; and
I probably would not have noticed it so readily without looking at the
histogram. A density estimate would have made a real mush of it.
A histogram binned to width 0.2 would have completely (but cleanly)
concealed 90 per cent of it: the 10 per cent being the zero count
for 2.8-2.9, 3.8-3.9, ... so in the end I would have done a raw histo
anyway!
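
As a rough illustration of the width-0.2 point (a sketch, not part of the original analysis): the snippet below uses only the sixteen counts tabulated above, leaves the "etc" unfilled, and merges adjacent pairs of 0.1-wide bins. The variable names and the pairing starting at 2.0 are assumptions made for the example.

```python
# Counts exactly as tabulated in the post; the trailing "etc" is NOT filled in.
x_tenths = list(range(20, 36))   # X values 2.0 .. 3.5, held as integer tenths
counts = [856, 0, 730, 0, 0, 723, 0, 584, 0, 0, 425, 1, 319, 0, 0, 220]

# Merge adjacent pairs of 0.1-wide bins into 0.2-wide bins:
# {2.0, 2.1}, {2.2, 2.3}, ... (pairing assumed to start at 2.0).
coarse = {}
for x, n in zip(x_tenths, counts):
    left = x - (x - 20) % 2      # left-hand X value of the pair, in tenths
    coarse[left] = coarse.get(left, 0) + n

for left, n in coarse.items():
    print(f"{left / 10:.1f}-{(left + 1) / 10:.1f}: {n}")
```

In the coarse table the alternation of occupied and empty values disappears; the only bin that still stands out is the empty 2.8-2.9 one. No counts are moved across bin boundaries, so the total is unchanged.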

Density estimates also lose information. Of course the nature of the
loss is, theoretically, described in the definition of the smoothing
procedure. But in practice it's far more difficult to hypothesise
what may underlie a quirk in a density estimate, because of the
interference between neighbouring data values.
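
To make the "interference" concrete, here is a minimal sketch of a Gaussian kernel estimate applied to the same sixteen tabulated counts. The choice of a Gaussian kernel and the bandwidth h = 0.1 are assumptions picked purely for illustration; no particular smoother is specified in this post.

```python
import math

# Same tabulated values as above; the "etc" is not filled in.
x_vals = [2.0 + 0.1 * i for i in range(16)]
counts = [856, 0, 730, 0, 0, 723, 0, 584, 0, 0, 425, 1, 319, 0, 0, 220]

def kde(x, h):
    """Count-weighted Gaussian kernel density estimate at x, bandwidth h."""
    total = sum(counts)
    norm = total * h * math.sqrt(2 * math.pi)
    return sum(
        n * math.exp(-0.5 * ((x - xi) / h) ** 2)
        for xi, n in zip(x_vals, counts)
    ) / norm

h = 0.1  # bandwidth: an arbitrary illustrative choice, not from the post
for xi, n in zip(x_vals, counts):
    print(f"{xi:.1f}  count={n:4d}  density={kde(xi, h):.3f}")
```

At this bandwidth the estimate at 2.1, where not a single observation falls, comes out comparable to the estimate at 2.0, because the mass at 2.0 and 2.2 leaks into it from both sides. A smaller bandwidth would reduce the leakage, but the contribution of each data value to each point of the curve is still entangled with that of its neighbours.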

Density estimates have the merit of producing pictures which are much
more suggestive of a continuously varying probability density curve. In
some cases this may be usefully informative; in particular the density
estimate is sensitive to any variation in data value. In other cases it
may be merely cosmetic. In the worst cases it may give a seriously
misleading impression (as of course histograms also could).

Both methods have their uses, their (somewhat complementary) merits,
and their (somewhat complementary) demerits. As usual, it's horses
for courses.

But, specifically (as I said to start with): There's NOTHING wrong with
histograms as such.

I don't understand why people suggest that there is. There may, however,
be something seriously wrong with the way many people interpret them, or
with the uses that software packages make of them. But those are
different -- and possibly much more appropriate -- targets.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Date: 08-Jun-99                                       Time: 12:43:54
------------------------------ XFMail ------------------------------