[R] How to superimpose a histogram and density plot

Martin Maechler maechler at stat.math.ethz.ch
Tue Jun 8 10:24:49 CEST 1999


>>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:

    PD> "Venables, Bill (CMIS, Cleveland)" <Bill.Venables at cmis.CSIRO.AU>
    PD> writes:
    >> The fact that every elementary book on statistics does it this way
    >> does not make it correct.  To be helpful, a histogram really has to
    >> be a non-parametric density estimator, period.
    >> 
    >> Enough already of polemics.

    PD> Not quite! There is a reason for doing it the other way, namely
    PD> that the concept of a histogram generally comes before the concept
    PD> of a probability density, pedagogically. It is very easy to explain
    PD> that you chop up the axis into bins and count the number of data
    PD> points that fall in each of them. I bet that half of the MDs that I
    PD> teach never quite understand the density (hell, the author of the
    PD> textbook I use managed to plot three identical gaussian curves with
    PD> identical y axis but different x axes... and he's a
    PD> statistician). So for the basic uses of the histogram, one would be
    PD> replacing a perfectly intuitive simple unit with a substantially
    PD> more complex one.

I agree 100% with Peter.  
Being a mathematician, I agree with Bill that for us, a histogram is a
(very suboptimal) density estimate;  but average statistics software users
*do* learn histograms differently;
quite a few "learn" histograms even before high school.

    >> If you want a density estimate and a histogram 
    >> on the same scale, I suggest you try something like this:
    >> 
    >> > IQR <- diff(summary(data)[c(5,2)])

With R, the above line is superfluous:  

     1) IQR(.) is already an R function!
     2) density(.) in R *has* a reasonable default bandwidth (contrary to S),
        namely Silverman's rule of thumb
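
A minimal sketch of the two points above, assuming a made-up sample `x`
(the data and the variable names are for illustration only). It compares
IQR() with the quartile difference computed by hand, and the bandwidth that
density() chooses on its own with bw.nrd0(), which is the rule-of-thumb
default in current versions of R:

```r
set.seed(1)                        # hypothetical data, for illustration only
x <- rnorm(200)

# 1) IQR() returns the interquartile range directly
iqr_builtin <- IQR(x)
iqr_manual  <- unname(diff(quantile(x, c(0.25, 0.75))))

# 2) density() picks a bandwidth itself when none is supplied
dest   <- density(x)
bw_rot <- bw.nrd0(x)               # Silverman's rule-of-thumb bandwidth

print(c(IQR = iqr_builtin, by_hand = iqr_manual))
print(c(default_bw = dest$bw, nrd0 = bw_rot))
```

Each printed pair should agree, so no separate IQR or width computation is
needed before calling density().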

    >> > dest <- density(data, width = 2*IQR)  # or some smaller width, maybe,
    >> > hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
    >> +      probability = TRUE)    # <<<--- this is the vital argument
    >> > lines(dest, lty=2)

    PD> Yep. frequency=FALSE has the same effect and might be more logical,
    PD> since the y-axis is not really probability but "probability per x
    PD> unit".

which in sum leads to

    dest <- density(data)
    hist(data, xlim = range(dest$x), xlab = "x", ylab = "density", freq = FALSE)
    lines(dest, lty=2)
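
To try the three lines above without a data set at hand, here is a
self-contained sketch; the simulated sample is an assumption made purely for
illustration. It also checks Peter's remark about "probability per x unit":
with freq = FALSE the bar heights times the bin widths add up to 1, which is
what puts the histogram on the same scale as the density curve.

```r
set.seed(42)
data <- c(rnorm(150), rnorm(50, mean = 4))   # hypothetical two-bump sample

dest <- density(data)                        # default bandwidth
h <- hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
          freq = FALSE)                      # density scale, not counts
lines(dest, lty = 2)                         # dashed density estimate on top

# On the density scale the total bar area is 1
area <- sum(h$density * diff(h$breaks))
print(area)
```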