[R] density of hist(freq = FALSE) inversely affected by data magnitude

William Dunlap wdunlap at tibco.com
Wed Jan 23 00:33:24 CET 2013


The probability density function is not unitless: it is the derivative of the
[cumulative] probability distribution function, so it has units of
delta-probability-mass over delta-x.  It must integrate to 1 (over all possible x),
so if you divide x by 1000 the density must become 1000 times larger to keep the
area at 1.  hist(freq=FALSE,x) or hist(prob=TRUE,x) displays an estimate of the
density function, and the following example shows how the scale matches what you
get from the presumed population density function.

> f
function (n, sd) 
{
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), length.out = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
> f(1e6, sd=1)
> f(100, sd=1)
> f(100, sd=0.0001)
> f(1e6, sd=0.0001)
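
A minimal sketch of the same point, reusing the x and y from the quoted example below. The names hx, hy, area_x, area_y and prop_x are illustrative only and are not part of hist(). It assumes nothing beyond the dens <- counts/(n * diff(breaks)) line quoted from hist.default: the bar heights change with the scale of the data, but height times bin width (the bar area) is the unitless proportion of observations in each bin, and those areas sum to 1.

```r
# Sketch: density heights rescale with the data, but total area stays 1.
set.seed(4444)                  # same seed as the quoted example
x <- runif(100)
y <- x / 1000                   # same data in units 1000 times smaller

hx <- hist(x, plot = FALSE)     # compute the bins without drawing
hy <- hist(y, plot = FALSE)

# Area of each bar = density * bin width; the areas sum to 1.
area_x <- sum(hx$density * diff(hx$breaks))
area_y <- sum(hy$density * diff(hy$breaks))
print(c(area_x = area_x, area_y = area_y))

# Each bar's area is the proportion of observations in that bin
# (counts / n), which is the unitless quantity.
prop_x <- hx$counts / sum(hx$counts)
print(all.equal(hx$density * diff(hx$breaks), prop_x))

# Ratio of the tallest bars; close to 1000 if both calls pick matching bins.
print(max(hy$density) / max(hx$density))
```

If proportions per bin are what you want on the y-axis, prop_x can be plotted directly (for example with barplot(prop_x)), with the caveat that bar heights are then only comparable when all bins have the same width.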

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of J Toll
> Sent: Tuesday, January 22, 2013 2:48 PM
> To: r-help
> Subject: [R] density of hist(freq = FALSE) inversely affected by data magnitude
> 
> Hi,
> 
> I have a couple of observations, a question or two, and perhaps a
> suggestion related to the plotting of density on the y-axis within the
> hist() function when freq=FALSE.  I was using the function and trying
> to develop an intuitive understanding of what the density is telling
> me.  After reading through this fairly helpful post:
> 
> http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis
> 
> I finally realized that in the case where freq = FALSE, the y-axis
> isn't really telling me the density.  It's actually indicating the
> density multiplied by the bin size.  I assume this is for the case
> where the bins may be of non-regular size.
> 
> from hist.default:
> 
> dens <- counts/(n * diff(breaks))
> 
> So the count in each bin is divided by the total number of
> observations (n) multiplied by the size of the bin.  The problem, as I
> see it, is that the density ends up being scaled by the size of the
> bins, which is inversely proportional to the magnitude of the data.
> Therefore the magnitude of the data is directly affecting the density,
> which seems problematic.
> 
> For example*:
> 
> set.seed(4444)
> x <- runif(100)
> y <- x / 1000
> 
> par(mfrow = c(2, 1))
> hist(x, prob = TRUE)
> hist(y, prob = TRUE)
> 
> From this example, you see that the density for the y histogram is
> 1000 times larger, simply because the y data is 1000 times smaller.
> Again, that seems problematic.  It seems to me, that the density
> should be unit-less, but here it's affected by the magnitude of the
> data.
> 
> So, my question is, why is density calculated this way?
> 
> For the case where all the bins are of the same size, I would think
> density should simply be calculated as:
> 
> dens <- counts / n
> 
> Of course, that might be somewhat misleading for the case where the
> bin sizes vary.  So then why not calculate density as:
> 
> dens <- counts / (n * diff(breaks) / min(diff(breaks)))
> 
> Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
> of the magnitude of the data, and simply leaves the relative
> difference in bin size.
> 
> For the case where all the bins are the same size, the calculation is
> equivalent to dens <- counts / n
> 
> For all other cases, the density is scaled by the size of the bin, but
> unaffected by the magnitude of the data.
> 
> So, what am I misunderstanding?  Why is density calculated as it is,
> and what does it mean?
> 
> Thanks,
> 
> 
> James
> 
> 
> *example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis