[R] density of hist(freq = FALSE) inversely affected by data magnitude

Tue Jan 22 23:48:07 CET 2013

Hi,

I have a couple of observations, a question or two, and perhaps a
suggestion related to the plotting of density on the y-axis within the
hist() function when freq=FALSE.  I was using the function and trying
to develop an intuitive understanding of what the density is telling
me.  After reading through this fairly helpful post:

http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

I finally realized that in the case where freq = FALSE, the y-axis
isn't telling me the proportion of observations in each bin.  It's
actually showing that proportion divided by the bin width.  I assume
this is to handle the case where the bins are of unequal width.

From hist.default:

dens <- counts/(n * diff(breaks))

So the count in each bin is divided by the product of the total number
of observations (n) and the width of the bin.  The problem, as I see
it, is that the density ends up being divided by the bin width, and the
bin width is proportional to the magnitude of the data.  Therefore the
magnitude of the data inversely affects the density, which seems
problematic.
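
A quick way to check this reading of the hist.default line is to redo
the calculation by hand.  This is only a sketch: the sample data and
the names n, width and dens_manual are arbitrary choices for
illustration, and plot = FALSE is used just to get the bins without
drawing anything.

```r
# Reproduce the density calculation by hand (sample data is arbitrary)
set.seed(1)
x <- rnorm(200)

h     <- hist(x, plot = FALSE)   # compute the bins without drawing
n     <- sum(h$counts)           # total number of observations
width <- diff(h$breaks)          # width of each bin

# Same formula as the line quoted from hist.default
dens_manual <- h$counts / (n * width)

all.equal(dens_manual, h$density)  # compare with what hist() reports
sum(h$density * width)             # total area of the bars
```

If the reading is right, the first check agrees with h$density and the
bar areas (height times width) add up to 1.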

For example*:

set.seed(4444)
x <- runif(100)
y <- x / 1000

par(mfrow = c(2, 1))
hist(x, prob = TRUE)
hist(y, prob = TRUE)

From this example, you can see that the density for the y histogram is
1000 times larger, simply because the y data are 1000 times smaller.
Again, that seems problematic.  It seems to me that the density
should be unitless, but here it's affected by the magnitude of the
data.
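
The same comparison can be made numerically rather than by eye.  In
this sketch the breaks are fixed explicitly (b_x and b_y are my own
names, and fixing the breaks is my assumption, not part of the
original example) so that the two histograms use corresponding bins.

```r
set.seed(4444)
x <- runif(100)
y <- x / 1000

# Fix the breaks so both histograms use corresponding bins
b_x <- seq(0, 1, by = 0.1)
b_y <- b_x / 1000

hx <- hist(x, breaks = b_x, plot = FALSE)
hy <- hist(y, breaks = b_y, plot = FALSE)

# Proportion of observations in each bin (counts / n)
hx$counts / sum(hx$counts)
hy$counts / sum(hy$counts)

# The bin widths differ by the same factor as the data
diff(b_x) / diff(b_y)

# Total area of the bars in each histogram
sum(hx$density * diff(b_x))
sum(hy$density * diff(b_y))
```

The bin widths differ by the factor of 1000, while the total bar area
is 1 for both x and y.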

So, my question is, why is density calculated this way?

For the case where all the bins are of the same size, I would think
density should simply be calculated as:

dens <- counts / n

Of course, that might be somewhat misleading for the case where the
bin sizes vary.  So then why not calculate density as:

dens <- counts / (n * diff(breaks) / min(diff(breaks)))

Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
of the magnitude of the data, and simply leaves the relative
difference in bin size.

For the case where all the bins are the same size, the calculation is
equivalent to dens <- counts / n

For all other cases, the density is scaled by the size of the bin, but
unaffected by the magnitude of the data.
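
To make the suggestion concrete, here is a rough sketch of the
alternative scaling.  rel_dens is a hypothetical helper written only
for this illustration (it is not part of hist() or of any package),
and the unequal breaks in b are made up.

```r
# Hypothetical helper: counts scaled by relative (not absolute) bin width
rel_dens <- function(x, breaks = "Sturges") {
  h <- hist(x, breaks = breaks, plot = FALSE)
  n <- sum(h$counts)      # total number of observations
  w <- diff(h$breaks)     # absolute bin widths
  h$counts / (n * w / min(w))
}

set.seed(4444)
x <- runif(100)
b <- c(0, 0.1, 0.2, 0.4, 0.7, 1)   # deliberately unequal bins (made up)

rel_dens(x, b)
rel_dens(x / 1000, b / 1000)       # rescaled data, rescaled breaks

# Equal-width bins: compare with counts / n
h_eq <- hist(x, breaks = seq(0, 1, by = 0.1), plot = FALSE)
rel_dens(x, seq(0, 1, by = 0.1))
h_eq$counts / sum(h_eq$counts)
```

The first two calls should give the same values, and with equal-width
bins the result should match counts / n.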

So, what am I misunderstanding?  Why is density calculated as it is,
and what does it mean?

Thanks,

James

*example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis