[R] How to calculate errors in histogram values

Thu Nov 3 16:45:38 CET 2005

On 03-Nov-05 Kilian Hagemann wrote:
> Hi there,
> 
> I'm new to R but I thought this is the most likely place I
> could get advice or hints w.r.t the following problem:
> 
> I have a series of measurements xi with associated uncertainties dxi.
> I would like to construct the probability density histogram of this
> data where each density estimate has an associated error that is
> derived from the dxi.
> In other words, for large dxi the histogram should also display
> large uncertainties and vice versa. I need this for a curve fitting
> algorithm.
> 
> I have seen many crude ways of working out the error in each bin based
> on the bin count alone, but that's obviously independent of the dxi
> and thus not what I'm after.
> 
> So,
> 
> 1) Is there an R package that can do this (there's nothing in the
> refence of 
> 2.1.1)? If so, what algorithm does it use?
> 
> 2) Could anybody please point me in the right direction (papers, books,
> websites etc.)
> 
> Thanks,
> 
> -- 
> Kilian Hagemann

I don't know about an R package that would deal with this directly,
but I can think of an aproach, not difficult to implement in R,
which may be helpful.

I'm going to assume (at any rate for the time being), that you
are interested in a "per-bin" uncertainty, i.e. that you want
to be able to to answer "For each bin, what is the uncertainty
in the count for this bin, regardless of any other bins?"
I.e. you are ignoring the fact that there is correlation between
bins (what goes into one bin can not go into another).

1. Say you have N observations (i = 1:N). Draw a preliminary
   histogram, and from this decide on a good set of fixed breaks.

2. Extend this list by a few bins on either side (you may have
   to return to this point, depending on the outcome of later
   stages). Say this gives you K bins (j = 1:K).

3. For each data-point xi, with associated dxi, and for each bin j,
   use this to compute the "probability" pij that a point with
   mean xi and "error" dxi should fall into bin j. This might be
   based on something as naive as integrating a Normal distribution
   with mean xi and SD dxi over the range of the bin j.

4. You then have an array P = p[i,j], say, where in row i you
   have the computed probabilities for bins 1:K

5. Now: for bin j, you have an N-column of pij values.

   The expected number of the N which might "really" be in bin j
   is then

     Ej = sum(P[,j])

   and its variance (assuming that the "errors" in the xi are
   independent of each other) is

     Vj = sum(P[,j]*(1 - P[,j]))

6. So now you have the "bin that might have been", with expected
   value Ej and standard deviation Sj = sqrt(Vj). Now draw a
   "histogram" (you can use 'lines()' for this in R) with "bin
   heights" Ej, and "errors bars" +/- Sj.

   It is at this stage that you may have to go back to stage 2.
   In order to be sure that you will not overlook xi values that
   spill outside the range of the bins you chose in stage 2,
   you need to verify that the bins you are using extend beyond
   the range of the original data, and that the two end-bins
   have negligible E1 and EK.

7. NOTE that the Ej will in general be different from the counts
   in bin j from the original data. This is due to "overspill":
   if you have an original bin with a small count, which has next
   to it a bin with a large count, the uncertainty about whether
   some of the latter should really be in the former will contribute
   positively to the bin with the small count, by a larger amount
   than the bin with the small count will contribute to the bin
   with the large count.

   As you can see, this will have an effect of somewhat "flattening"
   the histogram, and of smoothing irregular variation from bin to
   bin.

This is just an outline of a possible approach, which you may be
able to develop to better suit your purposes if they are different
from what I've been assuming.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Nov-05                                       Time: 15:45:35
------------------------------ XFMail ------------------------------