[R] Adjusting for heaping in data

(Ted Harding) Ted.Harding at manchester.ac.uk
Sun Oct 14 12:39:51 CEST 2007


On 14-Oct-07 08:33:41, Thomas Fröjd wrote:
> Hi R users. I am new to the community and have got myself into
> a little problem.

It does not look as though it was you who got yourself into this
problem! You have been given the bathwater along with the baby.

> I have a dataset of birth weights recorded by nurses at a delivery
> clinic in a developing country.
> 
> The weights are entered in kilograms with one decimal. However,
> there is substantial heaping at each 500g when looking at the
> sample in a histogram. Does anyone know an easy way to adjust
> for this, and is there an R package that implements the method?
> 
> Best regards
> Thomas Fröjd

It is quite a common problem for data to be badly recorded in
this kind of way (as well as in other bad ways).

You can't "adjust" for it (in the sense of "compensate") directly,
since such rounding does not tell you where each value was rounded
from. There may, however, be information in the covariates which
could be relevant to that question.

I'll comment on two extreme approaches and a possible intermediate
approach.

1) If you want to treat all data on the same footing, then
   you can round every weight to the nearest 500gm. This has
   the disadvantage of losing the information in the weights
   which have been recorded more precisely. The potential
   difference of up to 250gm, in a typical birth weight of
   say 2-2.5kgm, could result in a serious distortion.

   However, you could assess the effect of this by performing
   your intended analysis using the data as you have them,
   then repeating it with the fully-rounded data, and seeing
   how much difference it makes.
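
   For instance (assuming the weights are in a vector 'wt' -- the
   name is just illustrative):

   wt500 <- round(wt/0.5)*0.5   ## every weight to the nearest 500gm
   ## run the intended analysis on both versions and compare, e.g.
   c(mean(wt),    sd(wt))
   c(mean(wt500), sd(wt500))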

2) You could attempt to evaluate the extra uncertainty which
   results from this rounding which has been done by the nurses.

   One approach could be to fit a Normal distribution (say)
   to the data as you have them. Say this estimates mu0 for
   the mean and s0 for the standard deviation.

   You can then "un-round" the rounded data at random, on
   the basis that, given that a weight is say 2.5 kgm, it
   might be anywhere from 2.25 to 2.75 according to that
   distribution conditional on being in that range. This
   is quite easily done in R: if wt=2.5, say,

   p0  <- pnorm((wt - 0.25 - mu0)/s0)  ## prob. below lower end, 2.25
   p1  <- pnorm((wt + 0.25 - mu0)/s0)  ## prob. below upper end, 2.75
   X   <- runif(1, p0, p1)             ## uniform draw between the two
   rwt <- mu0 + s0*qnorm(X)            ## a draw from the Normal,
                                       ## conditional on (2.25, 2.75)
   rwt <- round(rwt, 1)                ## back to 0.1kgm; see below

   If you do this for every truly rounded 'wt', and perform
   your intended analysis on the resulting "un-rounded"
   dataset (of course after rounding the un-rounded values to
   100gm, to be compatible with the 0.1kgm general rounding),
   and then repeat this unrounding+analysis a few times, you
   will have an estimate of the uncertainty, in your final
   results, which has been introduced by the gross rounding.

   However, you will have to make a decision about what proportion
   of the data at each whole 500gm have really been rounded!

   Some of these are likely to be measurements which were quite
   properly rounded to the nearest 0.1kgm and simply happen to
   fall on a 500gm point -- e.g. 2.05kgm -> 2.0kgm.

   You may be able to estimate this proportion from the heights
   of the "factory chimneys" in the histogram. Then apply the
   above procedure to that fraction.
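
   Putting the pieces of (2) together, a rough sketch -- assuming,
   again purely for illustration, a vector 'wt' and plain mean()
   and sd() for mu0 and s0:

   mu0 <- mean(wt)             ## fit a Normal to the data as recorded
   s0  <- sd(wt)

   d10   <- round(wt*10)       ## weights in units of 0.1kgm
   on500 <- (d10 %% 5 == 0)    ## TRUE where wt sits on a 500gm point

   ## Estimate the heaped fraction from the "factory chimneys":
   ## the excess of the 500gm counts over their 0.1kgm neighbours
   counts <- table(d10)
   v      <- as.numeric(names(counts))
   excess <- pmax(counts[v %% 5 == 0] - mean(counts[v %% 5 != 0]), 0)
   frac   <- sum(excess)/sum(on500)

   ## Un-round that fraction of the heaped values, at random
   heaped <- sample(which(on500), round(frac*sum(on500)))
   p0  <- pnorm((wt[heaped] - 0.25 - mu0)/s0)
   p1  <- pnorm((wt[heaped] + 0.25 - mu0)/s0)
   wt2 <- wt
   wt2[heaped] <- round(mu0 + s0*qnorm(runif(length(heaped), p0, p1)), 1)

   ## now run the intended analysis on wt2, and repeat the
   ## unrounding+analysis a few times to see how much the results
   ## vary between un-rounded versions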

3) If you have covariates with your weight data, you may be
   able to fit an appropriate model to your original data
   which would enable you to estimate, for any given "rounded"
   weight, the mu0 and s0 for that weight in terms of the
   values of the covariates. Then proceed as in (2).

   However, having done that, it may transpire that you should
   re-estimate the model, which would imply re-estimating the
   mu0 and s0 used for the "random unrounding", and then going
   round the loop again. You're moving into Multiple Imputation
   territory now, and again there are resources in R for doing
   it; but it's deeper and more complex territory!
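
   As a very rough sketch of the first step -- the data frame 'dat'
   and the covariates 'gest' and 'parity' are purely illustrative
   assumptions:

   fit <- lm(wt ~ gest + parity, data=dat)  ## covariate model
   mu  <- fitted(fit)                       ## per-observation mu0
   s   <- summary(fit)$sigma                ## residual sd as s0
   ## then un-round each heaped wt[i] using mu[i] and s, as in (2)

   For the full multiple-imputation route, packages such as 'mice'
   on CRAN are worth a look.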

In both (2) and (3), the same check as in (1) should be carried
out: Has it made any difference that matters to the results,
compared with what you get from the original data?

Hoping this helps (at least a bit).
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Oct-07                                       Time: 11:39:47
------------------------------ XFMail ------------------------------


