[R] Linear regression with a rounded response variable

Ravi Varadhan ravi.varadhan at jhu.edu
Thu Oct 22 04:00:10 CEST 2015


Dear Peter, Charles, Gabor, Jim, Mark, Victor, Peter, and Harold,
You have given me plenty of ammunition.  Thank you very much for the useful answers. 
Gratefully,
Ravi
________________________________________
From: peter dalgaard <pdalgd at gmail.com>
Sent: Wednesday, October 21, 2015 8:11 PM
To: Charles C. Berry
Cc: Ravi Varadhan; r-help at r-project.org
Subject: Re: [R] Linear regression with a rounded response variable

> On 21 Oct 2015, at 19:57 , Charles C. Berry <ccberry at ucsd.edu> wrote:
>
> On Wed, 21 Oct 2015, Ravi Varadhan wrote:
>
>> [snippage]
>
> If half the subjects have a value of 5 seconds and the rest are split between 4 and 6, your assertion that rounding induces an error of dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second group and more negative errors in the 4 second group under any plausible model).

Yes, and I think that the suggestion in another post to look at censored regression is more in the right direction.
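For concreteness, one way to set up such a fit (a sketch, not code from the thread) is to treat each rounded value y as interval-censored on [y - 0.5, y + 0.5] and use survreg() from the survival package; the seed and sample size here are arbitrary:

```r
# Sketch: interval-censored ("rounding-aware") fit of a mean.
# Each observed y is assumed to lie in [y - 0.5, y + 0.5].
library(survival)

set.seed(1)
y <- round(rnorm(100, pi, 0.5))

# type = "interval2" takes (lower, upper) bounds per observation
fit <- survreg(Surv(y - 0.5, y + 0.5, type = "interval2") ~ 1,
               dist = "gaussian")
coef(fit)  # intercept = estimated mean, close to mean(y)
```

The same idea extends to regression by putting covariates on the right-hand side of the formula.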

In general, I'd expect the bias caused by rounding the response to be quite small, except at very high granularity. I did a few small experiments with the simplest possible linear model: estimating a mean based on highly rounded data:

> y <- round(rnorm(1e2,pi,.5))
> mean(y)
[1] 3.12
> table(y)
y
 2  3  4  5
13 63 23  1

Or, using a bigger sample:

> mean(round(rnorm(1e8,pi,.5)))
[1] 3.139843

in which there is a visible bias, but quite a small one:

> pi - 3.139843
[1] 0.001749654

At lower granularity (sd = 1 instead of .5), the bias has almost disappeared:

> mean(round(rnorm(1e8,pi,1)))
[1] 3.141577

If the granularity is increased sufficiently, you _will_ see a sizeable bias (because almost all observations will be round(pi)==3):

> mean(round(rnorm(1e8,pi,.1)))
[1] 3.00017
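These limiting values can also be computed directly rather than simulated (a quick check, not in the original post): E[round(X)] = sum over k of k * (pnorm(k + .5, mu, sigma) - pnorm(k - .5, mu, sigma)).

```r
# Exact expectation of round(X) for X ~ N(mu, sigma),
# verifying the three simulated means above.
eround <- function(mu, sigma) {
  k <- -10:20  # wide enough to capture essentially all mass
  sum(k * (pnorm(k + 0.5, mu, sigma) - pnorm(k - 0.5, mu, sigma)))
}
eround(pi, 0.5)  # ~ 3.1398
eround(pi, 1)    # ~ 3.1416
eround(pi, 0.1)  # ~ 3.0002
```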


A full ML fit (with known sigma = 0.5, matching the data-generating sd) is pretty easily done:

> library(stats4)
> mll <- function(mu)-sum(log(pnorm(y+.5,mu, .5)-pnorm(y-.5, mu, .5)))
> mle(mll,start=list(mu=3))

Call:
mle(minuslogl = mll, start = list(mu = 3))

Coefficients:
      mu
3.122069
> mean(y)
[1] 3.12

As you see, the difference is only 0.002.

A small simulation (1000 repl.) gave (r[1,] == MLE; r[2,] == mean)

> summary(r[1,]-r[2,])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.004155  0.000702  0.001495  0.001671  0.002554  0.006860

so the corrections relative to the crude mean stay within one unit in the 2nd decimal place. Notice that the corrections are pretty darn close to cancelling out the bias.
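A reconstruction of that simulation (a sketch; the exact replication code isn't in the post, so the seed and sample size are assumptions) might look like:

```r
# Sketch: compare the interval-likelihood MLE with the crude mean
# over 1000 replications of n = 100 rounded N(pi, 0.5) draws.
library(stats4)

set.seed(42)
r <- replicate(1000, {
  y <- round(rnorm(100, pi, 0.5))
  mll <- function(mu)
    -sum(log(pnorm(y + .5, mu, .5) - pnorm(y - .5, mu, .5)))
  fit <- mle(mll, start = list(mu = 3))
  c(mle = unname(coef(fit)), mean = mean(y))
})
summary(r["mle", ] - r["mean", ])
```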

-pd

>
>
> HTH,
>
> Chuck
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
