[R] random truncation

Sun Jul 14 04:04:29 CEST 2019

PLEASE EXCUSE:  This discussion has diverged from R into discussing the 
precise assumptions seemingly descriptive of an application that drove 
the initial post to this thread.  A reply by Abby Spurdle seemed to  me 
to raise questions, whose answers may not be intelligible without 
material snipped from Spurdle's reply.  I wish to thank Spurdle from the 
reply and apologize to those who feel this is an abuse of this list.  I 
trust that those in the latter category will please not bother to read 
further.  For anyone still interested in this problem, below please find 
my earlier analysis with corrections that attempt to respond to 
Spurdle's most recent concerns.  Thanks, Spencer Graves

On 2019-07-12 22:31, Abby Spurdle wrote:
> > The distribution of the randomly truncated variable has thus four
> > parameters: a, b, mu and sigma.  I was able to write down the likelihood
> > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of 
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be 
> used.
>
> However, the problem needs to be defined *precisely* first.
>

Correct:  In my case, I confess I hadn't thought this through completely 
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me 
to the "truncreg" and "DTDA" packages, neither of which seemed to be 
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.

       I'm observing a random variable Y[i] = (x[i]'b+e[i]) given 
Y[i]>(z[i]'c+f[i]) where the tick mark (') denotes transpose of a 
vector, and e and f are normally distributed with mean 0 and standard 
deviations s and t, respectively, i = 1:n.  Thus, Y[i] follows a 
truncated normal distribution with mean x[i]'b and standard deviation s, 
with the truncation condition being that Y[i]>(z[i]'c+f[i]).

       I want the total of all the Y's from the untruncated 
distribution, i.e., including those truncated (and not observed).

       I think the likelihood is the product of the density of Y[i] 
given x[i] and given that Y[i] is actually observed.  By substituting 
Y[i] = (x[i]'b+e[i]) into the truncation condition Y[i]>(z[i]'c+f[i]), 
we get the following:

             (x[i]'b+e)>(z[i]c+f).

This occurs if and only if:

             (x[i]'b-z[i]'c)>(f-e),

Therefore, the probability that Y[i] is observed (and not truncated) is

             Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))

where Phi is the cdf of the standard normal.

       And then the likelihood for observation i can be written as follows:

             f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / 
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)).

       We may not be able to estimate "c" in this, because if one of the 
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the 
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we 
still cannot estimate "c".)  However, if "b" is estimable, ignoring the 
truncation, then we can estimate "b", "s" and "t" given "c".

       And then the desired total of all the Y's, observed and 
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.

       This likelihood is simple enough, it can be easily programmed in 
R and maximized over variations in "b", "s" and "t" given "c".  I can 
get starting values for "b" and "s" from "lm", ignoring the truncation.  
And I can first fit the model assuming t = s, then test whether it's 
different using likelihood ratio.  And I can try to estimate "c", but I 
should probably use values I can estimate from other sources until I'm 
comfortable with the estimates I get for "b", "s" and "t" given an 
assumed value for "c".

       Comments?
       Thanks so much.
       Spencer Graves