[R] Prediction with multiple zeros in the dependent variable

Thomas Lumley tlumley at u.washington.edu
Thu Sep 8 16:22:32 CEST 2005

On Thu, 8 Sep 2005, John Sorkin wrote:
> I have a batch of data in which each line contains three values,
> calcium score, age, and sex. I would like to predict calcium scores as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately the
> calcium scores have a very "ugly distribution". There are multiple
> zeros, and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed; however, the values between 300 and 600 have a
> distribution that is log normal.

[Coronary artery calcium by EBCT, I presume]

Our approach to modelling calcium scores is to do it in two parts.  First 
fit something like a logistic regression model where the outcome is zero 
vs non-zero calcium.  Then, for the non-zero scores, use something like a 
linear regression model for log calcium.
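
As a rough illustration of the two-part idea, here is a minimal sketch in plain Python (standard library only). It is not the code we use; in R the two fits would more naturally be a glm() with family = binomial and an lm() on log(calcium) for the non-zero rows. The function names (solve, fit_logistic, fit_ols, fit_two_part) and the demo data are invented for the example, and the logistic fit is a bare Newton-Raphson with no safeguards against separation.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(X, y, iters=25):
    """Part 1: logistic regression by Newton-Raphson. Rows of X include the intercept."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += w * row[i] * row[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def fit_ols(X, y):
    """Part 2: least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def fit_two_part(age, sex, calcium):
    """Return (coefficients for zero vs non-zero, coefficients for log calcium)."""
    X = [[1.0, a, s] for a, s in zip(age, sex)]
    nonzero = [1.0 if c > 0 else 0.0 for c in calcium]
    beta_zero = fit_logistic(X, nonzero)
    # The amount model uses only the subjects with non-zero calcium.
    X_pos = [row for row, c in zip(X, calcium) if c > 0]
    log_y = [math.log(c) for c in calcium if c > 0]
    beta_amount = fit_ols(X_pos, log_y)
    return beta_zero, beta_amount

# Synthetic demo data (invented for illustration, not real calcium scores):
# 4 subjects per age/sex cell, 1 of 4 non-zero when sex == 0, 3 of 4 when sex == 1.
age, sex, calcium = [], [], []
for a in (40, 50, 60, 70):
    for s in (0, 1):
        n_nonzero = 3 if s == 1 else 1
        for i in range(4):
            age.append(float(a))
            sex.append(float(s))
            calcium.append(math.exp(5.5 + 0.01 * a + 0.1 * s) if i < n_nonzero else 0.0)

beta_zero, beta_amount = fit_two_part(age, sex, calcium)
print("zero vs non-zero (intercept, age, sex):", beta_zero)
print("log calcium      (intercept, age, sex):", beta_amount)
```

The two parts are fitted separately, so nothing forces them to share predictors; each can have its own covariate list.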

You could presumably use such a model for prediction or imputation too, 
and you can work out means, medians etc from the two models.
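
To make "work out means, medians etc" concrete, here is a small sketch in plain Python. It assumes the non-zero scores are exactly lognormal given the covariates, which is an assumption of the example rather than something the two-part approach requires. The names are hypothetical: p_nonzero is the fitted probability of a non-zero score from the first model, mu is the fitted mean of log calcium from the second, and sigma is its residual standard deviation.

```python
import math
from statistics import NormalDist

def two_part_mean(p_nonzero, mu, sigma):
    """Overall mean: P(non-zero) times the lognormal mean exp(mu + sigma^2 / 2)."""
    return p_nonzero * math.exp(mu + 0.5 * sigma ** 2)

def two_part_quantile(q, p_nonzero, mu, sigma):
    """q-th quantile (0 < q < 1) of the mixture of a point mass at zero and a lognormal."""
    p_zero = 1.0 - p_nonzero
    if q <= p_zero:
        # At least a fraction q of subjects have a score of exactly zero.
        return 0.0
    # Rescale q to a quantile within the non-zero part of the distribution.
    q_cond = (q - p_zero) / p_nonzero
    return math.exp(mu + sigma * NormalDist().inv_cdf(q_cond))

# Example with made-up fitted values for one age/sex combination.
print(two_part_mean(0.6, math.log(400), 0.2))
print(two_part_quantile(0.5, 0.6, math.log(400), 0.2))   # median
```

Note that the median is zero whenever at least half the predicted distribution sits at zero, so the mean and median can tell quite different stories for the same subject.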

One particular reason for using this two-part model is that we find 
different predictors of zero/non-zero and of amount. This makes biological 
sense -- a factor that makes arterial plaques calcify might well have no 
impact until you have arterial plaques.

Or you could use smooth quantile regression, e.g. with rq() in the quantreg 
package.

