[R] Prediction with multiple zeros in the dependent variable

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Sep 8 14:24:51 CEST 2005

John Sorkin wrote:
> I have a batch of data in which each line contains three values:
> calcium score, age, and sex. I would like to predict calcium score as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately, the
> calcium scores have a very "ugly distribution". There are multiple
> zeros and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed; however, the values between 300 and 600 have a
> distribution that is log-normal. As you might imagine, the residuals
> from the regression are not normally distributed and thus violate the
> basic assumption of regression analyses. Does anyone have a suggestion
> for a method (or a transformation) that will allow me to predict calcium
> from age and sex without violating the assumptions of the model?
> Thanks,
> John
> John Sorkin M.D., Ph.D.
> Chief, Biostatistics and Informatics
> Baltimore VA Medical Center GRECC and
> University of Maryland School of Medicine Claude Pepper OAIC

John - first I would try a proportional odds (PO) model, with zero as its
own category, and then either treating all other values as continuous or
collapsing them into 20-tiles. If the PO assumption happens to hold (look
at partial residual plots) you have a simple solution.


Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
