[R] Prediction with multiple zeros in the dependent variable

Berton Gunter gunter.berton at gene.com
Thu Sep 8 17:21:28 CEST 2005


1. As George Box long ago emphasized and proved, normality is **NOT** that
important in regression, certainly not for estimation and not even for
inference in balanced designs. Independence of the observations is far more
important.
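
A small simulation can illustrate the estimation half of point 1. This is
only a sketch and not part of the original post: the sample size, the fixed
design, the true coefficients, and the choice of centered exponential errors
are all assumptions made for illustration.

```r
set.seed(42)
n <- 50
x <- seq(0, 10, length.out = n)   # fixed design (assumed)
true_slope <- 2                   # assumed true value

# One replicate: skewed, clearly non-normal errors with mean zero
one_fit <- function() {
  e <- rexp(n) - 1
  y <- 1 + true_slope * x + e
  coef(lm(y ~ x))[["x"]]
}

# Average the least-squares slope over many independent replicates
slopes <- replicate(2000, one_fit())
mean_slope <- mean(slopes)        # should sit close to true_slope
```

With independent, mean-zero errors the least-squares slope stays unbiased
even though the errors are far from normal.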

2. That said, it sounds like what you have here is a mixture of some sort.
Before running off to do fancy modeling, I would work very hard to look for
some kind of "lurking variable" or experimental aberration -- what was going
on in the experiment or study that might have caused all the zeros? Was
there an instrument problem? -- a bad reagent? -- improper handling of the
samples? It might very well be that you need to throw away part of the data
because it's useless, rather than artificially attempt to model it.

3. And having said that, if a comprehensive model IS called for, one rather
cynical approach to take is just to add a grouping variable as a covariate
that has a value of 1 for all data in the zero group and 2 for all the
nonzero data. Your model is f(age,sex) = 0 for all data in group 1 and your
linear or nonlinear regression for group 2. Of course, this merely cloaks
the cynicism in respectable dress. It's hard for me to believe that what you
see is Mother Nature and not some kind of experimental problem.
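
The grouping-variable idea in point 3 might look roughly like this in R. It
is a minimal sketch on simulated data: the data frame `d`, the proportion of
zeros, the coefficients, and the helper `predict_calcium()` are all
hypothetical, and the log scale for group 2 is an assumption taken from the
original question's remark that the nonzero values look log normal.

```r
set.seed(1)
n <- 200
d <- data.frame(
  age = round(runif(n, 40, 80)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Made-up data: about 40% zeros, the rest lognormal around 300-600
is_zero <- rbinom(n, 1, 0.4) == 1
d$calcium <- ifelse(is_zero, 0,
                    exp(5.8 + 0.005 * d$age + rnorm(n, sd = 0.1)))

# Grouping variable: 1 = zero group, 2 = nonzero group
d$grp <- ifelse(d$calcium == 0, 1L, 2L)

# Regression fitted to group 2 only, on the log scale
fit2 <- lm(log(calcium) ~ age + sex, data = d, subset = grp == 2)

# f(age, sex) = 0 in group 1, back-transformed regression in group 2
# (exp() of a log-scale fit estimates the median, not the mean)
predict_calcium <- function(newdata) {
  ifelse(newdata$grp == 1, 0, exp(predict(fit2, newdata = newdata)))
}
pred <- predict_calcium(d)
```

Note that `grp` is defined from the response itself, so this describes the
data rather than predicting which group a new subject would fall into.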

A slightly less cynical approach might be to use some sort of changepoint
model (in both age and sex) of the form f(age, sex) = g(age,sex) for age>=k1
and sex <=k2 and h(age,sex) otherwise. Well, perhaps **not** less cynical --
the response data are so widely separated that you'll just be using a bunch
of extra (nonlinear, incidentally) parameters to essentially reproduce the
use of a covariate.
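
The changepoint form can be sketched in the same spirit. This is a toy under
stated assumptions: it uses age only (sex is dropped for brevity), simulated
data with a made-up changepoint at age 60, and a plain grid search over
candidate values of k1, which is just one simple way to fit such a model.

```r
set.seed(2)
n <- 200
age <- runif(n, 40, 80)
# Hypothetical truth: zero below age 60, a linear rise from 300 above it
y <- ifelse(age >= 60, 300 + 5 * (age - 60), 0) + rnorm(n, sd = 10)

# For a candidate changepoint k, fit a separate intercept and slope on
# each side of k and return the residual sum of squares
rss_at <- function(k) {
  hi <- age >= k
  fit <- lm(y ~ hi * age)
  sum(resid(fit)^2)
}

# Grid search: the estimate of k1 is the candidate with the smallest RSS
ks <- seq(45, 75, by = 0.5)
rss <- vapply(ks, rss_at, numeric(1))
k_hat <- ks[which.min(rss)]
```

Because the two regimes are so widely separated, the search amounts to
finding the split between the zero and nonzero groups.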

So I guess the point is that unless you already have a previously developed
nonlinear model that could explain the behavior you see (perhaps based on
some kind of mechanistic reasoning) it's not a good idea to try to develop
an artificial empirical model that comprehends all the data. The fact is (a
horrible phrase) that no modeling at all is needed for the most important
message the data have to convey: rather, focus on the cause of the message
instead of statistical artifice. Once you have determined that, you may be
able to do something sensible. Clear thinking trumps muddy modeling every
time.

(Hopefully, this is sufficiently inflammatory that others will vigorously
and wisely dispute me).


-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of John Sorkin
> Sent: Wednesday, September 07, 2005 9:06 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Prediction with multiple zeros in the dependent variable
> I have a batch of data in which each line contains three values:
> calcium score, age, and sex. I would like to predict calcium scores as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately the
> calcium scores have a very "ugly distribution". There are multiple
> zeros, and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed; however, the values between 300 and 600 have a
> distribution that is log normal. As you might imagine, the residuals
> from the regression are not normally distributed and thus violate the
> basic assumption of regression analyses. Does anyone have a suggestion
> for a method (or a transformation) that will allow me to predict calcium
> from age and sex without violating the assumptions of the model?
> Thanks,
> John
> John Sorkin M.D., Ph.D.
> Chief, Biostatistics and Informatics
> Baltimore VA Medical Center GRECC and
> University of Maryland School of Medicine Claude Pepper OAIC
> University of Maryland School of Medicine
> Division of Gerontology
> Baltimore VA Medical Center
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> 410-605-7119 
> jsorkin at grecc.umaryland.edu