[R] Error distribution question

Fri Mar 9 00:44:40 CET 2007

> > I was wondering if somebody could offer me some advice on which
> > error distribution would be appropriate for the type of data I have.
> > I'm studying what continuous predictor variables such as grooming
> > received, rank, etc. affect the amount of grooming given. This
> > response variable is continuous with many zeros, and so positively
> > skewed.
>
> This kind of variable is very common in prospecting (oil, mining)
> industries, and also in medical research. It's neither continuous
> nor discrete, because of the weight on zero. Basically, it is a
> combination of _two_ variables:
>
> X: a Bernoulli trial, such that p(X = 0) = 1 - p (failure) and
>    p(X = 1) = p (success)
>
> Y: the continuous variable that represents numerically the success
>
> So, we have the final variable as X * Y.

Indeed, a Tweedie distribution may be just what you are after.

> I realized in the Tweedie help page that one can use a specific response
> distribution  (Normal, Poisson, Compound Poisson, etc) by setting the
> variance power =  to a specific number. I'm a beginner, so I really don't
> follow then,  

This sounds like you have the  tweedie  package.

And yes, the variance.power tells you which distribution you have.
Tweedie distributions have a variance of the form var[Y] = phi * mu^p
for some variance.power  p.  (Note Tweedie distns belong to the
exponential family, so can be used in the generalized linear model
framework.)

The mixed distributions you talk about (continuous, plus a positive
mass at zero) correspond to tweedie distributions with 1 < p < 2.
(p=2 is the gamma; p=0 is Normal; p=3 is inverse Gaussian; p=1
and phi=1 is Poisson).
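
To see where the exact zeros come from when 1 < p < 2, here is a minimal
base-R sketch (no packages needed).  It uses the standard representation
of these Tweedie distns as a Poisson number of gamma-distributed amounts
added together: zero events gives an exact zero, otherwise you get a
positive continuous total.  The values of mu, phi and p are arbitrary
choices for illustration only, not anything estimated from your data.

```r
set.seed(1)

# Illustrative (assumed) parameter values
p   <- 1.6   # variance power, must satisfy 1 < p < 2
mu  <- 2     # mean
phi <- 1.5   # dispersion

# Poisson-gamma parameters implied by (mu, phi, p)
lambda <- mu^(2 - p) / (phi * (2 - p))   # mean number of events
shape  <- (2 - p) / (p - 1)              # gamma shape for one event
scale  <- phi * (p - 1) * mu^(p - 1)     # gamma scale for one event

n <- 1e5
N <- rpois(n, lambda)                    # number of events per observation

# A sum of N gammas (same scale) is gamma with shape N * shape;
# observations with N = 0 stay exactly zero.
y   <- numeric(n)
pos <- N > 0
y[pos] <- rgamma(sum(pos), shape = N[pos] * shape, scale = scale)

# Compare simulated quantities with their theoretical values
c(mean = mean(y), theory = mu)
c(var = var(y), theory = phi * mu^p)
c(prop.zero = mean(y == 0), theory = exp(-lambda))
```

The simulated mean, variance and proportion of zeros should each land
close to the theoretical value printed beside it.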

> which response distribution to use (i.e. what variance power) that would 
> be appropriate for continuous response data with many zeros. 

If you want to use a Tweedie distn in practice, you first need to know
*which* Tweedie distribution you need; that is, what value of p is
appropriate.  To do that, use the  tweedie.profile  function in
package  tweedie.  That will tell you what value of p is appropriate
for your data.  For the sake of an example, suppose you wish to fit
a model something like  Y ~ x1 + x2;  run  tweedie.profile  as below,
and suppose it gives p = 1.6:

tweedie.profile(Y ~ x1 + x2, p.vec=seq(1.1, 1.9, length=10), 
	do.plot=TRUE)

Then, you can fit the appropriate generalized linear model if you wish
as follows, using package  statmod:

glm( Y ~ x1 + x2, family=tweedie(var.power=1.6, link.power=0) )

(link.power=0 means a log link, which is commonly used.)
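
If you would like to try the fitting step before using your own data,
the following is a rough sketch on made-up data.  Everything here is an
assumption for illustration: the predictors x1 and x2, the "true"
coefficients, phi = 1, and p = 1.6 treated as already known (in practice
you would get p from  tweedie.profile  first).  It needs package  statmod
installed; the data are simulated in base R as a Poisson number of gamma
amounts, which is the 1 < p < 2 Tweedie case.

```r
library(statmod)   # provides the tweedie() family for glm()

set.seed(42)
n  <- 5000
x1 <- runif(n)
x2 <- rnorm(n)

# Assumed "true" model: log(mu) = 0.5 + 1.0 * x1 - 0.5 * x2
p   <- 1.6
phi <- 1
mu  <- exp(0.5 + 1.0 * x1 - 0.5 * x2)

# Simulate a response with exact zeros (Poisson-gamma construction)
lambda <- mu^(2 - p) / (phi * (2 - p))
N   <- rpois(n, lambda)
Y   <- numeric(n)
pos <- N > 0
Y[pos] <- rgamma(sum(pos),
                 shape = N[pos] * (2 - p) / (p - 1),
                 scale = phi * (p - 1) * mu[pos]^(p - 1))

# Fit the Tweedie glm with a log link (link.power = 0)
fit <- glm(Y ~ x1 + x2,
           family = tweedie(var.power = 1.6, link.power = 0))

mean(Y == 0)   # proportion of exact zeros in the simulated response
coef(fit)      # compare with the assumed values 0.5, 1.0, -0.5
```

The zeros stay in the response as they are; there is no need to add a
constant or model them separately, since the distn itself puts mass at
zero.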

Hope that's of some help.

P.
-- 
Dr Peter Dunn  |  dunn <at> usq.edu.au
Faculty of Sciences, USQ; http://www.sci.usq.edu.au/staff/dunn
Aust. Centre for Sustainable Catchments: www.usq.edu.au/acsc