[R] glm or transformation of the response?

joris meys jorismeys at gmail.com
Tue Nov 25 21:10:16 CET 2008


I reread the question, and I don't actually see the problem too much.
If you plot the transformed dataset from most of your models, the
problem is quite obvious : if you transform your predictor to the
logscale, the result of a a linear regression on those outcomes will
be naturally an exponential function. Exact the same thing happens if
you use the poisson family in a glm.

Now, there is no reason whatsoever that any exponential model would
fit the data, in contrary. The way your data is constructed, is
essentially linear. Indeed, in the model, the E(Y) for predictors 1:40
is 1:40, and you would expect a model with intercept 0 and slope 1,
whatever your variance does. Actually, the first model fits the data
far better than the second one, contrary to your beliefs. On the
second one, there is a clear trend in the residuals, indicating an
obvious deviation from the true model.

Why model 7 fits equally well, is because it is essentially exactly
the same as the classic linear model. The slight deviation comes from
the fact that lm uses least squares estimates, and glm uses maximum
likelihood.

The problem you create with heteroscedasticity is not one of fitting,
but one of confidence intervals for your estimated parameters. So my
suggestion to any student faced with this question would be to take a
look at the estimates for the variance, and the derived confidence
intervals, and try to find a more robust way of determining the
confidence intervals on the estimated parameters. Robust regression
might be an outcome here.

Kind regards
Joris



On Tue, Nov 25, 2008 at 3:52 PM, Christoph Scherber
<Christoph.Scherber at agr.uni-goettingen.de> wrote:
> Dear all,
>
> For an introductory course on glm?s I would like to create an example to
> show the difference between glm and transformation of the response. For
> this, I tried to create a dataset where the variance increases with the mean
> (as is the case in many ecological datasets):
>
>        poissondata=data.frame(
>        response=rpois(40,1:40),
>        explanatory=1:40)
>
>        attach(poissondata)
>
> However, I have run into a problem because it looks like the lm model (with
> sqrt-transformation) fits the data best:
>
> ##
>
> model1=lm(response~explanatory,poissondata)
> model2=lm(sqrt(response+0.5)~explanatory,poissondata)
> model3=lm(log(response+1)~explanatory,poissondata)
> model4=glm(response~explanatory,poissondata,family=poisson)
> model5=glm(response~explanatory,poissondata,family=quasipoisson)
> model6=glm.nb(response~explanatory,poissondata)
> model7=glm(response~explanatory,quasi(variance="mu",link="identity"))
>
>
> plot(explanatory,response,pch=16)
> lines(explanatory,predict(model1,explanatory=explanatory))
> lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2)
> lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3, col="green")
> lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red")
> lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue")
> lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green")
>
> ##
>
> The only model that performs equally well is model7.
>
> How would you deal with this kind of analysis? What would be your
> recommendation to the students, given the fact that most of the standard glm
> models obviously don?t seem to produce good fits here?
>
> Many thanks and best wishes
> Christoph
>
> (using R 2.8.0 on Windows XP)
>
>
>
>
>
> --
> Dr. rer.nat. Christoph Scherber
> University of Goettingen
> DNPW, Agroecology
> Waldweg 26
> D-37073 Goettingen
> Germany
>
> phone +49 (0)551 39 8807
> fax   +49 (0)551 39 8806
>
> Homepage http://www.gwdg.de/~cscherb1
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list