[R] generalized linear model (glm) and "stepAIC"

Sat Jul 11 15:50:43 CEST 2009

Simone Santoro wrote:
>> Simone Santoro wrote:
>>> 
>>> I have 12 response variables (species growth rates) and two
>>> environmental factors that I want to test to find out a possible
>>> relation.
>>> 
>>> The sample size is quite small: (7<n<12, depending on each
>>> species-case).
>>> 
>>> I performed a Shapiro test (shapiro.test) to test for normal
>>> distribution of the response variables, and they were normally
>>> distributed in 10 cases (out of 12 possible, i.e. 12 response
>>> variables).
>>> 
>> The Shapiro test is probably not very powerful for such a small 
>> data set -- i.e., the data could be non-normal (in fact it almost 
>> certainly *is* non-normal) but the deviation is not detectable ...
>> where do your growth rates come from? Can you make a guess at their
>> probable distribution?
>> 
>> 
>>>> The growth rates are calculated as ΔXt, where ΔXt = X(t+1) - Xt,
>>>> Xt is loge(Nt), and Nt is the population size at time t.
>>>> I use it rather than population size directly because in a
>>>> few cases (species population size trends) I found
>>>> autocorrelation (time lag = 1); "ΔXt", however, didn't
>>>> show autocorrelation and served my purpose equally well:
>>>> investigating whether "x1" or "x2" affected the population
>>>> dynamics of these species. I would expect "ΔXt" to be
>>>> normally distributed.

  This seems perfectly reasonable.
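
  A minimal sketch of that growth-rate calculation in R. The counts in
N are made-up numbers for illustration only, not data from this thread,
and the variable names are hypothetical.

```r
# Hypothetical population counts N_t for one species (made-up numbers)
N <- c(120, 135, 128, 150, 161, 149, 170, 182)

X  <- log(N)   # X_t = log_e(N_t)
dX <- diff(X)  # Delta X_t = X_(t+1) - X_t, one value fewer than N

# Lag-1 sample autocorrelation of the growth rates
r1 <- acf(dX, plot = FALSE)$acf[2]

dX
r1
```

  Note that differencing costs one observation per series, which
matters with samples this small.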

>> Why different procedures for different cases?
>> 
>> 
>>>> I don't understand whether you are suggesting that I use different
>>>> procedures for different cases or asking me why I used
>>>> different procedures for different cases. If the latter: I
>>>> didn't.

  I thought you said you tried models containing both x1 and x2 for 6 of
the cases and just x2 for the other 6. Maybe I was confused -- maybe
you were stating the results.
>> You would probably be better off just doing summary() and looking
>> at the p-values of the two predictors (if you must ...)
>> 
>> Why are you using AIC if you're interested in testing
>> relationships rather than prediction?
>> 
>> 
>>>> So, from reading the Whittingham et al. paper (thank you very
>>>> much) and your comments, I understand I would be
>>>> better off using the "full" model (just two predictors) and
>>>> taking into account the p-values of that model (not using the
>>>> stepAIC procedure), is that right?

  yes.
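
  As a rough sketch of what that looks like in practice: fit the full
two-predictor model once and read the coefficient table from summary().
The data below are simulated purely for illustration, and the names
dat, growth, x1 and x2 are placeholders, not the poster's variables.

```r
set.seed(1)

# Simulated stand-in data: n = 10 observations, two predictors
n   <- 10
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$growth <- 0.5 * dat$x2 + rnorm(n, sd = 0.3)

# Fit the full model directly, with no stepwise selection
full  <- lm(growth ~ x1 + x2, data = dat)

# Coefficient table: estimate, std. error, t value, p-value per term
coefs <- summary(full)$coefficients
coefs
```

  The "Pr(>|t|)" column holds the p-values for x1 and x2.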

>> 
>>> THE QUESTIONS:
>>> 
>>> 1) Can I trust in the existence of such a statistical relation? I
>>> mean: is there a way to know the power of this test in R?
>>> 
>> There are power tests in R, but I don't know if there are any
>> specifically for this case (two-predictor regression). Remember
>> that power is the complement of the probability of a type II
>> error (falsely failing to reject the null hypothesis).
>> 
>> 
>>>> Ok, on the other hand, I suppose that the small sample size
>>>> makes the existence of a statistical relation between the
>>>> predictor and the response variable even more reliable,
>>>> doesn't it?

  Actually, it means that you will only be able to detect large effects.
If the estimated effects are larger than seem sensible, then they
are quite likely spurious:

http://www.stat.columbia.edu/~gelman/research/published/power4r.pdf
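
  One generic way to get a feel for this is to estimate power by
simulation. The sketch below is not a method proposed in this thread:
sim_power() is a hypothetical helper, and the effect sizes, noise level
and number of simulations are arbitrary assumptions.

```r
set.seed(3)

# Hypothetical helper: fraction of simulated data sets in which the
# x1 coefficient of a two-predictor regression has p < alpha
sim_power <- function(n, beta, sigma = 1, nsim = 500, alpha = 0.05) {
  hits <- replicate(nsim, {
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- beta * x1 + rnorm(n, sd = sigma)  # x2 has no true effect
    p  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Pr(>|t|)"]
    p < alpha
  })
  mean(hits)
}

# Same sample size, small versus large true effect of x1
power_small <- sim_power(n = 10, beta = 0.3)
power_large <- sim_power(n = 10, beta = 1.5)
c(small = power_small, large = power_large)
```

  With n = 10, only the large effect should be detected in most of the
simulated data sets.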

>> 
>>> 2) I decided to use always "family=gaussian" because I have also 
>>> negative values in my response variable and I cannot manage it in
>>> a different way. In fact I was not able to use as link function a
>>> "negative binomial" as I previously did in SAS because of
>>> negative values of response variable (as R "told" me when I
>>> tried)
>>> 
>> Is this a question? As above, glm() with gaussian family and 
>> identity (default) link is equivalent to lm().
>> 
>> 
>>>> Yes, now I understand, and I am ashamed because I'm aware it is a
>>>> very basic statistical issue (I'm sorry!). But if I strongly
>>>> believe the response variable is normally distributed,
>>>> although the small sample size makes it difficult to test its
>>>> normality, can I use lm() without testing for normality? In
>>>> other words: can I assume on logical grounds that the
>>>> statistical population behind the sample is normally
>>>> distributed and consequently use lm()?

   I would say so.
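
   A small check of the glm()/lm() equivalence mentioned above, on
simulated data (the values and the names dat, growth, x1 and x2 are
illustrative assumptions only). The response includes negative values,
which the gaussian family accepts.

```r
set.seed(2)

# Simulated stand-in data; growth can be negative
n   <- 10
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$growth <- 0.4 * dat$x1 - 0.2 * dat$x2 + rnorm(n, sd = 0.3)

# Gaussian family with its default identity link
fit_glm <- glm(growth ~ x1 + x2, family = gaussian, data = dat)

# Ordinary least squares on the same formula and data
fit_lm  <- lm(growth ~ x1 + x2, data = dat)

# Compare the estimated coefficients of the two fits
same_coefs <- all.equal(coef(fit_glm), coef(fit_lm))
same_coefs
```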

  Ben Bolker
