[R] What is the most useful way to detect nonlinearity in logistic regression?

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sun Dec 5 14:17:21 CET 2004


(Ted Harding) wrote:
> On 05-Dec-04 Patrick Foley wrote:
> 
>>It is easy to spot response nonlinearity in normal linear
>>models using plot(something.lm).
>>However plot(something.glm) produces artifactual peculiarities
>>since the diagnostic residuals are constrained  by the fact
>>that y can only take values 0 or 1.
>>What do R users find most useful in checking the linearity
>>assumption of logistic regression (i.e. log-odds =a+bx)?
>>
>>Patrick Foley
>>patfoley at csus.edu
> 
> 
> The "most useful way to detect nonlinearity in logistic
> regression" is:
> 
> a) have an awful lot of data
> b) have the x (covariate) values judiciously placed.
> 
> Don't be optimistic about this problem. The amount of
> information, especially about non-linearity, in the binary
> responses is often a lot less than people intuitively expect.
> 
> This is an area where R can be especially useful for
> self-education by exploring possibilities and simulation.
> 
> For example, define the function (for quadratic nonlinearity):
> 
>   testlin2<-function(a,b,N){
>     x<-c(-1.0,-0.5,0.0,0.5,1.0)
>     lp<-a*x+b*x^2; p<-exp(lp)/(1+exp(lp))
>     n<-N*c(1,1,1,1,1)
>     r<-c(rbinom(1,n[1],p[1]),rbinom(1,n[2],p[2]),
>          rbinom(1,n[3],p[3]),rbinom(1,n[4],p[4]),
>          rbinom(1,n[5],p[5])
>         )
>     resp<-cbind(r,n-r)
>     X<-cbind(x,x^2);colnames(X)<-c("x","x2")
>     summary(glm(formula = resp ~ X - 1,
>             family = binomial),correlation=TRUE)
>   }
> 
> This places N observations at each of (-1.0,-0.5,0.0,0.5,1.0),
> generates the N binary responses with probability p(x)
> where log(p/(1-p)) = a*x + b*x^2, fits a logistic regression
> forcing the "intercept" term to be 0 (so that you're not
> diluting the info by estimating a parameter you know to be 0),
> and returns the summary(glm(...)) from which the p-values
> can be extracted:
> 
> The p-value for x^2 is testlin2(a,b,N)$coefficients[2,4]
> 
> You can run this function as a one-off for various values of
> a, b, N to get a feel for what happens. You can run a simulation
> on the lines of
> 
>   pvals<-numeric(1000);
>   for(i in (1:1000)){
>     pvals[i]<-testlin2(1,0.1,500)$coefficients[2,4]
>   }
> 
> so that you can test how often you get a "significant" result.
> 
> For example, adopting the ritual "significant == P<0.05,
> power = 80%", you can see a histogram of the p-values
> over the conventional "significance breaks" with
> 
> hist(pvals,breaks=c(0,0.01,0.05,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)
> 
> and you can see your probability of getting a "significant" result
> as e.g. sum(pvals < 0.05)/1000
> 
> I found that, with testlin2(1,0.1,N), i.e. a = 1.0, b = 0.1
> corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly
> interesting degree of nonlinearity), I had to go up to N=2000
> before I was getting more than 80% of the p-values < 0.05.
> That corresponds to 2000 observations at each value of x, or
> 10,000 observations in all.
> 
> Compare this with a similar test for non-linearity with
> normally-distributed responses [exercise for the reader].
> 
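> For instance, a rough (untested) sketch of the normal-response analogue,
> where the residual SD "sigma" is an assumed extra parameter:
> 
>   testlin2.norm<-function(a,b,N,sigma=1){
>     x<-rep(c(-1.0,-0.5,0.0,0.5,1.0),each=N)
>     y<-a*x+b*x^2+rnorm(length(x),0,sigma)    # Gaussian responses
>     X<-cbind(x,x^2);colnames(X)<-c("x","x2")
>     summary(lm(y ~ X - 1))
>   }
>   # p-value for the x^2 term, as before:
>   #   testlin2.norm(1,0.1,500)$coefficients[2,4]
> 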
> You can write functions similar to testlin2 for higher-order
> nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for
> a*x + b*x^2 + c*x^3, etc., (the modifications required are
> obvious) and see how you get on. As I say, don't be optimistic!
> 
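> For instance, testlin3 might look like this (an assumed sketch, simply
> following the same pattern with x^3 in place of x^2):
> 
>   testlin3<-function(a,b,N){
>     x<-c(-1.0,-0.5,0.0,0.5,1.0)
>     lp<-a*x+b*x^3; p<-exp(lp)/(1+exp(lp))
>     n<-rep(N,5)
>     r<-rbinom(5,n,p)          # rbinom vectorises over n and p
>     resp<-cbind(r,n-r)
>     X<-cbind(x,x^3);colnames(X)<-c("x","x3")
>     summary(glm(formula = resp ~ X - 1,
>             family = binomial),correlation=TRUE)
>   }
> 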
> In particular, run testlin3 a few times and see the sort of
> mess that can come out -- in particular gruesome correlations,
> which is why "correlation=TRUE" is set in the call to
> summary(glm(...),correlation=TRUE).
> 
> Best wishes,
> Ted.

library(Design)  # also requires Hmisc
f <- lrm(sick ~ sex + rcs(age,4) + rcs(blood.pressure,5))
# Restricted cubic spline in age with 4 knots, blood.pressure with 5
anova(f)  # automatic tests of linearity of all predictors
latex(f)  # see algebraic form of fit
summary(f)  # get odds ratios for meaningful changes in predictors
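
A self-contained toy example (simulated data; the variable names and the
generating model below are made up for illustration, not from a real
dataset) might look like:

set.seed(1)
n <- 1000
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
sex            <- factor(sample(c('female','male'), n, replace=TRUE))
# true log-odds deliberately made nonlinear in age
logit <- 0.03*(age-50) + 0.002*(age-50)^2 + 0.4*(sex=='male') +
         0.01*(blood.pressure-120)
sick  <- rbinom(n, 1, plogis(logit))
d  <- data.frame(sick, sex, age, blood.pressure)
dd <- datadist(d); options(datadist='dd')  # needed before summary(f)
f  <- lrm(sick ~ sex + rcs(age,4) + rcs(blood.pressure,5), data=d)
anova(f)  # the "Nonlinear" rows test the spline terms for age and b.p.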

But beware of using the tests of linearity.  If non-significant results 
cause you to reduce an effect to linear, confidence interval coverage and 
type I error rates are no longer preserved.  I use tests of linearity 
mainly to demonstrate that effects are more often nonlinear than linear, 
given sufficient sample size; i.e., good analysts are needed.  I usually 
leave non-significant nonlinearities in the model.

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



