[R] Problem with binomial gam{mgcv}

Sat Oct 10 13:40:07 CEST 2015

On 09/10/2015 8:00 PM, Erin Conlisk wrote:
> Hello,
> 
> I am having trouble testing for the significance using a binomial model in
> gam{mgcv}.  Have I stumbled on a bug?  I doubt I would be so lucky, so
> could someone tell me what I am doing wrong?
> 
> Please see the following code:
> ________________________________
> 
> # PROBLEM USING cbind
> 
> x1 <- runif(500, 0, 100)  # Create 500 random variables to use as my
> explanatory variable
> 
> y1 <- floor(runif(500, 0, 100)) # Create 500 random counts to serve as
> binomial "successes"
> 
> y2 <- 100-y1 # Create 500 binomial "failures", assuming a total of 100
> trials and the successes recorded in y1
> 
> Model <- gam(cbind(y1, y2) ∼ 1 + s(x1), family=binomial)
> summary(Model)
> ________________________________
> 
> The result is that my random variable, x1, is highly significant.  This
> can't be right...

The problem is that statistical significance of a test doesn't mean that
the alternative you have in mind is right, it just means that the null
is wrong (or you were unlucky, but let's ignore that).

The null hypothesis here is that all of the y1 values are independent
binomial values with a common n=100 and common unspecifed probability of
success.

Since in fact y1 comes from a discrete uniform distribution, that's
false, and the p-value is not anywhere near being a random Unif(0,1) value.

If you wanted the null hypothesis to be true, you'd need to choose a
success probability p somehow, then set y1 <- rbinom(500, size=100, prob
= p).  The random uniform has a far larger variance, and that leads to a
larger deviance in gam, hence significance.

Duncan Murdoch

> 
> So what happens when I change the observations from a "batch" of 100 trials
> to individual successes and failures?
> ________________________________
> 
> # NOW MAKE ALL THESE DATA 0 and 1
> 
> r01<-rep(0,500)
> data01<-cbind(x1, y1, y2, r01)
> rownames(data01)<-seq(1,500, 1)
> colnames(data01)<-c('x1', 'y1', 'y2', 'r01')
> data01<-data.frame(data01)
> 
> expanded0 <- data01[rep(row.names(data01), data01$y1), 1:4]  # Creates a
> replicate of the      #  explanatory variables for each individual "success"
> 
> r01<-rep(1,500)
> data01<-cbind(x1, y1, y2, r01)
> rownames(data01)<-seq(1,500, 1)
> colnames(data01)<-c('x1', 'y1', 'y2', 'r01')
> data01<-data.frame(data01)
> 
> expanded1 <- data01[rep(row.names(data01), data01$y2), 1:4]  # Creates a
> replicate of the      #  explanatory variables for each individual "failure"
> 
> data01<-rbind(expanded0,expanded1)
> 
> Model2 <- gam(r01 ∼ 1 + s(x1), family=binomial)
> summary(Model2)
> ___________________________________
> 
> The result is what I expect.  Now my random variable, x1, is NOT
> significant.
> 
> What is going on here?
> 
> I should say that I didn't just make this up.  My question arose playing
> with my real data, where I was getting high significance, but a terrible
> proportion of deviance explained.
> 
> My apologies if this is explained elsewhere, but I have spent hours
> searching for an answer online.
> 
> Thank you kindly,
> Erin Conlisk
>