[R] Different results in glm() probit model using vector vs. two-column matrix response

Thu Dec 30 23:03:14 CET 2010

On Dec 30, 2010, at 22:01 , Lensing, Shelly Y wrote:

(Once will do...)

> Hi - I am fitting a probit model using glm(), and the deviance and residual degrees of freedom are different depending on whether I use a binary response vector of length 80 or a two-column matrix response (10 rows) with the number of success and failures in each column. I would think that these would be just two different ways of specifying the same model, but this does not appear to be the case. 
> 
> Binary response vector gives:
> Residual deviance:  43.209  on 77  degrees of freedom
> 
> Two-column matrix response gives:
> Residual deviance:  4.9204  on 7  degrees of freedom
> 
> I'd like to understand why the two-column response format gives a residual degrees of freedom of 7, and why the weights for one is nearly, but not exactly, a multiple of the other. I need the deviance, df, and weights for another formula, which is why I'm focused on these. My code is below. Thank you in advance for any assistance! Shelly
> 

Easy: When given in the binary form, glm() has no way to know that your data comes from 10 homogenous groups (consider extending the model with a covariate like z <- runif(80) ). So the "saturated model" is simply bigger. To  recover the 7 DF deviance, interject a model that represents the the 10 groups (i.e., unit <- gl(10,8); fu <- glm(y~unit,.....); anova(fu, f180),  provided that I follow your code correctly).

> ****
> 
> # 10 record set-up
> group <- gl(2, 5, 10, labels=c("U","M"))    
> dose  <- rep(c(7, 8, 9, 10, 11), 2)
> ldose <- log10(dose)  
> n     <- c(8,8,8,8,8,8,8,8,8,8)
> r     <- c(0,1,3,8,8,0,0,0,4,5)
> p     <- r/n
> d     <- data.frame(group, dose, ldose, n, r, p)
> SF <- cbind(success=d$r, failure=d$n - d$r)
> 
> #80 record set-up
> dose2<-c(7,8,9,10,11)
> doserep<-sort(rep(dose2,8))
> x<-c(doserep,doserep)
> log10x<-log10(x)
> y_U<-c(rep(0,8), 1, rep(0, 7), 1, 1, 1, rep(0,5), rep(1, 16))
> y_M<-c(rep(0,24), rep(1,4), rep(0,4), rep(1,5), rep(0,3))
> y<-c(y_U, y_M)
> trt<-c(rep(1, 40), rep(0, 40))
> 
> # print x & y's for both
> SF
> y
> ldose
> log10x
> 
> # analysis with 10 records and 80 records
> f1 <- glm(SF ~ group + ldose, family=binomial(link="probit"))
> f3 <- glm(SF ~         ldose, family=binomial(link="probit"))
> f180 <- glm(y ~ trt + log10x, family=binomial(link="probit"))
> f380 <- glm(y ~       log10x, family=binomial(link="probit"))
> 
> summary(f1)
> summary(f180)
> 
> f1$weights
> f180$weights
> # check weights divided by 8 to see if match -- match several decimal places, 
> # but not exactly
> f1$weights/8
> 
> ****
> 
> Shelly Lensing
> Biostatistics / University of Arkansas for Medical Sciences
> 
> Confidentiality Notice: This e-mail message, including a...{{dropped:7}}
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com