[R] logistic regression - glm() - example in Dalgaard's book ISwR

David Winsemius dwinsemius at comcast.net
Sat Jul 3 16:00:07 CEST 2010


On Jul 2, 2010, at 11:33 PM, Paulo Barata wrote:

>
> Dear R-list members,
>
> I would like to pose a question about the use and results
> of the glm() function for logistic regression calculations.
>
> The question is based on an example provided on p. 229
> in P. Dalgaard, Introductory Statistics with R, 2nd. edition,
> Springer, 2008. By means of this example, I was trying to
> practice the different ways of entering data in glm().
>
> In his book, Dalgaard provides data from a case-based study
> about hypertension summarized in the form of a table. He shows
> two ways of entering the response (dependent) variable data
> in glm(): (1) as a matrix of successes/failures (diseased/
> healthy); (2) as the proportion of people diseased for each
> combination of the independent variables' categories.
>
> I tried to enter the response variable in glm() in a third
> way: by reconstructing, from the table, the original data
> in a case-based format, that is, a data frame in which
> each row shows the data for one person. In this situation,
> the response variable would be coded as a numeric 0/1
> vector (0 = failure, 1 = success) and entered in glm()
> in that form.
>
> The program below presents the calculations for each of the
> three ways of entering data - the first and second methods
> were just copied from Dalgaard's book.
>
> The three methods produce the same results with regard
> to the estimated coefficients, when the output is seen
> with five decimals (although regression 3 seems to have
> produced slightly different std.errors).
>
> My main question is: Why are the residual deviance, its
> degrees of freedom and the AIC produced by regression 3
> completely different when compared to those produced by
> regressions 1 and 2 (which seem to produce equal results)?
> It seems that the degrees of freedom in regressions 1
> and 2 are based on the size (number of rows) of table d
> (see the output of the program below), but this table is
> just a way of summarizing the data. The degrees of
> freedom in regressions 1 and 2 are not based on the
> actual number of cases (people) examined, which is n=433.

I first encountered this phenomenon 25 years ago when using GLIM. The
answer from my statistical betters was that the deviance is only
defined up to a constant, and that it is only differences in deviance
that can be properly interpreted. The constant comes from the
saturated model, which differs between the two forms of the data:
eight binomial cells in regressions 1 and 2, versus 433 Bernoulli
observations in regression 3. The same situation exists with
indefinite integrals in calculus.
>
> I understand that no matter how the data are entered
> in glm(), we are always analyzing the same data, which
> are those presented in table format on Dalgaard's page
> 229 (these data are in data.frame d in the program below).
> So I understand that the three ways of entering data
> in glm() should produce the same results.

The differences between equivalent nested models should remain the
same (up to numerical accuracy). Compare null minus residual deviance
in each parameterization:

Regression 3 (433 Bernoulli observations):
   null deviance      411.42   on 432 degrees of freedom
 - residual deviance  398.92   on 429
   -------------------------------------
   difference          12.50   on   3

Regressions 1 and 2 (8 binomial cells):
   null deviance       14.1259 on   7 degrees of freedom
 - residual deviance    1.6184 on   4
   -------------------------------------
   difference          12.5075 on   3
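
You can pull those numbers straight from the fitted objects. A sketch
using the names from the script below (both differences come out at
roughly 12.5 on 3 df):

  ## null minus residual deviance, grouped and ungrouped
  regression1$null.deviance - regression1$deviance
  regression3$null.deviance - regression3$deviance
  ## or let anova() do the bookkeeping
  anova(regression1, test = "Chisq")
  anova(regression3, test = "Chisq")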

>
> Secondarily, why are the std.errors in regression 3 slightly
> different from those calculated in regressions 1 and 2?

You mean the differences 4 places to the right of the decimal? Those
are numerical noise. The grouped and ungrouped likelihoods differ
only by a constant, so the standard errors agree mathematically; the
two fits simply stop iterating at very slightly different points.
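
If you want to convince yourself of that, compare the estimated
covariance matrices directly. A sketch using the objects from the
script below (the dimnames differ, smoking vs. rsmoking, but the
values should agree to roughly the convergence tolerance of the IRLS
iterations):

  max(abs(vcov(regression1) - vcov(regression3)))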

>
> I am using R version 2.11.1 on Windows XP.
>
> Thank you very much.
>
> Paulo Barata
>
> ##== begin =================================================
>
> ## data in: P. Dalgaard, Introductory Statistics with R,
> ## 2nd. edition, Springer, 2008
> ## logistic regression - example in Dalgaard's Section 13.2,
> ## page 229
>
> rm(list=ls())

Personally, I am rather annoyed when people post this particular line
without commenting it out. It basically says that your code is much
more important than whatever else might be in a user's workspace.

>
> ## data provided on Dalgaard's page 229:
> no.yes <- c("No","Yes")
> smoking <- gl(2,1,8,no.yes)
> obesity <- gl(2,2,8,no.yes)
> snoring <- gl(2,4,8,no.yes)
> n.tot <- c(60,17,8,2,187,85,51,23)
> n.hyp <- c(5,2,1,0,35,13,15,8)
>
> d <- data.frame(smoking,obesity,snoring,n.tot,n.hyp)
> ## d is the data to be analyzed, in table format
> ## d is the first table on Dalgaard's page 229
> ## n.tot = total number of cases
> ## n.hyp = people with hypertension
> d
>
> ## regression 1: Dalgaard's page 230
> ## response variable entered in glm() as a matrix of
> ## successes/failures
> hyp.tbl <- cbind(n.hyp,n.tot-n.hyp)
> regression1 <- glm(hyp.tbl~smoking+obesity+snoring,
>                   family=binomial("logit"))
>
> ## regression 2: Dalgaard's page 230
> ## response variable entered in glm() as proportions
> prop.hyp <- n.hyp/n.tot
> regression2 <- glm(prop.hyp~smoking+obesity+snoring,
>                   weights=n.tot,family=binomial("logit"))
>
> ## regression 3 (well below): data entered in glm()
> ## by means of 'reconstructed' variables
> ## variables with names beginning with 'r' are
> ## 'reconstructed' from data in data.frame d.
> ## The objective is to reconstruct the original
> ## data from which the table on Dalgaard's page 229
> ## has been produced
>
> rsmoking <- c(rep('No',d[1,4]),rep('Yes',d[2,4]),
>              rep('No',d[3,4]),rep('Yes',d[4,4]),
>              rep('No',d[5,4]),rep('Yes',d[6,4]),
>              rep('No',d[7,4]),rep('Yes',d[8,4]))
> rsmoking <- factor(rsmoking)
> length(rsmoking)  # just a check
>
> robesity <- c(rep('No', d[1,4]),rep('No', d[2,4]),
>              rep('Yes',d[3,4]),rep('Yes',d[4,4]),
>              rep('No', d[5,4]),rep('No', d[6,4]),
>              rep('Yes',d[7,4]),rep('Yes',d[8,4]))
> robesity <- factor(robesity)
> length(robesity)  # just a check
>
> rsnoring <- c(rep('No', d[1,4]),rep('No', d[2,4]),
>              rep('No', d[3,4]),rep('No', d[4,4]),
>              rep('Yes',d[5,4]),rep('Yes',d[6,4]),
>              rep('Yes',d[7,4]),rep('Yes',d[8,4]))
> rsnoring <- factor(rsnoring)
> length(rsnoring)  # just a check
>
> rhyp <- c(rep(1,d[1,5]),rep(0,d[1,4]-d[1,5]),
>          rep(1,d[2,5]),rep(0,d[2,4]-d[2,5]),
>          rep(1,d[3,5]),rep(0,d[3,4]-d[3,5]),
>          rep(1,d[4,5]),rep(0,d[4,4]-d[4,5]),
>          rep(1,d[5,5]),rep(0,d[5,4]-d[5,5]),
>          rep(1,d[6,5]),rep(0,d[6,4]-d[6,5]),
>          rep(1,d[7,5]),rep(0,d[7,4]-d[7,5]),
>          rep(1,d[8,5]),rep(0,d[8,4]-d[8,5]))
> length(rhyp)      # just a check
>
> sum(n.tot)  # just a check
>
> ## data.frame(rsmoking,robesity,rsnoring,rhyp) would provide
> ## the data in a case-based format - each row would present
> ## data for one case (one person), with response variable
> ## coded 0/1 for No/Yes
> ## The five first cases (people in the study):
> data.frame(rsmoking,robesity,rsnoring,rhyp)[1:5,]
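
As an aside, the reconstruction can be written much more compactly. A
sketch ('rdata' is just a name I made up): repeat each row of d
according to its n.tot, then lay down the 1s and 0s cell by cell:

  rdata <- d[rep(seq_len(nrow(d)), d$n.tot),
             c("smoking", "obesity", "snoring")]
  rdata$rhyp <- unlist(Map(function(y, n) rep(c(1, 0), c(y, n - y)),
                           d$n.hyp, d$n.tot))
  ## glm(rhyp ~ smoking + obesity + snoring, data = rdata,
  ##     family = binomial("logit")) should then reproduce regression3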
>
> ## regression 3: response variable entered in glm()
> ## as a numeric 0/1 vector
> regression3 <- glm(rhyp~rsmoking+robesity+rsnoring,
>                   family=binomial("logit"))
>
> summary(regression1)
> summary(regression2)
> summary(regression3)
>
> ## note different residual deviance, degrees of freedom
> ## and AIC between regression 3 and regressions 1 and 2.
>
> ##== end =================================================
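
Regarding the AIC: the same constant is at work there. The grouped
binomial log-likelihood includes the log binomial coefficients, which
the 0/1 Bernoulli form lacks, so the two AICs differ by exactly twice
their sum. A quick check (a sketch, using the objects from your
script):

  AIC(regression3) - AIC(regression1)
  2 * sum(lchoose(n.tot, n.hyp))    # should match the line above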
>
> ----------------------------------------------------------
> Paulo Barata
> Fundacao Oswaldo Cruz - Oswaldo Cruz Foundation


David Winsemius, MD
West Hartford, CT


