[R] logistic regression - glm() - example in Dalgaard's book ISwR

Marc Schwartz marc_schwartz at me.com
Sat Jul 3 17:30:41 CEST 2010


On Jul 3, 2010, at 9:00 AM, David Winsemius wrote:

> 
> On Jul 2, 2010, at 11:33 PM, Paulo Barata wrote:
> 
>> 
>> Dear R-list members,
>> 
>> I would like to pose a question about the use and results
>> of the glm() function for logistic regression calculations.
>> 
>> The question is based on an example provided on p. 229
>> in P. Dalgaard, Introductory Statistics with R, 2nd. edition,
>> Springer, 2008. By means of this example, I was trying to
>> practice the different ways of entering data in glm().
>> 
>> In his book, Dalgaard provides data from a case-based study
>> about hypertension summarized in the form of a table. He shows
>> two ways of entering the response (dependent) variable data
>> in glm(): (1) as a matrix of successes/failures (diseased/
>> healthy); (2) as the proportion of people diseased for each
>> combination of the independent variables' categories.
>> 
>> I tried to enter the response variable in glm() in a third
>> way: by reconstructing, from the table, the original data
>> in a case-based format, that is, a data frame in which
>> each row shows the data for one person. In this situation,
>> the response variable would be coded as a numeric 0/1 vector,
>> 0=failure, 1=success, and so it would be entered in glm() as
>> a numeric 0/1 vector.
>> 
>> The program below presents the calculations for each of the
>> three ways of entering data - the first and second methods
>> were just copied from Dalgaard's book.
>> 
>> The three methods produce the same estimated coefficients,
>> at least to five decimal places (although regression 3 seems
>> to produce slightly different std. errors).
>> 
>> My main question is: Why are the residual deviance, its
>> degrees of freedom and the AIC produced by regression 3
>> completely different when compared to those produced by
>> regressions 1 and 2 (which seem to produce equal results)?
>> It seems that the degrees of freedom in regressions 1
>> and 2 are based on the size (number of rows) of table d
>> (see the output of the program below), but this table is
>> just a way of summarizing the data. The degrees of
>> freedom in regressions 1 and 2 are not based on the
>> actual number of cases (people) examined, which is n=433.
> 
> I first encountered this phenomenon 25 years ago when using GLIM. The answer from my statistical betters was that the deviance is only defined up to a constant, so it is only differences in deviance that can be properly interpreted; the same situation exists with indefinite integrals in calculus. (The constant differs here because the saturated model the deviance is measured against changes: with grouped binomial data it has one parameter per covariate pattern, with case-based 0/1 data one per observation. The AIC differs for a similar reason: the binomial likelihood of the grouped data includes the binomial coefficients, which the Bernoulli likelihood does not.)
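> 
> A minimal check with the fitted objects from the script below (these
> are standard components of a "glm" fit; the AIC identity is from
> memory, so verify it on your machine):
> 
> ## the model chi-square (null minus residual deviance) and its df
> ## are invariant to how the data are grouped
> regression1$null.deviance - regression1$deviance  # grouped fit
> regression3$null.deviance - regression3$deviance  # case-based fit
> regression1$df.null - regression1$df.residual     # 3 either way
> regression3$df.null - regression3$df.residual
> 
> ## the AICs should differ by twice the summed log binomial
> ## coefficients, i.e. the constant dropped by grouping the data
> AIC(regression3) - AIC(regression1)
> 2 * sum(lchoose(n.tot, n.hyp))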
>> 
>> I understand that no matter the way of entering the data
>> in glm(), we are always analyzing the same data, which
>> are those presented in table format on Dalgaard's page
>> 229 (these data are in data.frame d in the program below).
>> So I understand that the three ways of entering data
>> in glm() should produce the same results.
> 
> The deviance differences between equivalent nested models should remain the same (up to numerical accuracy):
> 
> Case-based fit (regression 3):
>   null deviance:     411.42   on 432 df
>   residual deviance: 398.92   on 429 df
>   difference:         12.50   on   3 df
> 
> Grouped fit (regressions 1 and 2):
>   null deviance:      14.1259 on   7 df
>   residual deviance:   1.6184 on   4 df
>   difference:         12.5075 on   3 df
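> 
> The same comparison can be made directly with anova(); a sketch using
> the objects from your script:
> 
> ## likelihood-ratio test of the three predictors against the
> ## intercept-only model; chi-square and df agree across groupings
> anova(update(regression1, . ~ 1), regression1, test = "Chisq")
> anova(update(regression3, . ~ 1), regression3, test = "Chisq")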
> 
>> 
>> Secondarily, why are the std.errors in regression 3 slightly
>> different from those calculated in regressions 1 and 2?
> 
> You mean the differences 4 places to the right of the decimal??? The two layouts give mathematically identical likelihoods (up to a constant), so any remaining discrepancy is presumably just the IWLS iterations stopping at slightly different points.
> 
>> 
>> I am using R version 2.11.1 on Windows XP.
>> 
>> Thank you very much.
>> 
>> Paulo Barata
>> 
>> ##== begin =================================================
>> 
>> ## data in: P. Dalgaard, Introductory Statistics with R,
>> ## 2nd. edition, Springer, 2008
>> ## logistic regression - example in Dalgaard's Section 13.2,
>> ## page 229
>> 
>> rm(list=ls())
> 
> Personally, I am rather annoyed when people post this particular line without commenting it out. It basically says that your code is far more important than whatever else might be in a user's workspace.
> 
>> 
>> ## data provided on Dalgaard's page 229:
>> no.yes <- c("No","Yes")
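>> ## gl(n, k, length, labels) generates a factor with n levels,
>> ## each repeated in blocks of k and recycled to 'length'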
>> smoking <- gl(2,1,8,no.yes)
>> obesity <- gl(2,2,8,no.yes)
>> snoring <- gl(2,4,8,no.yes)
>> n.tot <- c(60,17,8,2,187,85,51,23)
>> n.hyp <- c(5,2,1,0,35,13,15,8)
>> 
>> d <- data.frame(smoking,obesity,snoring,n.tot,n.hyp)
>> ## d is the data to be analyzed, in table format
>> ## d is the first table on Dalgaard's page 229
>> ## n.tot = total number of cases
>> ## n.hyp = people with hypertension
>> d
>> 
>> ## regression 1: Dalgaard's page 230
>> ## response variable entered in glm() as a matrix of
>> ## successes/failures
>> hyp.tbl <- cbind(n.hyp,n.tot-n.hyp)
>> regression1 <- glm(hyp.tbl~smoking+obesity+snoring,
>>                  family=binomial("logit"))
>> 
>> ## regression 2: Dalgaard's page 230
>> ## response variable entered in glm() as proportions
>> prop.hyp <- n.hyp/n.tot
>> regression2 <- glm(prop.hyp~smoking+obesity+snoring,
>>                  weights=n.tot,family=binomial("logit"))
>> 
>> ## regression 3 (well below): data entered in glm()
>> ## by means of 'reconstructed' variables
>> ## variables with names beginning with 'r' are
>> ## 'reconstructed' from data in data.frame d.
>> ## The objective is to reconstruct the original
>> ## data from which the table on Dalgaard's page 229
>> ## has been produced
>> 
>> rsmoking <- c(rep('No',d[1,4]),rep('Yes',d[2,4]),
>>             rep('No',d[3,4]),rep('Yes',d[4,4]),
>>             rep('No',d[5,4]),rep('Yes',d[6,4]),
>>             rep('No',d[7,4]),rep('Yes',d[8,4]))
>> rsmoking <- factor(rsmoking)
>> length(rsmoking)  # just a check
>> 
>> robesity <- c(rep('No', d[1,4]),rep('No', d[2,4]),
>>             rep('Yes',d[3,4]),rep('Yes',d[4,4]),
>>             rep('No', d[5,4]),rep('No', d[6,4]),
>>             rep('Yes',d[7,4]),rep('Yes',d[8,4]))
>> robesity <- factor(robesity)
>> length(robesity)  # just a check
>> 
>> rsnoring <- c(rep('No', d[1,4]),rep('No', d[2,4]),
>>             rep('No', d[3,4]),rep('No', d[4,4]),
>>             rep('Yes',d[5,4]),rep('Yes',d[6,4]),
>>             rep('Yes',d[7,4]),rep('Yes',d[8,4]))
>> rsnoring <- factor(rsnoring)
>> length(rsnoring)  # just a check
>> 
>> rhyp <- c(rep(1,d[1,5]),rep(0,d[1,4]-d[1,5]),
>>         rep(1,d[2,5]),rep(0,d[2,4]-d[2,5]),
>>         rep(1,d[3,5]),rep(0,d[3,4]-d[3,5]),
>>         rep(1,d[4,5]),rep(0,d[4,4]-d[4,5]),
>>         rep(1,d[5,5]),rep(0,d[5,4]-d[5,5]),
>>         rep(1,d[6,5]),rep(0,d[6,4]-d[6,5]),
>>         rep(1,d[7,5]),rep(0,d[7,4]-d[7,5]),
>>         rep(1,d[8,5]),rep(0,d[8,4]-d[8,5]))
>> length(rhyp)      # just a check
>> 
>> sum(n.tot)  # just a check
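>> 
>> ## Aside: rep() also works on factors, so the reconstruction above
>> ## can be written more compactly; an equivalent sketch with the
>> ## same variable names (untested here):
>> ##   rsmoking <- rep(smoking, n.tot)
>> ##   robesity <- rep(obesity, n.tot)
>> ##   rsnoring <- rep(snoring, n.tot)
>> ##   rhyp     <- unlist(mapply(function(y, n) rep(1:0, c(y, n - y)),
>> ##                             n.hyp, n.tot))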
>> 
>> ## data.frame(rsmoking,robesity,rsnoring,rhyp) would provide
>> ## the data in a case-based format - each row would present
>> ## data for one case (one person), with response variable
>> ## coded 0/1 for No/Yes
>> ## The first five cases (people in the study):
>> data.frame(rsmoking,robesity,rsnoring,rhyp)[1:5,]
>> 
>> ## regression 3: response variable entered in glm()
>> ## as a numeric 0/1 vector
>> regression3 <- glm(rhyp~rsmoking+robesity+rsnoring,
>>                  family=binomial("logit"))
>> 
>> summary(regression1)
>> summary(regression2)
>> summary(regression3)
>> 
>> ## note different residual deviance, degrees of freedom
>> ## and AIC between regression 3 and regressions 1 and 2.
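>> 
>> ## side-by-side comparison of the coefficient estimates
>> ## (they should agree to roughly five decimals):
>> cbind(coef(regression1), coef(regression2), coef(regression3))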
>> 
>> ##== end =================================================


FWIW, this recent (last month) reply from Doug on a similar query may be helpful:

  http://tolstoy.newcastle.edu.au/R/e10/help/10/06/7780.html

HTH,

Marc Schwartz


