[R] bigglm() results different from glm()

Francisco J. Zagmutt gerifalte28 at hotmail.com
Wed Mar 18 02:33:14 CET 2009


Dear Thomas and John,

Thanks for your prompt replies and for looking at your code to explain 
these differences. I see you replied very late at night, so I am sorry 
if my little problem kept you awake!

As you pointed out, the model indeed converges (as reported in 
model$converged) once I allow a larger number of iterations via the 
maxit argument; the check is shown after the refit below.

A very minor comment: it seems that summary.biglm() does not respect 
options(digits) when printing the estimates.  For example, using the 
same data and models as before:

 > require(biglm)
 > options(digits=8)
 > dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)), ttment = gl(2, 50000))
 > m1 <- glm(y ~ ttment, data = dat, family = poisson(link = "log"))
 > m1big <- bigglm(y ~ ttment, data = dat, family = poisson(link = "log"), maxit = 20)
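
With maxit = 20 the refit indeed reports convergence:

 > m1big$converged
[1] TRUE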

 > summary(m1)
<Snipped>
Coefficients:
              Estimate Std. Error z value  Pr(>|z|)
(Intercept) 2.3019509  0.0014147 1627.21 < 2.2e-16 ***
ttment2     0.4052002  0.0018264  221.86 < 2.2e-16 ***

 > summary(m1big)
              Coef  (95%   CI)    SE p
(Intercept) 2.302 2.299 2.305 0.001 0
ttment2     0.405 0.402 0.409 0.002 0


To get more digits I can extract the point estimates with coef(m1big), 
but after looking at str(m1big), the only way I could figure out how to 
extract the p-values was:

 > summary(m1big)$mat[,"p"]
(Intercept)     ttment2
           0           0
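
One workaround that shows more digits (assuming the $mat component is a 
plain numeric matrix, as the extraction above suggests) is to print it 
directly with a digits argument:

 > print(summary(m1big)$mat, digits = 8)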

Is there a way I can get summary.biglm() to report more digits directly?

Thanks again,

Francisco



-- 
Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com


Thomas Lumley wrote:
> 
> Yes, the slow convergence is easier to get with the log link.  
> Overshooting the correct coefficient vector has more dramatic effects on 
> the fitted values and weights, and in this example the starting value of 
> (0,0) is a long way from the truth.   The same sort of thing happens in 
> the Cox model, where there are real data sets that will cause numeric 
> overflow in a careless implementation.
> 
> It might be worth trying to guess better starting values: saving an 
> iteration or two would be useful with large data sets.
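> 
> For instance, with a log link one could seed the intercept at the log of 
> the mean response (a sketch only; it assumes the start argument of 
> bigglm() takes a coefficient vector on the link scale):
> 
>     m1big <- bigglm(y ~ ttment, data = dat,
>                     family = poisson(link = "log"),
>                     start = c(log(mean(dat$y)), 0))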
> 
>      -thomas
> 
> 
> On Tue, 17 Mar 2009, John Fox wrote:
> 
>> Dear Francisco,
>>
>> I was able to duplicate the problem that you reported, and in addition
>> discovered that the problem seems to be peculiar to the Poisson family.
>> lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
>> results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
>> Another example, with the binomial family:
>>
>>> set.seed(12345)
>>> dat = data.frame(y = c(rpois(50000, 10), rpois(50000, 15)), ttment = gl(2, 50000))
>>> dat$y0 <- ifelse(dat$y > 12, 1, 0)
>>> m1b <- glm(y0~ttment, data=dat, family=binomial)
>>> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
>>> coef(m1b)
>> (Intercept)     ttment2
>>   -1.33508     2.34263
>>> coef(m1bigb)
>> (Intercept)     ttment2
>>   -1.33508     2.34263
>>> deviance(m1b)
>> [1] 109244
>>> deviance(m1bigb)
>> [1] 109244
>>
>> I'm copying this message to Thomas Lumley, who's the author and 
>> maintainer
>> of the biglm package.
>>
>> Regards,
>> John
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Francisco J. Zagmutt
>>> Sent: March-16-09 10:26 PM
>>> To: r-help at stat.math.ethz.ch
>>> Subject: [R] bigglm() results different from glm()
>>>
>>> Dear all,
>>>
>>> I am using the biglm package to fit a few GLMs to a large dataset (3
>>> million rows, 6 columns).  While trying to fit a Poisson GLM I noticed
>>> that the coefficient estimates were very different from what I obtained
>>> when estimating the model on a smaller dataset using glm(), so I wrote
>>> a very basic toy example to compare the results of bigglm() against a
>>> glm() call.  Consider the following code:
>>>
>>>
>>> > require(biglm)
>>> > options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
>>> > dat = data.frame(y = c(rpois(50000, 10), rpois(50000, 15)), ttment = gl(2, 50000))
>>> > m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>>> > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>>> > summary(m1)
>>>
>>> <snipped output for this email>
>>> Coefficients:
>>>              Estimate Std. Error z value Pr(>|z|)
>>> (Intercept)  2.30305    0.00141    1629   <2e-16 ***
>>> ttment2      0.40429    0.00183     221   <2e-16 ***
>>>
>>>      Null deviance: 151889  on 99999  degrees of freedom
>>> Residual deviance: 101848  on 99998  degrees of freedom
>>> AIC: 533152
>>>
>>> > summary(m1big)
>>> Large data regression model: bigglm(y ~ ttment, data = dat, family =
>>> poisson(link = "log"))
>>> Sample size =  100000
>>>               Coef  (95%   CI)    SE p
>>> (Intercept) 2.651 2.650 2.653 0.001 0
>>> ttment2     4.346 4.344 4.348 0.001 0
>>>
>>> > m1big$deviance
>>> [1] 287158986
>>>
>>>
>>> Notice that the coefficients and deviance are quite different in the
>>> model estimated using bigglm(). If I vary the chunksize argument over
>>> seq(1000, 10000, 1000), the estimates remain the same (see the sketch
>>> below).
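>>>
>>> A hypothetical version of that check (bigglm()'s data-frame method
>>> takes a chunksize argument):
>>>
>>> > for (cs in seq(1000, 10000, 1000))
>>> +     print(coef(bigglm(y ~ ttment, data = dat,
>>> +                       family = poisson(link = "log"), chunksize = cs)))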
>>>
>>> Can someone help me understand what is causing these differences?
>>>
>>> Here is my version info:
>>>
>>> > version
>>>                 _
>>> platform       i386-pc-mingw32
>>> arch           i386
>>> os             mingw32
>>> system         i386, mingw32
>>> status
>>> major          2
>>> minor          8.1
>>> year           2008
>>> month          12
>>> day            22
>>> svn rev        47281
>>> language       R
>>> version.string R version 2.8.1 (2008-12-22)
>>>
>>>
>>> Many thanks in advance for your help,
>>>
>>> Francisco
>>>
> 
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlumley at u.washington.edu    University of Washington, Seattle
> 