[R] bigglm() results different from glm()

Thomas Lumley tlumley at u.washington.edu
Tue Mar 17 08:50:20 CET 2009


This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct.

Your model hasn't converged yet.  You can get the right answer either by running longer:

> summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
     chunksize = 100000, maxit = 20)
Sample size =  100000
              Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or supplying starting values:

> summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
     chunksize = 100000, start = c(2, 0))
Sample size =  100000
              Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0


The bug is that you weren't told about the lack of convergence.  There is a flag in the object, but it is only set after the model is converged, so it is not there when convergence fails.

> m1big$converged
NULL
> m1big_longer$converged
[1] TRUE
> m1big_started$converged
[1] TRUE

For the next version I will make sure there is a clear warning when the model hasn't converged.  The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations.  This strategy obviously breaks down when you don't notice that failure. :(

      -thomas


On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:

> Dear all,
>
> I am using the bigglm package to fit a few GLM's to a large dataset (3 million 
> rows, 6 columns).  While trying to fit a Poisson GLM I noticed that the 
> coefficient estimates were very different from what I obtained when estimating 
> the model on a smaller dataset using glm(), I wrote a very basic toy example to 
> compare the results of bigglm() against a glm() call.  Consider the following 
> code:
>
>
>> require(biglm)
>> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>> summary(m1)
>
> <snipped output for this email>
> Coefficients:
>            Estimate Std. Error z value Pr(>|z|)
> (Intercept)  2.30305    0.00141    1629   <2e-16 ***
> ttment2      0.40429    0.00183     221   <2e-16 ***
>
>    Null deviance: 151889  on 99999  degrees of freedom
> Residual deviance: 101848  on 99998  degrees of freedom
> AIC: 533152
>
>> summary(m1big)
> Large data regression model: bigglm(y ~ ttment, data = dat, family = 
> poisson(link = "log"))
> Sample size =  100000
>             Coef  (95%   CI)    SE p
> (Intercept) 2.651 2.650 2.653 0.001 0
> ttment2     4.346 4.344 4.348 0.001 0
>
>> m1big$deviance
> [1] 287158986
>
>
> Notice that the coefficients and deviance are quite different in the model 
> estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the 
> estimates remain the same.
>
> Can someone help me understand what is causing these differences?
>
> Here is my version info:
>
>> version
>               _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          8.1
> year           2008
> month          12
> day            22
> svn rev        47281
> language       R
> version.string R version 2.8.1 (2008-12-22)
>
>
> Many thanks in advance for your help,
>
> Francisco
>
> -- 
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle




More information about the R-help mailing list