[R] bigglm() results different from glm()

Wed Mar 18 08:30:30 CET 2009

There's a digits= argument to the print method.

> a <- bigglm(ff,data=trees, chunksize=10, sandwich=TRUE)
> print(summary(a),digits=5)
Large data regression model: bigglm(ff, data = trees, chunksize = 10, sandwich = TRUE)
Sample size =  31
                 Coef     (95%      CI)      SE p
(Intercept) -6.63162 -8.08729 -5.17595 0.72783 0
log(Girth)   1.98265  1.87132  2.09398 0.05567 0
log(Height)  1.11712  0.73339  1.50085 0.19186 0
Sandwich (model-robust) standard errors

I suppose I should make it take its default from options(digits)-3 or something.

      -thomas

On Tue, 17 Mar 2009, Francisco J. Zagmutt wrote:

> Dear Thomas and John,
>
> Thanks for your prompt reply and for looking at your code to explain these 
> differences. I see you replied very late at night, so I am sorry if my little 
> problem kept you awake!
>
> As you pointed out, the model indeed converges (as reported in model$converged) 
> once I specify a larger number of iterations.
>
> A very minor comment: it seems that the reporting of the estimates in 
> summary.biglm() is not taking the parameters from options(digits).  For 
> example, using the same data and models as before:
>
>> require(biglm)
>> options(digits=8)
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"), maxit=20)
>
>> summary(m1)
> <Snipped>
> Coefficients:
>             Estimate Std. Error z value  Pr(>|z|)
> (Intercept) 2.3019509  0.0014147 1627.21 < 2.2e-16 ***
> ttment2     0.4052002  0.0018264  221.86 < 2.2e-16 ***
>
>> summary(m1big)
>             Coef  (95%   CI)    SE p
> (Intercept) 2.302 2.299 2.305 0.001 0
> ttment2     0.405 0.402 0.409 0.002 0
>
>
> To get more digits I can extract the point estimates using coef(m1big), but 
> after looking at str(m1big) the only way I could figure to extract the p-values 
> was:
>
>> summary(m1big)$mat[,"p"]
> (Intercept)     ttment2
>          0           0
>
> Is there a way I can get summary.biglm() to report more digits directly?
>
> Thanks again,
>
> Francisco
>
>
>
> -- 
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com
>
>
> Thomas Lumley wrote:
>> 
>> Yes, the slow convergence is easier to get with the log link.  Overshooting 
>> the correct coefficient vector has more dramatic effects on the fitted 
>> values and weights, and in this example the starting value of (0,0) is a 
>> long way from the truth.   The same sort of thing happens in the Cox model, 
>> where there are real data sets that will cause numeric overflow in a 
>> careless implementation.
>> 
>> It might be worth trying to guess better starting values: saving an 
>> iteration or two would be useful with large data sets.
>> 
>>      -thomas
>> 
>> 
>> On Tue, 17 Mar 2009, John Fox wrote:
>> 
>>> Dear Francisco,
>>> 
>>> I was able to duplicate the problem that you reported, and in addition
>>> discovered that the problem seems to be peculiar to the poisson family.
>>> lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
>>> results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
>>> Another example, with the binomial family:
>>> 
>>>> set.seed(12345)
>>>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>> ttment=gl(2,50000))
>>>> dat$y0 <- ifelse(dat$y > 12, 1, 0)
>>>> m1b <- glm(y0~ttment, data=dat, family=binomial)
>>>> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
>>>> coef(m1b)
>>> (Intercept)     ttment2
>>>   -1.33508     2.34263
>>>> coef(m1bigb)
>>> (Intercept)     ttment2
>>>   -1.33508     2.34263
>>>> deviance(m1b)
>>> [1] 109244
>>>> deviance(m1bigb)
>>> [1] 109244
>>> 
>>> I'm copying this message to Thomas Lumley, who's the author and maintainer
>>> of the biglm package.
>>> 
>>> Regards,
>>> John
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On
>>>> Behalf Of Francisco J. Zagmutt
>>>> Sent: March-16-09 10:26 PM
>>>> To: r-help at stat.math.ethz.ch
>>>> Subject: [R] bigglm() results different from glm()
>>>> 
>>>> Dear all,
>>>> 
>>>> I am using the bigglm package to fit a few GLM's to a large dataset (3
>>>> million rows, 6 columns).  While trying to fit a Poisson GLM I noticed
>>>> that the coefficient estimates were very different from what I obtained
>>>> when estimating the model on a smaller dataset using glm(), I wrote a
>>>> very basic toy example to compare the results of bigglm() against a
>>>> glm() call.  Consider the following code:
>>>> 
>>>> 
>>>> > require(biglm)
>>>> > options(digits=6, scipen=3, contrasts = c("contr.treatment",
>>>> "contr.poly"))
>>>> > dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>>> ttment=gl(2,50000))
>>>> > m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>>>> > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>>>> > summary(m1)
>>>> 
>>>> <snipped output for this email>
>>>> Coefficients:
>>>>              Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept)  2.30305    0.00141    1629   <2e-16 ***
>>>> ttment2      0.40429    0.00183     221   <2e-16 ***
>>>> 
>>>>      Null deviance: 151889  on 99999  degrees of freedom
>>>> Residual deviance: 101848  on 99998  degrees of freedom
>>>> AIC: 533152
>>>> 
>>>> > summary(m1big)
>>>> Large data regression model: bigglm(y ~ ttment, data = dat, family =
>>>> poisson(link = "log"))
>>>> Sample size =  100000
>>>>               Coef  (95%   CI)    SE p
>>>> (Intercept) 2.651 2.650 2.653 0.001 0
>>>> ttment2     4.346 4.344 4.348 0.001 0
>>>> 
>>>> > m1big$deviance
>>>> [1] 287158986
>>>> 
>>>> 
>>>> Notice that the coefficients and deviance are quite different in the
>>>> model estimated using bigglm(). If I change the chunk to
>>>> seq(1000,10000,1000) the estimates remain the same.
>>>> 
>>>> Can someone help me understand what is causing these differences?
>>>> 
>>>> Here is my version info:
>>>> 
>>>> > version
>>>>                 _
>>>> platform       i386-pc-mingw32
>>>> arch           i386
>>>> os             mingw32
>>>> system         i386, mingw32
>>>> status
>>>> major          2
>>>> minor          8.1
>>>> year           2008
>>>> month          12
>>>> day            22
>>>> svn rev        47281
>>>> language       R
>>>> version.string R version 2.8.1 (2008-12-22)
>>>> 
>>>> 
>>>> Many thanks in advance for your help,
>>>> 
>>>> Francisco
>>>> 
>>>> -- Francisco J. Zagmutt
>>>> Vose Consulting
>>>> 2891 20th Street
>>>> Boulder, CO, 80304
>>>> USA
>>>> www.voseconsulting.com
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> Thomas Lumley            Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu    University of Washington, Seattle
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle