[R] Systematically biased count data regression model

Gabor Grothendieck ggrothendieck at gmail.com
Fri Aug 10 03:05:23 CEST 2007


Perhaps you don't really need to predict the precise count.
Maybe its good enough to predict whether the count is above
or below average.  In that case the model is 74% correct on a
holdout sample of the last 54 points based on a model of the
first 200 points.

> # create model on first 200 and predict on rest
> DD <- data$D > mean(data$D)
> mod <- glm(DD ~., data[-1], family = binomial, subset = 1:200)
> tab <- table(predict(mod, data[201:254,-1], type = "resp") > .5, DD[201:254])
> sum(tab * diag(2)) / sum(tab)
[1] 0.7407407


On 8/9/07, Matthew and Kim Bowser <matthewnkim at gmail.com> wrote:
> Dear all,
>
> I am attempting to explain patterns of arthropod family richness
> (count data) using a regression model.  It seems to be able to do a
> pretty good job as an explanatory model (i.e. demonstrating
> relationships between dependent and independent variables), but it has
> systematic problems as a predictive model:  It is biased high at low
> observed values of family richness and biased low at high observed
> values of family richness (see attached pdf).  I have tried diverse
> kinds of reasonable regression models mostly as in Zeileis, et al.
> (2007), as well as transforming my variables, both with only small
> improvements.
>
> Do you have suggestions for making a model that would perform better
> as a predictive model?
>
> Thank you for your time.
>
> Sincerely,
>
> Matthew Bowser
>
> STEP student
> USFWS Kenai National Wildlife Refuge
> Soldotna, Alaska, USA
>
> M.Sc. student
> University of Alaska Fairbanks
> Fairbankse, Alaska, USA
>
> Reference
>
> Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
> count data in R. Technical Report 53, Department of Statistics and
> Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
> http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
>
> Code
>
> `data` <-
> structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
> 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
> 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
> 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
> 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
> 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
> 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
> 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
> 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
> 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
> 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
> 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
> 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
> 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
> 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
> 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
> 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
> 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
> 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
> 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
> 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
> 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
> 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
> 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
> 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
> 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
> 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
> 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
> 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
> 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
> 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
> 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
> 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
> 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
> 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
> 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
> 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
> 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
> 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
> 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
> 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
> 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
> 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
> 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
> 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
> 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
> 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
> 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
> 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
> 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
> 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
> 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
> 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
> 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
> 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
> 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
> -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
> 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
> -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
> 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
> -0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
> 0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
> 11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
> 14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
> 20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
> 10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
> 24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
> 8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
> 14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
> 16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
> 19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
> 11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
> 18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
> 10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
> 28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
> 15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
> 14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
> 35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
> 6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
> 16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
> 11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
> 14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
> 11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
> 8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
> 7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
> 9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
> 13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
> 11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
> 6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
> 8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
> 13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
> 2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
> 6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
> 1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
> 10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
> 10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
> 3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
> 11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
> 11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
> 7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
>   Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
>   1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
> "VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
> 254)), class = "data.frame")
>
>
> #Running a regression.
> library(MASS)
> fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data = data)
> summary(fit, correlation = FALSE)
>
> Call:
> glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
>   Snow, data = data, init.theta = 11.3494468596771, link = log)
>
> Deviance Residuals:
>   Min       1Q   Median       3Q      Max
> -3.7451  -0.7196  -0.1958   0.5389   2.7096
>
> Coefficients:
>            Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.882684   0.929598  -3.101 0.001929 **
> Day          0.020325   0.005540   3.669 0.000244 ***
> NDVI         1.353361   0.221471   6.111 9.91e-10 ***
> VegS         0.016731   0.004931   3.393 0.000691 ***
> T            0.074189   0.009491   7.817 5.42e-15 ***
> Hemlock     -0.588858   0.174980  -3.365 0.000765 ***
> Alpine      -0.452199   0.099296  -4.554 5.26e-06 ***
> Snow        -1.902610   0.735708  -2.586 0.009707 **
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
>
>   Null deviance: 515.16  on 253  degrees of freedom
> Residual deviance: 278.89  on 246  degrees of freedom
> AIC: 1300.1
>
> Number of Fisher Scoring iterations: 1
>
>
>             Theta:  11.35
>         Std. Err.:  2.71
>
>  2 x log-likelihood:  -1282.075
>
>
> #Plotting observed versus predicted values.
> pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
> par(mar = c(5,5,1,1), pch=1)
> plot(data$D, fit$fitted.values, main="",
> ylab=expression(italic(D)[predicted]),
> xlab=expression(italic(D)[observed]))
> abline(a=0,b=1, lty=2)
> lines(lowess(data$D, fit$fitted.values))
> dev.off()
>
>
> #This appears to be a decent explanatory model, but as a predictive
> model it is systematically biased.  It is biased high at low observed
> values of D and biased low at high values observed values of D.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list