[R] Increasing number of observations worsen the regression model

Sun May 26 17:09:14 CEST 2019

Raffa, 

I ran your code on a macOS machine and got the result you expected: the estimates are close to the true values of 1 and 2. I added a call to sessionInfo() so you can compare setups.

> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2407 -0.6738 -0.0031  0.6822  4.0619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0059022  0.0057370   175.3   <2e-16 ***
xvar        2.0005811  0.0009918  2017.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9937 on 29998 degrees of freedom
Multiple R-squared:  0.9927,	Adjusted R-squared:  0.9927 
F-statistic: 4.069e+06 on 1 and 29998 DF,  p-value: < 2.2e-16

> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    # add regression line, using `predict` to generate y-values
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.0
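
For anyone comparing machines, here is a minimal sketch of a cross-check that was not part of the session above. It fits the same simulated model twice: once with lm(), which solves the least-squares problem through a QR decomposition in compiled code, and once with the textbook moment formulas for simple regression, which need only cov(), var() and mean(). The seed and the names fit_lm and fit_manual are assumptions made for illustration. If the two sets of estimates disagree on a given machine, that would suggest looking at the numerical libraries listed under BLAS/LAPACK in sessionInfo() rather than at the R code, though that is a guess and not a diagnosis.

```r
# Hypothetical cross-check, not taken from the original session.
set.seed(42)                       # assumed seed, only for reproducibility
N <- 30000
xvar <- runif(N, -10, 10)
yvar <- 1 + 2 * xvar + rnorm(N, mean = 0, sd = 1)

# Estimates from lm(), which uses a QR decomposition internally
fit_lm <- coef(lm(yvar ~ xvar))

# Closed-form simple-regression estimates from sample moments
slope      <- cov(xvar, yvar) / var(xvar)
intercept  <- mean(yvar) - slope * mean(xvar)
fit_manual <- c(intercept, slope)

# Show both sets of estimates side by side
print(rbind(lm = fit_lm, manual = fit_manual))

# Stop with an error if the two methods disagree beyond rounding
stopifnot(isTRUE(all.equal(unname(fit_lm), fit_manual, tolerance = 1e-6)))
```

On a healthy installation both rows should sit close to 1 and 2 and the final check should pass silently.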

R. Mark Sharp, Ph.D.
Data Scientist and Biomedical Statistical Consultant
7526 Meadow Green St.
San Antonio, TX 78251
mobile: 210-218-2868
rmsharp using me.com

> On May 25, 2019, at 7:38 AM, Raffa <raffamaiden using gmail.com> wrote:
> 
> I have the following code:
> 
> ```
> 
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    # add regression line, using `predict` to generate y-values
> 
> ```
> 
> I expected the coefficients to be something similar to [1,2]. Instead, R 
> keeps returning seemingly random estimates that are not statistically 
> significant and don't fit the model, even though I have 20k observations. 
> For example:
> 
> ```
> 
> Call:
> lm(formula = yvar ~ xvar)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max
> -21.384  -8.908   1.016  10.972  23.663
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
> 
> Residual standard error: 11.61 on 29998 degrees of freedom
> Multiple R-squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
> 
> ```
> 
> 
> The strange thing is that the code works perfectly for N=200 or N=2000. 
> It's only for larger N that this happens (for example, N=20000). I 
> have also asked on CrossValidated 
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model> 
> but the code works for them. Any help?
> 
> I am running R 3.6.0 on Kubuntu 19.04
> 
> Best regards
> 
> Raffaele