[R] Problem with lm Giving Wrong Results

Labone, Thomas labone using email.sc.edu
Fri Dec 3 17:37:29 CET 2021


Two of the machines with the problem are AVX-512 capable (e.g., i7-7820X), but another is an old Samsung Series 5 with an i5-3317U, which has no AVX-512 support, so AVX-512 alone does not seem to explain it. I guess I will start with the folks at Linux Mint.
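
For anyone who wants to test a machine without relying on the plots, here is a minimal sketch (not something that was run in this thread) that compares the lm() coefficients with the closed-form simple-regression estimates for several values of n, including both sides of the 4095/4096 boundary. The helper name check_lm is made up for illustration, and I am assuming that cov(), var() and mean() do not go through the BLAS/LAPACK library, so they give an independent answer. Because the seed is reset for every n, the simulated data will not match the original script exactly.

```r
# Compare lm() with closed-form simple-regression estimates for a given n.
# check_lm is a hypothetical helper name, not part of any package.
check_lm <- function(n, seed = 132) {
  set.seed(seed)
  Z <- qnorm(ppoints(n))
  k <- sort(rlnorm(n, log(2131), log(1.61)) / rlnorm(n, log(355), log(1.61)))
  y <- log(k)

  fit <- lm(y ~ Z)

  # Closed-form estimates: slope = cov(Z, y) / var(Z),
  # intercept = mean(y) - slope * mean(Z).
  # Assumption: these functions do not call the BLAS/LAPACK library.
  slope <- cov(Z, y) / var(Z)
  intercept <- mean(y) - slope * mean(Z)

  # Largest absolute disagreement between the two sets of coefficients
  max(abs(unname(coef(fit)) - c(intercept, slope)))
}

sizes <- c(1000, 4095, 4096, 10000)
diffs <- vapply(sizes, check_lm, numeric(1))
print(data.frame(n = sizes, max_abs_diff = diffs))
```

On a machine where lm() behaves, every difference should be close to machine precision; a large value at n >= 4096 would point at the same failure described in the original post.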

Tom


Thomas R. LaBone
PhD student
Department of Epidemiology and Biostatistics
Arnold School of Public Health
University of South Carolina
Columbia, South Carolina USA



________________________________
From: Sarah Goslee <sarah.goslee using gmail.com>
Sent: Friday, December 3, 2021 11:00 AM
To: Labone, Thomas <labone using email.sc.edu>
Cc: Bill Dunlap <williamwdunlap using gmail.com>; r-help using r-project.org <r-help using r-project.org>
Subject: Re: [R] Problem with lm Giving Wrong Results

It might also be a BLAS+processor problem - I got bit pretty hard by
that, with an example here:

https://stat.ethz.ch/pipermail/r-help/2019-July/463477.html

With a key excerpt here:

On Thu, Jul 18, 2019 at 1:59 PM Ivan Krylov <krylov.r00t using gmail.com> wrote:
> Yes, this might be bad. I have heard about OpenBLAS (specifically, the
> matrix product routine) misbehaving on certain AVX-512 capable
> processors, so much that they had to disable some optimizations in
> 0.3.6 [*], which you already have installed. Still, would `env
> OPENBLAS_CORETYPE=Haswell R --vanilla` give a better result?
>
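
Before trying a workaround like the one quoted above, it may help to confirm from inside R which libraries the session is actually linked against. This is only a sketch: it reports what R sees and does not change anything. Note that OPENBLAS_CORETYPE would only matter if OpenBLAS is the library in use, and it has to be set before R starts.

```r
# Report which BLAS and LAPACK shared libraries this R session is using.
si <- sessionInfo()
blas_path <- si$BLAS             # path to the BLAS library, as shown by sessionInfo()
lapack_path <- La_library()      # path to the LAPACK library
lapack_version <- La_version()   # LAPACK version string

cat("BLAS:   ", blas_path, "\n")
cat("LAPACK: ", lapack_path, "\n")
cat("LAPACK version: ", lapack_version, "\n")

# An empty string here means the environment variable is not set.
cat("OPENBLAS_CORETYPE: ", Sys.getenv("OPENBLAS_CORETYPE"), "\n")
```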

On Fri, Dec 3, 2021 at 10:29 AM Labone, Thomas <labone using email.sc.edu> wrote:
>
> Thanks for the feedback everyone. If you go to https://github.com/csantill/RPerformanceWBLAS/blob/master/RPerformanceBLAS.md you will find the Linux commands to change the default math library. When I switch the BLAS library from MKL to the system default (see sessionInfo below), everything works as expected. I installed version 2020.0-166-1 of "Intel-MKL" from the Linux Mint Software Manager. I may be coming to a hasty conclusion, but there appears to be something wrong with that package or how it interacts with other system software. Any suggestions on who I should notify about the problem (e.g., Intel, Mint, Ubuntu)?
>
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 20.2
>
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.1.2 tools_4.1.2
>
>
>
> Thomas R. LaBone
> PhD student
> Department of Epidemiology and Biostatistics
> Arnold School of Public Health
> University of South Carolina
> Columbia, South Carolina USA
>
>
>
> ________________________________
> From: Labone, Thomas <labone using email.sc.edu>
> Sent: Thursday, December 2, 2021 11:53 AM
> To: Bill Dunlap <williamwdunlap using gmail.com>
> Cc: r-help using r-project.org <r-help using r-project.org>
> Subject: Re: [R] Problem with lm Giving Wrong Results
>
> > summary(fit)
>
> Call:
> lm(formula = log(k) ~ Z)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -21.241   1.327   1.776   2.245   4.418
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept) -0.03465    0.01916  -1.809   0.0705 .
> Z           -0.24207    0.01916 -12.634   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 1.914 on 9998 degrees of freedom
> Multiple R-squared:  0.01467, Adjusted R-squared:  0.01457
> F-statistic: 148.8 on 1 and 9998 DF,  p-value: < 2.2e-16
>
> > summary(k)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.2735  3.7658  5.9052  7.5113  9.4399 82.9531
> > summary(Z)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -3.8906 -0.6744  0.0000  0.0000  0.6744  3.8906
> > summary(gm*gsd^Z)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.3767  0.8204  0.9659  0.9947  1.1372  2.4772
> >
>
>
> Thomas R. LaBone
> PhD student
> Department of Epidemiology and Biostatistics
> Arnold School of Public Health
> University of South Carolina
> Columbia, South Carolina USA
>
>
> ________________________________
> From: Bill Dunlap <williamwdunlap using gmail.com>
> Sent: Thursday, December 2, 2021 10:31 AM
> To: Labone, Thomas <labone using email.sc.edu>
> Cc: r-help using r-project.org <r-help using r-project.org>
> Subject: Re: [R] Problem with lm Giving Wrong Results
>
> On the 'bad' machines, what did you get for
>    summary(fit)
>    summary(k)
>    summary(Z)
>    summary(gm*gsd^Z)
> ?
>
> -Bill
>
> On Thu, Dec 2, 2021 at 6:18 AM Labone, Thomas <labone using email.sc.edu> wrote:
> In the code below the first and second plots should look pretty much the same, the only difference being that the first has n=1000 points and the second n=10000 points. On two of my Linux machines (info below) the second plot is a horizontal line (incorrect answer from lm), but on my Windows 10 machine and a third Linux machine it works as expected. The interesting thing is that the code works as expected for n <= 4095 but fails for n>=4096 (which equals 2^12). Can anyone else reproduce this problem? Any ideas on how to fix it?
>
> set.seed(132)
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> # This works
> n <- 1000  # OK for n <= 4095
> Z <- qnorm(ppoints(n))
>
> k <- sort(rlnorm(n,log(2131),log(1.61)) / rlnorm(n,log(355),log(1.61)))
>
> quantile(k,probs=c(0.025,0.5,0.975))
> summary(k)
>
> fit <- lm(log(k) ~ Z)
> summary(fit)
>
> gm <- exp(coef(fit)[1])
> gsd <- exp(coef(fit)[2])
> gm
> gsd
>
> plot(Z,k,log="y",xlim=c(-4,4),ylim=c(0.1,100))
> lines(Z,gm*gsd^Z,col="red")
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> # This does not
> n <- 10000  # fails for n >= 4096 = 2^12
> Z <- qnorm(ppoints(n))
>
> k <- sort(rlnorm(n,log(2131),log(1.61)) / rlnorm(n,log(355),log(1.61)))
>
> quantile(k,probs=c(0.025,0.5,0.975))
> summary(k)
>
> fit <- lm(log(k) ~ Z)
> summary(fit)
>
> gm <- exp(coef(fit)[1])
> gsd <- exp(coef(fit)[2])
> gm
> gsd
>
> plot(Z,k,log="y",xlim=c(-4,4),ylim=c(0.1,100))
> lines(Z,gm*gsd^Z,col="red")
>
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > sessionInfo() #for two Linux machines having problem
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 20.2
>
> Matrix products: default
> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
>  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.1.2  Matrix_1.3-4    tools_4.1.2     expm_0.999-6    grid_4.1.2      lattice_0.20-45
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > sessionInfo() # for a third Linux machine not having the problem
> R version 4.1.1 (2021-08-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 19.3
>
> Matrix products: default
> BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.1.1 tools_4.1.1
>
>
>
> Thomas R. LaBone
> PhD student
> Department of Epidemiology and Biostatistics
> Arnold School of Public Health
> University of South Carolina
> Columbia, South Carolina USA
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Sarah Goslee (she/her)
http://www.numberwright.com/
