[Rd] rnorm is not truly random used in the lm function

Thu Aug 3 18:11:33 CEST 2017

>>>>> Victor Tian <tianxu03 at gmail.com>
>>>>>     on Thu, 3 Aug 2017 09:49:57 -0400 writes:

    > To whom it may concern,
    > I happened to run the following R code just to check the layout of the
    > output, but found that the code doesn't work the way I thought it should
    > work.

yes, your expectations were wrong.

    >> lm(rnorm(100) ~ rnorm(100))

    > Call:
    > lm(formula = rnorm(100) ~ rnorm(100))

    > Coefficients:
    > (Intercept)
    > -0.07966

    > Warning messages:
    > 1: In model.matrix.default(mt, mf, contrasts) :
    > the response appeared on the right-hand side and was dropped
    > 2: In model.matrix.default(mt, mf, contrasts) :
    > problem with term 1 in model.matrix: no columns are assigned

    > It appears that rnorm(100) produces the same array of numbers on both sides
    > of the ~ sign.

Indeed.  And all this has nothing to do with lm()  but rather with
how formulas in R have been treated probably "forever".
[I assume not only in R, but rather since the time formulas
 were introduced into the S language (for "S version 3") a few
 years before R was born. But I can no longer verify or disprove
 this assumption.] 

Even more revealing may be this:

> f <- rnorm(9) ~ rnorm(9)
> str(f)
Class 'formula'  language rnorm(9) ~ rnorm(9)
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
> (mm <- model.matrix(f))
  (Intercept)
1           1
2           1
3           1
4           1
5           1
6           1
7           1
8           1
9           1
attr(,"assign")
[1] 0
Warning messages:
1: In model.matrix.default(f) :
  the response appeared on the right-hand side and was dropped
2: In model.matrix.default(f) :
  problem with term 1 in model.matrix: no columns are assigned
> 
---------

BTW: One of the goals of formulas,  notably in R since they got an
environment attached, is a clean way to deal with non-standard
evaluation (=: NSE).
[ Some of us would claim it is the only clean way to deal with NSE in R,
  and all new functionality using NSE should use formulas,
  but recently tidyverse-scholars have claimed to be able to deal
  with it cleanly w/o the use of formulas, but via "tidy evaluation" ]
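
A minimal sketch of what "an environment attached" means in practice. The
function name make_formula and the variables x and y are made up for
illustration; the point is only that a formula remembers where it was
created, so its variables can be looked up there later.

```r
# Hypothetical helper: builds a formula inside a function body
make_formula <- function() {
  x <- 1:3          # local variable, not visible from the global environment
  y ~ x             # the formula captures this function's environment
}

f <- make_formula()

# The attached environment is the function's evaluation frame,
# not the global environment
env <- environment(f)
identical(env, globalenv())

# The local 'x' can still be found through the formula's environment
x_from_formula <- get("x", envir = env)
x_from_formula
```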

Using random expressions in a formula is therefore typically not
a good idea, because you don't really know when the terms in the
formula will be evaluated.
For lm() and all other good formula-based statistical modeling
functions, the evaluation happens via model.matrix().

As you've noticed from that warning, model.matrix() tries to
help the user by checking terms and eliminating those that
appear on both sides of the '~'.
This has been documented on the help page [ ?model.matrix ] for
almost exactly 14 years, the "Details:" section ending with

 _> By convention, if the response variable also appears on the
 _> right-hand side of the formula it is dropped (with a warning),
 _> although interactions involving the term are retained.
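
A small sketch of how this can be inspected without fitting anything,
assuming (as the warning suggests) that the response and the right-hand-side
term are compared as unevaluated expressions. Nothing random is drawn here;
only the formula object and its terms are examined.

```r
f <- rnorm(9) ~ rnorm(9)

# Left- and right-hand side are the very same unevaluated call
lhs <- f[[2]]
rhs <- f[[3]]
same_expr <- identical(lhs, rhs)
same_expr

# terms() records a response and one term with the same label
tt <- terms(f)
resp_index <- attr(tt, "response")
labels <- attr(tt, "term.labels")
resp_index
labels
```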

I hope this explains the issue.
And yes:  Do *not* use rnorm() in formulas.
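
One way to follow this advice, as a sketch: draw the random numbers first,
store them under distinct names, and only then write the formula. The names
x, y and fit and the seed are arbitrary choices for illustration.

```r
set.seed(1)        # arbitrary seed, only to make the draw reproducible

# Evaluate the random expressions once, outside the formula
y <- rnorm(100)
x <- rnorm(100)

# The formula now refers to two different, already-evaluated variables
fit <- lm(y ~ x)
coef_names <- names(coef(fit))
coef_names
```

With two distinct variable names, the right-hand-side term no longer matches
the response, so both an intercept and a slope are estimated.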

Martin

--
Martin Mächler 
Seminar für Statistik, ETH Zürich //  R Core Team