[Rd] model.weights and model.offset: request for adjustment

Thu Feb 3 18:09:29 CET 2022

>>>>> Martin Maechler 
>>>>>     on Thu, 3 Feb 2022 15:21:45 +0100 writes:

>>>>> tim taylor 
>>>>>     on Thu, 3 Feb 2022 11:30:17 +0000 (GMT) writes:

    >>> On 03/02/2022 11:14 Martin Maechler <maechler using stat.math.ethz.ch> wrote:
    >>> 
    >>> 
    >>> >>>>> Ben Bolker 
    >>> >>>>>     on Tue, 1 Feb 2022 21:21:46 -0500 writes:
    >>> 
    >>> > The model.weights() and model.offset() functions from the 'stats' 
    >>> > package index possibly-missing elements of a data frame via $, e.g.
    >>> 
    >>> > x$"(offset)"
    >>> > x$"(weights)"
    >>> 
    >>> > This returns NULL without comment when x is a data frame:
    >>> 
    >>> > x <- data.frame(a=1)
    >>> > x$"(offset)"  ## NULL
    >>> > x$"(weights)"  ## NULL
    >>> 
    >>> > However, when x is a tibble we get a warning as well:
    >>> 
    >>> > x <- tibble::as_tibble(x)
    >>> > x$"(offset)"
    >>> > ## NULL
    >>> > ## Warning message:
    >>> > ## Unknown or uninitialised column: `(offset)`.
    >>> 
    >>> > I know it's not R-core's responsibility to manage forward 
    >>> > compatibility with tibbles, but in this case [[-indexing would seem to 
    >>> > be better practice in any case.
    >>> 
    >>> Yes, I would agree:  we should use  [[ instead of $ here
    >>> in order to force exact matching just as principle
    >>> 
    >>> Importantly, because  also  mf[["(weights)"]]
    >>> will return  NULL without a warning for a model/data frame, and
    >>> it seems it does so also for tibbles.
    >>> 
    >>> > Might a patch be accepted ... ?
    >>> 
    >>> That would not be necessary.
    >>> 
    >>> There's one remaining problem however:
    >>> `$` access is clearly faster than `[[` for small data frames
    >>> (because `$` is a primitive function doing everything in C, 
    >>> whereas `[[` calls the R level data frame method ).
    >>> 
    >>> Faster in both cases, i.e., when there *is* a column and when there
    >>> is none (and NULL is returned), e.g., for the first case
    >>> 
    >>> > system.time(for(i in 1:20000) df[["a"]])
    >>> user  system elapsed 
    >>> 0.064   0.000   0.065 
    >>> > system.time(for(i in 1:20000) df$a)
    >>> user  system elapsed 
    >>> 0.009   0.000   0.009 
    >>> 
    >>> So that's probably been the reason why  `$`  has been prefered?

    >> Would .subset2(df, "a) be preferable?

    R> df <- mtcars
    R> system.time(for(i in 1:20000) df[["hp"]])
    >> user  system elapsed 
    >> 0.078   0.000   0.078 
    R> system.time(for(i in 1:20000) df$hp)
    >> user  system elapsed 
    >> 0.011   0.000   0.010 
    R> system.time(for(i in 1:20000) .subset2(df,"hp"))
    >> user  system elapsed 
    >> 0.004   0.000   0.004 

    >> Tim

    > Yes, I think that's a very good idea --

    > notably, as interestingly it seems to work with tibble's very
    > well, too.

Interestingly (or not), changing this also fixes a real (rare!) bug:
When digging for a regression test, I've stumbled over an lm() example,
which when modified to use the not so common  "(weight)_2" as
*predictor* variable name it started to use that both as
predictor and also as weight (of some kind) such that the fit
changed.

This problem went away after apply the change,
[replacing `a$b` with `.subset2(a,b)]

Now committed to R-devel, svn rev 81650.

If there are no negative effects, this may also be backported to
R-patched.

Thank you both, once more!
Martin