[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Mon Jan 3 16:54:26 CET 2022

>>>>> Ben Bolker 
>>>>>     on Mon, 27 Dec 2021 09:43:42 -0500 writes:

    >    I agree that it seems non-intuitive (I can't think of a
    > design reason for it to look this way), but I'd like to
    > stress that it's *not* an information leak; the
    > predictions of the model are independent of the
    > parameterization, which is all this issue affects. In a
    > worst case there might be some unfortunate effects on
    > numerical stability if the data-dependent bases are
    > computed on a very different set of data than the model
    > fitting actually uses.
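
A minimal sketch of the point above, on made-up data. The data frame `d`
and the names `fit_sub`, `fit_pre` and `new_x` are illustrative only and
come from neither the thread nor the patch. With subset=, poly() builds
its basis from all rows before the subset is applied, so the coefficients
differ from those of a fit on pre-subsetted data, while the fitted values
and predictions should agree up to rounding:

```r
set.seed(1)

# Hypothetical toy data: a noisy quadratic in x
d <- data.frame(x = 1:20)
d$y <- 1 + 0.5 * d$x - 0.02 * d$x^2 + rnorm(20, sd = 0.3)

# (a) subset= argument: poly() is evaluated on all 20 rows, then rows are dropped
fit_sub <- lm(y ~ poly(x, 2), data = d, subset = x > 10)

# (b) subsetting in advance: poly() only ever sees the 10 retained rows
fit_pre <- lm(y ~ poly(x, 2), data = d[d$x > 10, ])

# The two parameterizations differ, so the coefficients differ
print(coef(fit_sub))
print(coef(fit_pre))

# Both bases span the same column space (1, x, x^2) on the retained rows,
# so fitted values and predictions on new data agree
print(all.equal(unname(fitted(fit_sub)), unname(fitted(fit_pre))))

new_x <- data.frame(x = c(12.5, 25))
print(all.equal(unname(predict(fit_sub, new_x)),
                unname(predict(fit_pre, new_x))))
```

If the coefficients themselves must match, subsetting the data before the
call, as in (b), or using poly(x, 2, raw = TRUE) are two ways to get there.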

    >    I've attached a suggested documentation patch (I hope
    > it makes it through to the list, if not I can add it to
    > the body of a message.)

It did make it through;  thank you, Ben!
(After adding two forgotten '}') I've committed the help file
additions to the R sources (R-devel) in svn r81434.

Thanks again and

       "Happy New Year"

to all readers,


    > On 12/26/21 8:35 PM, Balise, Raymond R wrote:
    >> Hello R folks, Today I noticed that using the subset
    >> argument in lm() with a polynomial gives a different
    >> result than using the polynomial when the data has
    >> already been subsetted. This was not at all intuitive for
    >> me.  You can see an example here:
    >> https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
    >> If this is a design feature that you don’t think should
    >> be fixed, can you please include it in the documentation
    >> and explain why it makes sense to figure out the
    >> orthogonal polynomials on the entire dataset?  This feels
    >> like a serious leak of information when evaluating train
    >> and test datasets in a statistical learning framework.
    >> Ray
    >> Raymond R. Balise, PhD
    >> Assistant Professor
    >> Department of Public Health Sciences, Biostatistics
    >> University of Miami, Miller School of Medicine
    >> 1120 N.W. 14th Street
    >> Don Soffer Clinical Research Center - Room 1061
    >> Miami, Florida 33136
    >> ______________________________________________
    >> R-devel using r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

    > -- 
    > Dr. Benjamin Bolker
    > Professor, Mathematics & Statistics and Biology, McMaster University
    > Director, School of Computational Science and Engineering
    > Graduate chair, Mathematics & Statistics
    > [DELETED ATTACHMENT external: BenB_lm-subset.patch, plain text]
