[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

Ben Bolker bbo|ker @end|ng |rom gm@||@com
Mon Dec 27 15:43:42 CET 2021


   I agree that it seems non-intuitive (I can't think of a design reason 
for it to work this way), but I'd like to stress that it's *not* an 
information leak; the predictions of the model are independent of the 
parameterization, which is all this issue affects. Whichever set of x 
values poly() uses to build its orthogonal basis, the basis columns plus 
the intercept span the same polynomial space, so the fitted values are 
unchanged and only the coefficients differ. In the worst case there 
might be some unfortunate effects on numerical stability if the 
data-dependent bases are computed on a very different set of data than 
the model fitting actually uses.
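
   A minimal sketch of the point above, using simulated data (the data 
frame dd, the subsetting rule, and all object names are made up for 
illustration; this is not the example from the Stack Overflow post). It 
fits the same quadratic both ways and compares coefficients, fitted 
values, and predictions at new x values:

```r
# Simulated data; variable names and the subset rule are illustrative only.
set.seed(101)
dd <- data.frame(x = seq(0, 10, length.out = 50))
dd$y <- 1 + 0.5 * dd$x - 0.1 * dd$x^2 + rnorm(50, sd = 0.3)
keep <- dd$x < 6

# (1) subset= argument: poly() builds its basis from all 50 x values,
#     and the rows are dropped afterwards, before the fit.
m_arg <- lm(y ~ poly(x, 2), data = dd, subset = keep)

# (2) subsetting in advance: poly() only ever sees the retained rows.
m_pre <- lm(y ~ poly(x, 2), data = dd[keep, ])

# Compare the two parameterizations (coefficients) ...
coef_equal <- isTRUE(all.equal(unname(coef(m_arg)), unname(coef(m_pre))))

# ... and the two sets of fitted values and out-of-sample predictions.
fit_equal <- isTRUE(all.equal(unname(fitted(m_arg)), unname(fitted(m_pre))))
nd <- data.frame(x = c(6.5, 8, 9.5))
pred_equal <- isTRUE(all.equal(unname(predict(m_arg, nd)),
                               unname(predict(m_pre, nd))))

print(c(coef_equal = coef_equal, fit_equal = fit_equal,
        pred_equal = pred_equal))
```

   The coefficients should differ between the two fits while the fitted 
values and predictions agree up to floating-point tolerance, since both 
models describe the same quadratic in x.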

   I've attached a suggested documentation patch (I hope it makes it 
through to the list; if not, I can add it to the body of a message).



On 12/26/21 8:35 PM, Balise, Raymond R wrote:
> Hello R folks,
> Today I noticed that using the subset argument in lm() with a polynomial gives a different result than using the polynomial when the data has already been subsetted. This was not at all intuitive for me. You can see an example here: https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
> 
> If this is a design feature that you don’t think should be fixed, can you please include it in the documentation and explain why it makes sense to compute the orthogonal polynomials on the entire dataset? This feels like a serious leak of information when evaluating train and test datasets in a statistical learning framework.
> 
> Ray
> 
> Raymond R. Balise, PhD
> Assistant  Professor
> Department of Public Health Sciences, Biostatistics
> 
> University of Miami, Miller School of Medicine
> 1120 N.W. 14th Street
> Don Soffer Clinical Research Center - Room 1061
> Miami, Florida 33136
> 

-- 
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: subset_patch.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20211227/47cc2f5a/attachment.txt>

