[R] newdata for predict.lm() ??

Boris Steipe boris.steipe using utoronto.ca
Wed Nov 4 11:11:06 CET 2020


Solved. Thanks Achim and Peter ...

though with that approach we are now relying implicitly on the column names in newdata matching the variable names in the formula. But at least I've got this silly example working now. Thanks for the explanation, Achim.

:-)
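
For the archive, here is a minimal sketch of the version that works, assuming the same DAT values as in the reprex quoted below (written with data.frame() instead of structure() for brevity). The only requirement is that newdata has a column named "h", the same name used in the formula.

```r
# Data from the reprex in this thread
DAT <- data.frame(
  h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 1.755, 2.060, 2.136, 2.126,
        1.792, 1.574, 2.117, 1.741, 2.295, 1.526, 1.666, 1.581, 1.522, 1.995),
  w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429, 87.176, 90.318,
        76.873, 84.183, 57.890, 62.005, 84.258, 78.317, 101.304, 64.982,
        71.237, 77.124, 65.010, 81.413)
)

# Fit with bare variable names; the data frame is passed separately
myFit <- lm(w ~ h, data = DAT)

# newdata needs a column named exactly like the predictor in the formula
myNew <- data.frame(h = seq(min(DAT$h), max(DAT$h), length.out = 50))

# No warning this time: predict() looks up "h" in myNew
pc <- predict(myFit, newdata = myNew, interval = "confidence")
dim(pc)        # 50 rows (one per row of myNew), 3 columns
colnames(pc)   # "fit" "lwr" "upr"
```

The columns of pc can then be plotted against myNew$h to draw the fitted line and its confidence band.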



> On 2020-11-04, at 20:05, Achim Zeileis <Achim.Zeileis using uibk.ac.at> wrote:
> 
> On Wed, 4 Nov 2020, peter dalgaard wrote:
> 
>> Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).
> 
> ...or in any other formula for that matter!
> 
> Let me expand a bit on Peter's comment because this is really a pet peeve
> of mine:
> 
> The idea is that the formula "w ~ h" describes the relationships between
> the variables involved, independent of the data set it should be applied
> to. In contrast, "DAT$w ~ DAT$h" hard-wires the data into the formula and
> prevents you from applying the formula to another data set.
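
A small sketch of that difference, assuming the DAT values from the reprex further down. The names f, hardwired, first and second are made up for illustration only.

```r
# Data from the reprex in this thread
DAT <- data.frame(
  h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 1.755, 2.060, 2.136, 2.126,
        1.792, 1.574, 2.117, 1.741, 2.295, 1.526, 1.666, 1.581, 1.522, 1.995),
  w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429, 87.176, 90.318,
        76.873, 84.183, 57.890, 62.005, 84.258, 78.317, 101.304, 64.982,
        71.237, 77.124, 65.010, 81.413)
)

# A formula with bare names refers only to the variables
f <- w ~ h
all.vars(f)            # "w" "h"

# The $ version drags the data frame itself into the formula
hardwired <- DAT$w ~ DAT$h
all.vars(hardwired)    # now also contains "DAT"

# Two hypothetical subsets standing in for "another data set"
first  <- DAT[1:10, ]
second <- DAT[11:20, ]

# The same formula object can be fitted to either subset
fit1 <- lm(f, data = first)
fit2 <- lm(f, data = second)
nobs(fit1)             # 10
nobs(fit2)             # 10

# With the hard-wired formula, the data argument is effectively ignored:
# DAT$w and DAT$h are still taken from the full DAT
nobs(lm(hardwired, data = first))   # 20
```

The last line shows the same mechanism that produced the warning in predict(): the variables are found in DAT rather than in the data frame that was passed in.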
> 
> Hope that helps,
> Achim
> 
> 
>>> On 4 Nov 2020, at 10:50 , Boris Steipe <boris.steipe using utoronto.ca> wrote:
>>> 
>>> Can't get data from a data frame into predict() without a detour that seems quite unnecessary ...
>>> 
>>> Reprex:
>>> 
>>> # Data frame with simulated data in columns "h" (independent) and "w" (dependent)
>>> DAT <- structure(list(h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118,
>>>                           1.755, 2.060, 2.136, 2.126, 1.792, 1.574,
>>>                           2.117, 1.741, 2.295, 1.526, 1.666, 1.581,
>>>                           1.522, 1.995),
>>>                     w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429,
>>>                           87.176, 90.318, 76.873, 84.183, 57.890, 62.005,
>>>                           84.258, 78.317,101.304, 64.982, 71.237, 77.124,
>>>                           65.010, 81.413)),
>>>                row.names = c( "1",  "2",  "3",  "4",  "5",  "6",  "7",
>>>                               "8",  "9", "10", "11", "12", "13", "14",
>>>                              "15", "16", "17", "18", "19", "20"),
>>>                class = "data.frame")
>>> 
>>> 
>>> myFit <- lm(DAT$w ~ DAT$h)
>>> coef(myFit)
>>> 
>>> # (Intercept)       DAT$h
>>> #   11.76475    35.92002
>>> 
>>> 
>>> # Create 50 x-values with seq() to plot confidence intervals
>>> myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
>>> 
>>> pc <- predict(myFit, newdata = myNew, interval = "confidence")
>>> 
>>> # Warning message:
>>> # 'newdata' had 50 rows but variables found have 20 rows
>>> 
>>> # Problem: predict() was not able to take the single column in myNew
>>> # as the independent variable.
>>> 
>>> # Ugly workaround: but with that everything works as expected.
>>> xx <- DAT$h
>>> yy <- DAT$w
>>> myFit <- lm(yy ~ xx)
>>> coef(myFit)
>>> 
>>> myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
>>> colnames(myNew) <- "xx"  # This fixes it!
>>> 
>>> pc <- predict(myFit, newdata = myNew, interval = "confidence")
>>> str(pc)
>>> 
>>> # So: specifying the column in newdata to have same name as the coefficient
>>> # name should work, right?
>>> # Back to the original ...
>>> 
>>> myFit <- lm(DAT$w ~ DAT$h)
>>> colnames(myNew) <- "`DAT$h`"
>>> # ... same warning
>>> 
>>> colnames(myNew) <- "h"
>>> # ... same warning again.
>>> 
>>> Bottom line: how can I properly specify newdata? The documentation is opaque. It seems the algorithm is trying to EXACTLY match the text of the RHS of the formula, which is unlikely to result in a useful column name, unless I assign to an intermediate variable. There must be a better way ...
>>> 
>>> 
>>> 
>>> Thanks!
>>> Boris
>>> 
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Office: A 4.23
>> Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
>> 


