# [R] newdata for predict.lm() ??

Peter Dalgaard pdalgd at gmail.com
Wed Nov 4 10:56:00 CET 2020

```
Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).

-pd

> On 4 Nov 2020, at 10:50 , Boris Steipe <boris.steipe using utoronto.ca> wrote:
>
> Can't get data from a data frame into predict() without a detour that seems quite unnecessary ...
>
> Reprex:
>
> # Data frame with simulated data in columns "h" (independent) and "w" (dependent)
> DAT <- structure(list(h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118,
>                            1.755, 2.060, 2.136, 2.126, 1.792, 1.574,
>                            2.117, 1.741, 2.295, 1.526, 1.666, 1.581,
>                            1.522, 1.995),
>                      w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429,
>                            87.176, 90.318, 76.873, 84.183, 57.890, 62.005,
>                            84.258, 78.317,101.304, 64.982, 71.237, 77.124,
>                            65.010, 81.413)),
>                 row.names = c( "1",  "2",  "3",  "4",  "5",  "6",  "7",
>                                "8",  "9", "10", "11", "12", "13", "14",
>                               "15", "16", "17", "18", "19", "20"),
>                 class = "data.frame")
>
>
> myFit <- lm(DAT$w ~ DAT$h)
> coef(myFit)
>
> # (Intercept)       DAT$h
> #   11.76475    35.92002
>
>
> # Create 50 x-values with seq() to plot confidence intervals
> myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
>
> pc <- predict(myFit, newdata = myNew, interval = "confidence")
>
> # Warning message:
> # 'newdata' had 50 rows but variables found have 20 rows
>
> # Problem: predict() was not able to take the single column in myNew
> # as the independent variable.
>
> # Ugly workaround: but with that everything works as expected.
> xx <- DAT$h
> yy <- DAT$w
> myFit <- lm(yy ~ xx)
> coef(myFit)
>
> myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
> colnames(myNew) <- "xx"  # This fixes it!
>
> pc <- predict(myFit, newdata = myNew, interval = "confidence")
> str(pc)
>
> # So: specifying the column in newdata to have same name as the coefficient
> # name should work, right?
> # Back to the original ...
>
> myFit <- lm(DAT$w ~ DAT$h)
> colnames(myNew) <- "`DAT$h`"
> # ... same error
>
> colnames(myNew) <- "h"
> # ... same error again.
>
> Bottom line: how can I properly specify newdata? The documentation is opaque. It seems the algorithm is trying to EXACTLY match the text of the RHS of the formula, which is unlikely to result in a useful column name, unless I assign to an intermediate variable. There must be a better way ...
>
>
>
> Thanks!
> Boris
>

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com

```
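
The advice above can be sketched as follows, reusing the DAT values from the quoted post. When the model is fitted with `lm(w ~ h, data = DAT)`, the predictor is recorded under the plain name `h`, so `newdata` only needs a column called `h`. The object names `fit`, `new_h`, and `pc` are illustrative and do not appear in the original thread.

```r
# Simulated data from the quoted post: "h" (independent), "w" (dependent)
DAT <- data.frame(
  h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 1.755, 2.060, 2.136, 2.126,
        1.792, 1.574, 2.117, 1.741, 2.295, 1.526, 1.666, 1.581, 1.522, 1.995),
  w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429, 87.176, 90.318,
        76.873, 84.183, 57.890, 62.005, 84.258, 78.317, 101.304, 64.982,
        71.237, 77.124, 65.010, 81.413)
)

# Fit with bare column names and a data argument, as suggested in the reply
fit <- lm(w ~ h, data = DAT)

# newdata must contain a column named like the predictor in the formula: "h"
new_h <- data.frame(h = seq(min(DAT$h), max(DAT$h), length.out = 50))

# One row per row of new_h, with columns fit, lwr, upr
pc <- predict(fit, newdata = new_h, interval = "confidence")
str(pc)
```

With the `DAT$w ~ DAT$h` form, the formula refers to the object `DAT` itself rather than to a column name. predict() therefore keeps evaluating the original 20 values of `DAT$h` regardless of how the column in `newdata` is named, which is consistent with the warning quoted in the post.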