[R] Fundamental formula and dataframe question.

Mon May 12 18:03:18 CEST 2008

I would have thought that:

> lm( C1 ~ M^2, data=DF )

would give the main effects and two-way interaction(s), but a quick test did not match my expectation.  Possibly a feature request is in order if people plan to use this a lot.
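
A likely explanation for the mismatch, sketched with made-up data: in a formula, ^2 crosses the terms inside the parentheses, and a matrix M counts as a single term, so M^2 has nothing to cross and reduces to M. Crossing the individual columns does give the interaction. The seed, sample size and coefficients below are assumptions for illustration only; the column names M.M1 and M.M2 follow Ted's example.

```r
# Toy data in the shape of Ted's example (values are invented for illustration).
set.seed(1)
n  <- 20
M  <- cbind(M1 = rnorm(n), M2 = rnorm(n))
C1 <- 1 + M[, "M1"] - 0.5 * M[, "M2"] + rnorm(n, sd = 0.1)
DF <- data.frame(C1 = C1, M = M)   # columns: C1, M.M1, M.M2

# With a single term, ^2 has nothing to cross: M^2 reduces to M.
labels_sq <- attr(terms(C1 ~ M^2), "term.labels")

# Crossing the individual columns gives main effects plus the interaction.
labels_cross <- attr(terms(C1 ~ (M.M1 + M.M2)^2), "term.labels")

fit_sq    <- lm(C1 ~ M^2)                         # M is found in the workspace
fit_cross <- lm(C1 ~ (M.M1 + M.M2)^2, data = DF)  # explicit columns
fit_dot   <- lm(C1 ~ .^2, data = DF)              # "." = every column except C1

print(labels_sq)
print(labels_cross)
print(names(coef(fit_dot)))
```

The .^2 form is probably the closest thing to a one-term shorthand here, but it crosses every column in the data frame other than the response, so the data frame has to be subset to the columns of interest first.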

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ted Harding
> Sent: Sunday, May 11, 2008 2:07 PM
> To: Myers, Brent
> Cc: r-help at r-project.org
> Subject: Re: [R] Fundamental formula and dataframe question.
>
> On 11-May-08 18:58:45, Myers, Brent wrote:
> > There is a very useful and apparently fundamental feature of R (or of
> > the package pls) which I don't understand.
> >
> > For datasets with many independent (X) variables such as chemometric
> > datasets there is a convenient formula and dataframe construction that
> > allows one to access the entire X matrix with a single term.
> >
> > Consider the gasoline dataset available in the pls package. For the
> > model statement in the plsr function one can write: Octane ~ NIR
> >
> > NIR refers to a (wide) matrix which is a portion of a dataframe. The
> > naming of the columns is of the form: 'NIR.xxxx nm'
> >
> > names(gasoline) returns...
> >
> > $names
> > [1] "octane" "NIR"
> >
> > instead of...
> >
> > $names
> > [1] "octane" "NIR.1000 nm" "NIR.1001 nm" ...
> >
> > How do I construct and manipulate such dataframes and the column names
> > that go with?
> >
> > Does the use of these types of formulas and dataframes generalize to
> > other modeling functions?
> >
> > Some specific clues on a help search might be enough, I've tried many.
> >
> > Regards,
> > Brent
>
> I don't have the 'gasoline' dataset to hand, but I can
> produce something to which your description applies as follows:
>
>   C1 <- c(1.1,1.2,1.3,1.4)
>   C2 <- c(2.1,2.2,2.3,2.4)
>    M <- cbind(M1=c(11.1,11.2,11.3,11.4),
>               M2=c(12.1,12.2,12.3,12.4))
>   DF <- data.frame(C1=C1,C2=C2,M=M)
>   DF
> #    C1  C2 M.M1 M.M2
> # 1 1.1 2.1 11.1 12.1
> # 2 1.2 2.2 11.2 12.2
> # 3 1.3 2.3 11.3 12.3
> # 4 1.4 2.4 11.4 12.4
>
> so the two columns C1 and C2 have gone in as named, and the
> matrix M (with named columns M1 and M2) has gone in with
> columns M.M1, M.M2
>
> Now let's fuzz the numbers a bit, so that the lm() fit makes sense:
>
>   C1 <- C1 + round(0.1*runif(4),2)
>   C1 <- C1 + round(0.1*runif(4),2)
>    M <- cbind(M1=c(11.1,11.2,11.3,11.4),
>               M2=c(12.1,12.2,12.3,12.4)) +
>         round(0.1*runif(8),2)
>   DF <- data.frame(C1=C1,C2=C2,M=M)
>   DF
> #     C1  C2  M.M1  M.M2
> # 1 1.21 2.1 11.19 12.13
> # 2 1.34 2.2 11.23 12.23
> # 3 1.38 2.3 11.36 12.30
> # 4 1.50 2.4 11.43 12.48
>
>   summary(lm(C1 ~ M),data=DF)
> # Call:
> # lm(formula = C1 ~ M)
> # Residuals:
> #        1        2        3        4
> # -0.02422  0.02448  0.01309 -0.01335
> # Coefficients:
> #             Estimate Std. Error t value Pr(>|t|)
> # (Intercept) -8.28435    2.48952  -3.328    0.186
> # MM1         -0.05411    0.66909  -0.081    0.949
> # MM2          0.83463    0.50687   1.647    0.347
> # Residual standard error: 0.03919 on 1 degrees of freedom
> # Multiple R-Squared: 0.9642,     Adjusted R-squared: 0.8925
> # F-statistic: 13.46 on 2 and 1 DF,  p-value: 0.1893
>
> In other words, a perfectly standard LM fit, equivalent to
>
>   summary(lm(C1 ~ M[,1]+M[,2]))
>
> (as you can check). So all that looks straightforward.
>
> One thing, however, is not clear to me in this scenario.
> Suppose, for example, that the columns M1 and M2 of M were
> factors (and that you had more rows than I've used above, so
> that the fit is non-trivial).
>
> Then, in the standard specification of an LM, you could write
>
>   summary(lm(C1 ~ M[,1]*M[,2]))
>
> and get the main effects and interactions. But how would you
> do that in the other type of specification:
>
> Where you used
>   summary(lm(C1 ~ M, data=DF))
> to get the equivalent of
>   summary(lm(C1 ~ M[,1]+M[,2]))
> what would you use to get the equivalent of
>   summary(lm(C1 ~ M[,1]*M[,2]))??
>
> Would you have to "spell out" the interaction term[s] in
> additional columns of M?
>
> Hmmm, interesting! I hadn't been aware of this aspect of
> formula and dataframe construction for modelling, until you
> pointed it out!
>
> Best wishes,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 11-May-08                                       Time: 21:06:49
> ------------------------------ XFMail ------------------------------
>
>
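
On Brent's original question of how to build a data frame whose names() shows a single matrix column, as described for gasoline: a minimal sketch, reusing the numbers from Ted's example. By default data.frame() splits a matrix into separate columns (M.M1, M.M2); wrapping the matrix in I(), or assigning it to the data frame afterwards, appears to be the usual way to keep it whole. The object names DF_split, DF_mat and DF_alt are made up for illustration.

```r
C1 <- c(1.1, 1.2, 1.3, 1.4)
M  <- cbind(M1 = c(11.1, 11.2, 11.3, 11.4),
            M2 = c(12.1, 12.2, 12.3, 12.4))

DF_split <- data.frame(C1 = C1, M = M)      # matrix is split into M.M1, M.M2
DF_mat   <- data.frame(C1 = C1, M = I(M))   # I() keeps M as one matrix column

DF_alt   <- data.frame(C1 = C1)
DF_alt$M <- M                               # assigning afterwards also keeps it whole

print(names(DF_split))     # separate columns
print(names(DF_mat))       # a single "M" entry
print(dim(DF_mat$M))       # the matrix is still there, rows by columns
print(colnames(DF_mat$M))  # its own column names are kept inside the matrix
```

With a data frame built this way, a formula such as C1 ~ M with data=DF_mat picks up the whole matrix as one term, which is the pattern Octane ~ NIR relies on. As Ted's example shows, lm() accepts a matrix term as well, so this is not specific to pls.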