# [Rd] Some potential changes (enhancements) to formulas and models

Robert Gentleman rgentlem@jimmy.harvard.edu
Mon, 18 Dec 2000 14:17:06 -0500

```
Here is part 1 of my long saga towards a more flexible modeling
Comments and hints are especially welcome.

The simple version:
Starting with a formula and data, R goes through 3 main steps to get
the data into a form suitable for fitting.

1) application of terms
2) application of model.frame (subset and na.action occur here)
3) application of model.matrix

To be concrete, think of the following two formulas:

F1:  y~a*log(b)

F2:  y~a*(1+ exp(b*t))

My goal is to introduce a meaningful way of specifying which symbols
are parameters and which are data.

For now I'm just going to talk about the terms function, and within that
just about the factors component that is returned. I've had a look at
the man pages and the White book (ch 2).

The factors component is supposed to be a matrix with the variable names
along the rows and the terms (whatever they are) as columns.

If in F1 both a and b are variables then the terms are
terms:  a, log(b), a:log(b)
the variable names are
vars: y, a, log(b)

If a or b is a parameter (say it's a) then
terms: log(b)
vars: y, log(b)

I think it would be easier if log(b) were in the terms but b
were in the vars.

With F2, things aren't so simple:
currently we get,

y ~ a + exp(b * t) + a:exp(b * t)
attr(,"variables")
list(y, a, exp(b * t))
attr(,"factors")
           a exp(b * t) a:exp(b * t)
y          0          0            0
a          1          0            1
exp(b * t) 0          1            1

This is OK, given that we haven't said anything about the variables,
but it certainly won't help us to build a model.
Now suppose I want to identify a and b as parameters,
then I want to get:
y ~ a * ( 1 + exp( b * t ) )

terms:  t
vars: y, t

If just a is a parameter then I think that we should get

terms: t, b, t:b
or terms: y, exp(b*t)
vars: y, t, b

Open questions:
1) When do the special formula operators work and when do they take their
usual interpretation?

2) I think that model.frame should produce a dataframe with
columns corresponding to the variables in the model.
- model.matrix is then responsible for using the model frame
and the terms to produce a model.matrix

In the F1 example, under this scheme the model frame
would contain y, a and log(b).
The model matrix would have a and log(b) in it.
If b appears both on its own and inside a function call, does that
correspond to two variables or one?

3) In some sense we need to define what a term is and
what a variable is. We need to do that in a way that is meaningful
for both linear and nonlinear (and possibly graphical) models.

4) Is the notion of model.matrix useful outside of linear models?
If so, is it the place where we code up the contrasts?

Robert
--
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: not yet                          |
| Harvard School of Public Health  email: rgentlem@jimmy.dfci.harvard.edu   |
+---------------------------------------------------------------------------+

```
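The three steps and the two formulas from the message can be reproduced with a small sketch. The data frame `d` and the helper `data_vars()` are hypothetical and exist only for illustration; `data_vars()` is one possible reading of "which symbols are data once the parameters are named", not something the message proposes or that R provides.

```r
# Hypothetical toy data; the column names mirror the symbols in F1 and F2.
set.seed(1)
d <- data.frame(
  y = rnorm(6),
  a = rnorm(6),
  b = runif(6, min = 1, max = 2),  # kept positive so log(b) is defined
  t = 1:6
)

# Step 1: terms() expands the formula symbolically (F1).
tt1 <- terms(y ~ a * log(b))
term_labels1 <- attr(tt1, "term.labels")  # "a" "log(b)" "a:log(b)"
factors1 <- attr(tt1, "factors")          # rows: variables, columns: terms

# Step 2: model.frame() evaluates the variables against the data.
mf1 <- model.frame(tt1, data = d)
frame_cols1 <- names(mf1)                 # "y" "a" "log(b)"

# Step 3: model.matrix() builds the design matrix from terms and frame.
mm1 <- model.matrix(tt1, mf1)
matrix_cols1 <- colnames(mm1)

# all.vars() returns the bare symbols, so b shows up rather than log(b).
symbols1 <- all.vars(y ~ a * log(b))      # "y" "a" "b"

# F2 is expanded as if it were linear in a and exp(b * t).
tt2 <- terms(y ~ a * (1 + exp(b * t)))
term_labels2 <- attr(tt2, "term.labels")

# Hypothetical helper: drop the symbols declared as parameters and
# treat whatever remains as data.
data_vars <- function(formula, params) {
  setdiff(all.vars(formula), params)
}

vars_ab <- data_vars(y ~ a * (1 + exp(b * t)), c("a", "b"))  # "y" "t"
vars_a  <- data_vars(y ~ a * (1 + exp(b * t)), "a")          # "y" "b" "t"
```

With both a and b declared as parameters, the helper leaves y and t, which matches the "vars: y, t" wished for in the F2 discussion. It says nothing about what the terms should be in that case, which is the open part of the question.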