[Rd] Some potential changes (enhancements) to formulas and models

Robert Gentleman rgentlem@jimmy.harvard.edu
Mon, 18 Dec 2000 14:17:06 -0500

Here is part 1 of my long saga towards more flexible modeling.
Comments and hints are especially welcome.

The simple version:
  Starting with a formula and data, R goes through 3 main steps to get
the data into a form suitable for fitting.

 1) application of terms
 2) application of model.frame
    (subset and na.action occur in 2).
 3) application of model.matrix
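
 A minimal sketch of these three steps, assuming a small made-up data
 frame d (the values, the subset condition and the choice of na.omit are
 only for illustration):

```r
# Made-up data; b has one missing value so that na.action has work to do.
d <- data.frame(y = c(1.2, 0.7, 3.1, 2.4, 4.0, 3.3),
                a = c(1, 2, 3, 4, 5, 6),
                b = c(1, 2, NA, 4, 5, 6))
f <- y ~ a * log(b)

# Step 1: terms() parses the formula into terms and variables.
tt <- terms(f)

# Step 2: model.frame() evaluates the variables in the data;
# subset and na.action are applied here.
mf <- model.frame(tt, data = d, subset = a < 6, na.action = na.omit)

# Step 3: model.matrix() builds the design matrix from the terms
# and the model frame.
mm <- model.matrix(tt, mf)

names(mf)     # "y" "a" "log(b)"
colnames(mm)  # "(Intercept)" "a" "log(b)" "a:log(b)"
nrow(mf)      # 4: one row dropped by subset, one by na.omit
```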

 To be concrete think of the following two formulas

F1:  y~a*log(b)

F2:  y~a*(1+ exp(b*t))

 My goal is to introduce a meaningful way of specifying which symbols
are parameters and which are data.

 For now I'm just going to talk about the terms function and, within
 it, just about the factors component that is returned. I've had a look at
 the man pages and the White book (ch 2).

  The factors component is supposed to be a matrix with the variables
  along the rows and the terms (whatever they are) as columns.

   If in F1 both a and b are variables then the terms are
    terms:  a, log(b), a:log(b)
  the variable names are
     vars: y, a, log(b)
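
  For reference, the current pieces for F1 can be inspected directly
  from the terms object. This is only a sketch of what R does today,
  with both a and b treated as variables:

```r
tt1 <- terms(y ~ a * log(b))

attr(tt1, "term.labels")  # "a" "log(b)" "a:log(b)"
attr(tt1, "variables")    # list(y, a, log(b))

# Variables along the rows, terms along the columns.
fac <- attr(tt1, "factors")
rownames(fac)  # "y" "a" "log(b)"
colnames(fac)  # "a" "log(b)" "a:log(b)"
```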

   If a or b is a parameter (say it's a) then
   terms: log(b)
   vars: y, log(b)

  I think it would be easier if log(b) was in the terms  but b
  was in the vars.
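
  One existing way to see the bare symbols, as opposed to the
  expressions that terms records as variables, is all.vars. A small
  sketch of the contrast for F1:

```r
F1 <- y ~ a * log(b)

# terms() records the expression log(b) as a variable ...
rownames(attr(terms(F1), "factors"))  # "y" "a" "log(b)"

# ... while all.vars() returns the underlying symbols, so b shows up
# on its own.
all.vars(F1)  # "y" "a" "b"
```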

    With F2, things aren't so simple:
     currently we get,

y ~ a + exp(b * t) + a:exp(b * t)
list(y, a, exp(b * t))
           a exp(b * t) a:exp(b * t)
y          0          0            0
a          1          0            1
exp(b * t) 0          1            1

  Which is ok, given that we haven't said anything about the variables
 but it certainly won't help us to build a model.
  Now suppose I want to identify a and b as parameters,
then I want to get:
  y ~ a * ( 1 + exp( b * t ) )

  terms:  t
  vars: y, t

   If just a is a parameter then I think that we should get

   terms: t, b, t:b
or terms: y, exp(b*t)
   vars: y, t, b
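
  A rough sketch of the vars part of this, assuming the parameters are
  simply named by the user. split_symbols is a hypothetical helper, not
  an existing R function, and it says nothing about what the terms
  should be:

```r
# Hypothetical helper: split the symbols in a formula into those
# declared as parameters and those left over as data variables.
split_symbols <- function(formula, params = character()) {
  syms <- all.vars(formula)
  unknown <- setdiff(params, syms)
  if (length(unknown) > 0)
    stop("not in formula: ", paste(unknown, collapse = ", "))
  list(params = intersect(syms, params),
       data   = setdiff(syms, params))
}

F2 <- y ~ a * (1 + exp(b * t))

split_symbols(F2, c("a", "b"))$data  # "y" "t"
split_symbols(F2, "a")$data          # "y" "b" "t"
```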

  Open questions:
 1) When do the special formula operators work and when do they take
    their usual interpretation?

 2) I think that model.frame should produce a dataframe with
       columns corresponding to the variables in the model.
     - model.matrix is then responsible for using the model frame
      and the terms to produce a model.matrix

     In the F1 example, under this scheme, the model frame
     would contain y, a and log(b).
     The model matrix would have a and log(b) in it.
     If b appears on its own and inside a function call, does that
     correspond to two variables or one?

 3) In some sense we need to define what a term is and
    what a variable is. We need to do that in a way that is meaningful
    for both linear and nonlinear (and possibly graphical) models.

 4) Is the notion of model.matrix useful outside of linear models?
    If so, is it the place where we code up the contrasts?
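
 For questions 1 and 2, a short sketch of the current behaviour may
 help as a baseline. labels_of is a throwaway helper defined here and
 the data frame is made up:

```r
# Throwaway helper: the term labels that terms() derives from a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

# Question 1: at the top level * is the formula operator, but inside
# I() or another function call it is ordinary arithmetic.
labels_of(y ~ a * b)       # "a" "b" "a:b"
labels_of(y ~ I(a * b))    # "I(a * b)"
labels_of(y ~ exp(b * t))  # "exp(b * t)"

# Question 2: b on its own and b inside log() currently count as two
# variables, each with its own column in the model frame.
d <- data.frame(y = c(1, 2, 3), b = c(1, 2, 4))
mf <- model.frame(y ~ b + log(b), data = d)
names(mf)  # "y" "b" "log(b)"
```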

| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: not yet                          |
| Harvard School of Public Health  email: rgentlem@jimmy.dfci.harvard.edu   |