[Rd] Advice on package design for handling of dots in a formula

Thu Oct 16 18:11:25 CEST 2014

There is the issue of best design and the issue of dots, which I think are separate.

As to the dots, I don't think there is any way out but to handle it yourself.  The formula 
parser has defined "." to mean everything in the frame that is not listed in the response.
For good or ill it allows one to type y ~ log(age) + . and get a model that has both
log(age) and age --- perhaps that is what the user wanted.
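
A minimal sketch of that expansion, using only base R. The data frame d and its column names are made up for illustration:

```r
# Hypothetical data frame; the column names are illustrative only.
d <- data.frame(y   = c(1.2, 0.7, 3.1, 2.4, 1.9),
                age = c(34, 51, 47, 62, 29),
                id  = 1:5)

# Supplying data lets terms() expand "." against the columns of d.
tt   <- terms(y ~ log(age) + ., data = d)
labs <- attr(tt, "term.labels")

# Both log(age) and age appear, along with the identifier column id.
print(labs)
```

The identifier column is picked up as a predictor as well, which is the problem the original question describes.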

Only you know that having strata(x) and x both on the right hand side does not make sense.
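
One way to "handle it yourself" is to expand the dot against a copy of the data that omits the columns reserved for other roles. This is only a sketch under that assumption; expand_dot() and its exclude argument are hypothetical names, not part of any package:

```r
# Hypothetical helper: expand "." only over columns that are not
# reserved for another role (weights, id, strata, ...).
expand_dot <- function(formula, data, exclude = character()) {
  keep <- setdiff(names(data), exclude)
  # terms() expands "." using the names of the data it is given,
  # so the excluded columns never enter the right-hand side.
  tt <- terms(formula, data = data[keep])
  formula(tt)  # drop the terms attributes, keep the expanded formula
}

# Illustrative data; w, ID and site play special roles.
d <- data.frame(y    = c(1.2, 0.7, 3.1, 2.4),
                age  = c(34, 51, 47, 62),
                w    = c(1, 2, 1, 1),
                ID   = 1:4,
                site = c("a", "a", "b", "b"))

f <- expand_dot(y ~ ., d, exclude = c("w", "ID", "site"))
print(f)
```

The package author still has to decide which columns to exclude, for instance the variables named inside strata() or passed as weights.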

I have never been sympathetic to the use of ., I suppose because it never applies to my 
own data sets.  My data always contain identifier variables: subject id, address, 
enrollment date, etc, which would never be used in a fit.
Use of "." simply never occurs outside of toy examples.  My primary advice would be to 
stop worrying about it.  (Or perhaps give me a context of why you do need to use it.)

Beyond that, a couple of design comments:
  1. When an option refers to only a single variable, there is no need for a "~", and in 
fact things are easier without it.  Look for example at the etastart option in glm().  I 
think we should use this more.  If coxph were being rewritten today the cluster(id) term 
now used in a formula to signal grouping would instead be an id= option.

  2. I like the idea of marking variables in the formula, like strata() does in coxph. 
The variable is part of the prediction but plays a different role.  I also now prefer 
setting those up so that they are not global variables, i.e. tt() makes sense only within 
the coxph call.  It took me a long time to see exactly how to do this, you will find the 
example code in coxph.  If redoing things today, strata() would be local as well.

  3. Make the formula and call easy for the user, even if you have to do more work.  This 
was the approach taken in coxme, which tears it apart and reassembles.
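
Points 1 and 2 can be sketched together in base R. The function role_sketch() below is hypothetical, and its local strata() is a stand-in written for this example. It is not the coxph code, only an assumption about one way to get the same effect: an id= argument given as a bare variable name, and a marker function that exists only while the formula is being evaluated.

```r
# Hypothetical fitting-function skeleton; all names are illustrative.
role_sketch <- function(formula, data, id) {
  # Ask terms() to flag calls to strata() on the right-hand side.
  tt <- terms(formula, specials = "strata")

  # Define strata() in a private environment so that it does not have
  # to exist globally; here it just turns its argument into a factor.
  env <- new.env(parent = environment(formula))
  assign("strata", function(x) factor(x), envir = env)
  environment(tt) <- env

  # Re-dispatch the call to model.frame().  Extra named arguments such
  # as id are evaluated inside data and stored as the column "(id)".
  cl <- match.call()
  cl <- cl[c(1L, match(c("formula", "data", "id"), names(cl), 0L))]
  cl$formula <- tt
  cl[[1L]] <- quote(stats::model.frame)
  mf <- eval(cl, parent.frame())

  list(strata_pos = attr(tt, "specials")$strata,  # index among variables
       id         = mf[["(id)"]],
       frame      = mf)
}

# Illustrative data and call: no "~" is needed for the id variable.
d <- data.frame(y    = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.2),
                x    = c(0.1, 0.4, 0.3, 0.9, 0.5, 0.7),
                ID   = 1:6,
                site = c("a", "a", "b", "b", "c", "c"))

res <- role_sketch(y ~ x + strata(site), data = d, id = ID)
print(names(res$frame))
```

The specials attribute tells the fitting code which model-frame column was marked, so that column can be pulled out and treated differently from ordinary predictors.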

If you intend to study coxph, then you should pull up the file "sourcecode.pdf", found in 
the "doc" directory of the installed survival library.  It has a lot more comments about 
my design decisions.  Certainly do this if you want to emulate the custom formula 
processing of coxme, though for that document you'll need to grab the source code and do 
"make all.pdf" in its noweb directory.

Terry Therneau

On 10/16/2014 05:00 AM, r-devel-request at r-project.org wrote:
> I am working on a new package, one in which the user needs to specify the
> role that different variables play in the analysis. Where I'm stumped is the
> best way to have users specify those roles.
>
> Approach #1: Separate formula for each special component
>
> First I thought to have users specify each formula separately, like:
>
> new.function(formula=y~X1+X2+X3,
>               weights=~w,
>               observationID=~ID,
>               strata=~site,
>               data=mydata)
>
> This seems to be a common approach in other packages. However, one of my
> testers noted that if he put formula=y~. then w, ID, and site showed up in
> the model where they weren't supposed to be. I could add some code to try to
> prevent that (string matching and editing the terms object, perhaps?), but
> that seemed a little clumsy to me.
...  rest of note not copied