[R] Formula in a model

Thu Sep 12 15:44:04 CEST 2013

Hi Gerrit,

Thank you very much for the precise explanation. 

Syntactically, I thought R was smart enough to detect that I'm using one of the columns, because the data=mytable syntax means that the input/output information is in mytable. 

For generic support, I think it would be wise to support this syntax: genericModel(table[,columnLists] ~ ., data=table), because in many cases where you have hundreds of columns, you don't know the headers but you do know the column positions of your inputs and outputs. You may ask: why not use genericModel(table[,inputColumns],table[,outputColumns])? The formula expression is more flexible and elegant. Can this become a feature in the future? Or could R at least be smart enough to detect that the output column is also among the input columns?

I'm not sure how many people will make the mistake of using this expression in the future, especially when dealing with many columns, where the easiest way to access them is by column number instead of header. The behaviour is sensible once you understand how R interprets it, but syntactically it makes sense to support the expression: mytable[,outputColumns] ~ .
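
For illustration only: one way to get the positional convenience asked for here with the existing formula machinery is to look up the header at the known position and build the formula from it. This is a minimal sketch using base R and the airquality data from the thread; the variable names outputColumn, fmla and rhs are hypothetical, and it assumes the column names are syntactically valid R names.

```r
# Same data and renaming as in the thread (airquality ships with R).
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Hypothetical: the response is known only by its position.
outputColumn <- 6

# Look up the header at that position and build the formula "f ~ ." from it.
fmla <- as.formula(paste(names(mytable)[outputColumn], "~ ."))
print(fmla)

# With the response given by name, "." expands to all other columns.
rhs <- attr(terms(fmla, data = mytable), "term.labels")
print(rhs)
```

A formula built this way can be passed to a formula-based fitting function together with data=mytable, just like a formula typed by hand.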

Regards,
Paulito

----- Original Message -----
From: Gerrit Eichner <Gerrit.Eichner at math.uni-giessen.de>
To: Paulito Palmes <ppalmes at yahoo.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Thursday, 12 September 2013, 8:53
Subject: Re: [R] Formula in a model

Hello, Paulito,

my comments are inline below:

> Thanks for the explanation. Let me give a specific example. Assume Temp 
> (column 4) is the output and the rest of the columns are the input 
> training features. Note that I only use the air quality data for 
> illustration purposes. The input->output mapping may not make sense in 
> the real interpretation of this data.
>
> library(e1071)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelSVM1=svm(mytable[,6] ~ .,data=mytable)
> modelSVM2=svm(mytable[,-6],mytable[,6])
> modelSVM3=svm(f ~ ., data=mytable)
>
> predSVM1=predict(modelSVM1,newdata=mytable)
> predSVM2=predict(modelSVM2,newdata=mytable[,-6])
> predSVM3=predict(modelSVM3,newdata=mytable)
>
> The results of predSVM2 are similar to predSVM3 but different from predSVM1.

Well, because already modelSVM1 is different from the other two. This is 
due to how the "." on the rhs of a formula is interpreted. From the help 
page of formula:

    "There are two special interpretations of . in a formula. The
    usual one is in the context of a data argument of model fitting
    functions and means 'all columns not otherwise in the formula':
    see terms.formula. In the context of update.formula, only, it
    means 'what was previously in this part of the formula'."

The first interpretation applies to your situation. With the formula for 
your modelSVM1 the function model.matrix() (which is called inside the 
formula version of svm()) creates a model matrix after looking for a 
column "mytable[,6]" in the data argument. And since there is no column 
with that name, it takes all columns of mytable (including the 6th, i.e., 
the one named "f"). See what model.matrix() does in that case:

> head( model.matrix(mytable[,6] ~ .,data=mytable), 3)
   (Intercept)  a   b    c  d e f
1           1 41 190  7.4 67 5 1
2           1 36 118  8.0 72 5 2
3           1 12 149 12.6 74 5 3

In the case of modelSVM3 model.matrix() does find column "f" in the data 
argument, and hence omits this column in forming the terms of the rhs of 
the formula:

> head( model.matrix( f ~ .,data=mytable), 3)
   (Intercept)  a   b    c  d e
1           1 41 190  7.4 67 5
2           1 36 118  8.0 72 5
3           1 12 149 12.6 74 5
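
The same difference can be inspected without building the full model matrix, by asking terms() which columns the "." expands to. This is a small sketch in base R on the data from the thread; the names rhsByIndex and rhsByName are hypothetical.

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Response given as an expression: no column is named "mytable[, 6]",
# so "." keeps all six columns, including f.
rhsByIndex <- attr(terms(mytable[, 6] ~ ., data = mytable), "term.labels")
print(rhsByIndex)

# Response given by name: f is "otherwise in the formula" and is left out.
rhsByName <- attr(terms(f ~ ., data = mytable), "term.labels")
print(rhsByName)
```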

The call to svm() for modelSVM2 is the (non-formula) default version and 
does not need to call model.matrix() because (so to say) it expects that 
the user has done that already by supplying the response to its argument y 
and the adequately formed data matrix to its argument x.

> Question: Which is the correct formulation?

The second and the third (for a sensible purpose), unless you want to 
experiment with svm() to see what happens if one does something rather 
nonsensical.
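
To see why the first formulation is not sensible, it may help to fit a plain linear model in both ways. Note that lm() is swapped in here for svm() only so that the sketch runs without e1071; it is not what the thread fitted. The names fitByIndex and fitByName are hypothetical.

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Response by index: f is also a predictor, so the fit just copies f.
fitByIndex <- lm(mytable[, 6] ~ ., data = mytable)
print(round(coef(fitByIndex), 6))    # coefficient of f is (numerically) 1
print(max(abs(resid(fitByIndex))))   # residuals are (numerically) zero

# Response by name: f is excluded from the predictors.
fitByName <- lm(f ~ ., data = mytable)
print(names(coef(fitByName)))
```

In the first fit the response leaks into the predictors, which is why a model fitted that way behaves differently from the other two formulations.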

> Why doesn't R detect the error/discrepancy in the formulation?

Because R, or in this case rather the concept of a formula and the 
function model.matrix() are not designed to replace the user who knows 
what s/he is doing after having read the documentation. ;)

> If I use the same formulation with rpart using the same data:
>
> library(rpart)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelRP1=rpart(mytable[,6]~.,data=mytable,method='anova') # this works
> modelRP3=rpart(f ~ ., data=mytable,method='anova') # this works
>
> predRP1=predict(modelRP1,newdata=mytable)
> predRP3=predict(modelRP3,newdata=mytable)
>
>
> The results of predRP1 and predRP3 are different, while the statements:
>
> modelRP2=rpart(mytable[,-6],mytable[,6],method='anova') 
> predRP2=predict(modelRP2,newdata=mytable[,-6])
>
> have errors.

This is presumably due to the same reasons as described above.

Remark: It is generally - for various reasons - recommended to use "<-" as 
the assignment operator, not "=". (And I would also recommend using 
blanks to increase the readability of code.)

[... snip ...]

  I hope the fog has lifted  --  Gerrit