[R] Formula in a model

Gerrit Eichner Gerrit.Eichner at math.uni-giessen.de
Thu Sep 12 09:53:30 CEST 2013


Hello, Paulito,

my comments are inline below:

> Thanks for the explanation. Let me give a specific example. Assume Temp 
> (column 4) is the output and the rest of the columns are the input 
> training features. Note that I only use the air quality data for 
> illustration purposes. The input->output mapping may not make sense in the 
> real interpretation of this data.
>
> library(e1071)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelSVM1=svm(mytable[,6] ~ .,data=mytable)
> modelSVM2=svm(mytable[,-6],mytable[,6])
> modelSVM3=svm(f ~ ., data=mytable)
>
> predSVM1=predict(modelSVM1,newdata=mytable)
> predSVM2=predict(modelSVM2,newdata=mytable[,-6])
> predSVM3=predict(modelSVM3,newdata=mytable)
>
> The results of predSVM2 are similar to those of predSVM3 but different from predSVM1.

Well, that is because modelSVM1 is already different from the other two. 
This is due to how the "." on the rhs of a formula is interpreted. From the 
help page of formula:

 	"There are two special interpretations of . in a formula. The
 	usual one is in the context of a data argument of model fitting
 	functions and means 'all columns not otherwise in the formula':
 	see terms.formula. In the context of update.formula, only, it
 	means 'what was previously in this part of the formula'."

The first interpretation applies to your situation. With the formula for 
your modelSVM1 the function model.matrix() (which is called inside the 
formula version of svm()) creates a model matrix after looking for a 
column "mytable[,6]" in the data argument. And since there is no column 
with that name, it takes all columns of mytable (including the 6th, i.e., 
the one named "f"). See what model.matrix() does in that case:

> head( model.matrix(mytable[,6] ~ .,data=mytable), 3)
  (Intercept)  a   b    c  d e f
1           1 41 190  7.4 67 5 1
2           1 36 118  8.0 72 5 2
3           1 12 149 12.6 74 5 3



In the case of modelSVM3 model.matrix() does find column "f" in the data 
argument, and hence omits this column in forming the terms of the rhs of 
the formula:

> head( model.matrix( f ~ .,data=mytable), 3)
  (Intercept)  a   b    c  d e
1           1 41 190  7.4 67 5
2           1 36 118  8.0 72 5
3           1 12 149 12.6 74 5
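
If you only want to see which terms the "." expands to, without printing 
the whole model matrix, something like the following sketch may help. It 
uses only base R and the renamed airquality columns from your example; the 
object names rhs_index and rhs_name are just made up for illustration.

```r
# Base R only; airquality ships with the 'datasets' package.
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Term labels that "." expands to when the response is given by index ...
rhs_index <- attr(terms(mytable[, 6] ~ ., data = mytable), "term.labels")

# ... and when the response is given by its column name
rhs_name <- attr(terms(f ~ ., data = mytable), "term.labels")

print(rhs_index)  # "f" is among the terms here
print(rhs_name)   # "f" is not among the terms here
```

The only difference between the two calls is how the response is written 
on the lhs, which is exactly what decides whether "f" ends up on the rhs.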



The call to svm() for modelSVM2 is the (non-formula) default version and 
does not need to call model.matrix() because (so to say) it expects that 
the user has done that already by supplying the response to its argument y 
and the adequately formed data matrix to its argument x.


> Question: Which is the correct formulation?

The second and the third (for a sensible purpose), unless you want to 
experiment with svm() to see what happens if one does something rather 
nonsensical.
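
If the response really has to be selected by its position (say, because 
the column number comes from elsewhere in your code), one possible way is 
to look up the column name first and build the formula from it, e.g., with 
reformulate() from the stats package. This is only a sketch under the 
assumption that the response sits in column 6; response_name and fml are 
illustrative names.

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Translate the column position into the column name ...
response_name <- names(mytable)[6]

# ... and build the formula  f ~ .  from that name
fml <- reformulate(".", response = response_name)
print(fml)

# The response column is now left out of the rhs terms
mm <- model.matrix(fml, data = mytable)
print(colnames(mm))
```

The resulting formula can then be handed to the formula version of a 
fitting function in place of a hand-typed one.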


> Why R doesn't detect error/discrepancy in formulation?

Because R, or in this case rather the concept of a formula and the 
function model.matrix(), is not designed to replace the user who knows 
what s/he is doing after having read the documentation. ;)
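
That said, nothing prevents you from writing a small check of your own. 
The following is a rough sketch, not something R or e1071 provides: a 
hypothetical helper, response_in_predictors(), that builds the model frame 
and reports whether any predictor column is identical to the response.

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Hypothetical helper (not part of R or any package):
# TRUE if some predictor column of the model frame equals the response.
response_in_predictors <- function(formula, data) {
  mf <- model.frame(formula, data = data)  # response is the first column
  y <- unname(model.response(mf))
  predictors <- mf[-1]
  any(vapply(predictors,
             function(col) isTRUE(all.equal(unname(col), y)),
             logical(1)))
}

print(response_in_predictors(mytable[, 6] ~ ., mytable))
print(response_in_predictors(f ~ ., mytable))
```

Note that such a check only catches an exact copy of the response among 
the predictors; it says nothing about whether the model is sensible.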



> If I use the same formulation with rpart using the same data:
>
> library(rpart)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelRP1=rpart(mytable[,6]~.,data=mytable,method='anova') # this works
> modelRP3=rpart(f ~ ., data=mytable,method='anova') # this works
>
> predRP1=predict(modelRP1,newdata=mytable)
> predRP3=predict(modelRP3,newdata=mytable)
>
>
> The results between predRP1 and predRP3 are different while the statements:
>
> predRP2=predict(modelRP2,newdata=mytable[,-6])
> modelRP2=rpart(mytable[,-6],mytable[,6],method='anova') 
>
> have errors.

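The difference between predRP1 and predRP3 is presumably due to the same 
reasons as described above. The errors for modelRP2 presumably arise 
because rpart(), unlike svm(), expects a formula as its first argument and 
offers no x/y (non-formula) version.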


Remark: It is generally - for various reasons - recommended to use "<-" as 
the assignment operator, not "=". (And I would also recommend using blanks 
to increase the readability of code.)

[... snip ...]


  I hope the fog has lifted  --  Gerrit

