[R] Levels in new data fed to SVM

Tue Jan 8 21:14:25 CET 2013

Hi all,
I've encountered an issue using svm (e1071) in the specific case of
supplying new data which may not have the full range of levels that
were present in the training data.

I've constructed this really primitive example to illustrate the point:

> library(e1071)
> training.data <- data.frame(x = c("yellow","red","yellow","red"), a = c("alpha","alpha","beta","beta"), b = c("a", "b", "a", "c"), stringsAsFactors = TRUE)
> my.model <- svm(x ~ .,data=training.data)
> test.data <- data.frame(x = c("yellow","red"), a = c("alpha","beta"), b = c("a", "b"), stringsAsFactors = TRUE)
> predict(my.model,test.data)
Error in predict.svm(my.model, test.data) :
  test data does not match model !
>
> levels(test.data$b) <- levels(training.data$b)
> predict(my.model,test.data)
     1      2
yellow    red
Levels: red yellow

In the first case test.data$b lacks the level "c", and as a result the
input data is rejected. I've traced this to the point where the model
matrix is created in the SVM R code. Once I extend the levels of
test.data$b to match those of the training data, the prediction works
without any problem.
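
To make the workaround above less dependent on level order, here is a
minimal sketch of what I have in mind. It assumes the columns are factors;
align_levels is a hypothetical helper of my own, not something provided by
e1071, and model.matrix is used only to illustrate the mismatch. Note that
assigning with levels<- relabels positionally, so it only happens to work
in my example because "a" and "b" come first in both sets of levels.

```r
# Hypothetical helper (not part of e1071): re-code every factor column of
# `newdata` so that it carries exactly the levels seen in `training`.
align_levels <- function(newdata, training) {
  for (col in intersect(names(newdata), names(training))) {
    if (is.factor(training[[col]])) {
      # factor() matches by label, not by position; labels never seen in
      # training become NA instead of being silently renamed.
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(training[[col]]))
    }
  }
  newdata
}

training.data <- data.frame(x = c("yellow", "red", "yellow", "red"),
                            a = c("alpha", "alpha", "beta", "beta"),
                            b = c("a", "b", "a", "c"),
                            stringsAsFactors = TRUE)

# Test data where b contains "a" and "c" but not "b"; here a positional
# levels<- assignment would wrongly turn "c" into "b".
test.data <- data.frame(x = c("yellow", "red"),
                        a = c("alpha", "beta"),
                        b = c("a", "c"),
                        stringsAsFactors = TRUE)

# The design matrices differ in width before alignment ...
n.train  <- ncol(model.matrix(~ a + b, training.data))
n.before <- ncol(model.matrix(~ a + b, test.data))

aligned <- align_levels(test.data, training.data)

# ... and agree afterwards, while the values themselves are unchanged.
n.after <- ncol(model.matrix(~ a + b, aligned))
levels(aligned$b)
as.character(aligned$b)
```

With the levels aligned this way, I would expect predict(my.model, aligned)
to be accepted in the same way as in the second predict call above, though
I have only checked the model matrix widths here.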

Assuming my test data has to come from another source, where not every
level seen in the original training data is guaranteed to appear, is
there a better way I should be handling this?

Thanks