[R] Levels in new data fed to SVM

Thu Jan 10 13:47:24 CET 2013

On 08.01.2013 21:14, Claus O'Rourke wrote:
> Hi all,
> I've encountered an issue using svm (e1071) in the specific case of
> supplying new data which may not have the full range of levels that
> were present in the training data.
>
> I've constructed this really primitive example to illustrate the point:
>
>> library(e1071)
>> training.data <- data.frame(x = c("yellow","red","yellow","red"), a = c("alpha","alpha","beta","beta"), b = c("a", "b", "a", "c"))
>> my.model <- svm(x ~ .,data=training.data)
>> test.data <- data.frame(x = c("yellow","red"), a = c("alpha","beta"), b = c("a", "b"))
>> predict(my.model,test.data)
> Error in predict.svm(my.model, test.data) :
>    test data does not match model !
>>
>> levels(test.data$b) <- levels(training.data$b)
>> predict(my.model,test.data)
>       1      2
> yellow    red
> Levels: red yellow
>
> In the first case test.data$b does not have the level "c" and this
> results in the input data being rejected. I've debugged this down to
> the point of model matrix creation in the SVM R code. Once I fill up
> the levels in the test data with the levels from the original data,
> then there is no problem at all.
>
> Assuming my test data has to come from another source where the number
> of category levels seen might not always be as large as those for the
> original training data, is there a better way I should be handling
> this?

You have to tell the factor about all of its possible levels; a factor
does not necessarily contain an observation for each of them.
That means:

levels(test.data$b) <- c("a", "b", "c")
predict(my.model,test.data)

will help.
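
A more general variant of the same idea, as a minimal sketch only: assigning with levels<- relabels the existing levels by position, so it is safe here only because the test levels "a" and "b" happen to be the first two training levels. Re-coding with factor(..., levels = ) matches by value instead. The helper name align_levels is hypothetical (it is not part of e1071 or base R), and stringsAsFactors = TRUE is set explicitly on the assumption that the columns should be factors, as in the example above. Base R's model.matrix is used to show that the number of design columns then agrees.

```r
# Data from the example above, with the columns stored as factors.
training.data <- data.frame(
  x = c("yellow", "red", "yellow", "red"),
  a = c("alpha", "alpha", "beta", "beta"),
  b = c("a", "b", "a", "c"),
  stringsAsFactors = TRUE
)
test.data <- data.frame(
  x = c("yellow", "red"),
  a = c("alpha", "beta"),
  b = c("a", "b"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: re-code each factor column of `newdata` so that it
# carries exactly the levels of the same column in `reference`.
# Matching is by value, not by position.
align_levels <- function(newdata, reference) {
  for (col in names(newdata)) {
    if (is.factor(reference[[col]])) {
      newdata[[col]] <- factor(newdata[[col]],
                               levels = levels(reference[[col]]))
    }
  }
  newdata
}

aligned <- align_levels(test.data, training.data)

# The level "c" is now known to the factor although it has no observations.
levels(aligned$b)

# The design matrices have the same number of columns after alignment.
ncol(model.matrix(~ a + b, training.data)) ==
  ncol(model.matrix(~ a + b, aligned))
```

Values in the new data that never occurred in the training data become NA with this approach, so it is worth checking for them before calling predict().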

Best,
Uwe Ligges

> Thanks