[R] SVM differences between R, Weka, Python

Valentin Kuznetsov vkuznet at gmail.com
Mon Aug 12 18:36:01 CEST 2013


Hi,
I'm studying SVMs and found that running the same SVM in R, Weka, and Python gives different results. To eliminate possible pitfalls, I decided to use the standard iris dataset and wrote implementations in R, Weka, and Python of the same SVM/kernel. I think the choice of kernel does not matter as long as it is consistent across implementations. I excluded cross-validation, since Python does not have it, and tried to keep the set of input parameters consistent across all implementations (I went through them all and the defaults seem consistent).

Weka and Python both produce an identical confusion matrix, but the R result stands apart (I tried both e1071 and kernlab; they are consistent with each other but differ from Weka/Python). That's why I decided to post my message to the R community and ask for help identifying the "problem" (if any), or for a reasonable explanation of why the R results can differ. Please note that all implementations use libsvm underneath (at least that is what I gathered from reading), so I would expect the results to be the same. I understand that seeds may differ, but I used the entire dataset without any sampling; maybe there is some internal normalization?

I'm posting the code for all implementations along with confusion matrix outputs. Feel free to reproduce and comment.

Thanks,
Valentin.

Weka:
--------------------------------------------------
#!/usr/bin/env bash
# set path to Weka
export CLASSPATH=/Applications/weka-3-6-9.app/Contents/Resources/Java/weka.jar
data=./iris.arff
kernel="weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
c=1.0
t=0.001
# -V The number of folds for the internal cross-validation. (default -1, use training data)
# -N Whether to 0=normalize/1=standardize/2=neither. (default 0=normalize)
# -W The random number seed. (default 1)
#opts="-C $c -L $t -N 2 -V -1 -W 1"
opts="-C $c -L $t -N 2"
cmd="java weka.classifiers.functions.SMO"
if [ "$1" == "help" ]; then
    $cmd
    exit 0
fi
$cmd $opts -K "$kernel" -t $data

--------------------------------------------------

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica

Python:
--------------------------------------------------
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def report(clf, x_test, y_test):
    # predict on the given set and summarize the fit
    y_pred = clf.predict(x_test)
    print(clf)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

def classifier():
    # import some data to play with
    iris = datasets.load_iris()
    x_train = iris.data
    y_train = iris.target
    regC = 1.0  # SVM regularization parameter
    clf = svm.SVC(kernel='rbf', gamma=0.01, C=regC).fit(x_train, y_train)
    report(clf, x_train, y_train)

if __name__ == '__main__':
    classifier()
--------------------------------------------------

[[50  0  0]
 [ 0 47  3]
 [ 0  5 45]]
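As a cross-check (not part of the original scripts), the "correctly classified" and kappa numbers that e1071's classAgreement reports can be computed directly from any of the confusion matrices above, which makes the three tools' outputs directly comparable. A minimal, dependency-free sketch:

```python
def agreement(cm):
    """Accuracy and Cohen's kappa from a square confusion matrix (list of rows)."""
    n = float(sum(sum(row) for row in cm))
    # observed agreement: fraction on the diagonal
    po = sum(cm[i][i] for i in range(len(cm))) / n
    # chance agreement: from row/column marginals
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(col) for col in zip(*cm)]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    return po, (po - pe) / (1 - pe)

# Weka/Python matrix from above: accuracy ~0.947, kappa ~0.92
print(agreement([[50, 0, 0], [0, 47, 3], [0, 5, 45]]))
```

Feeding it the R table further down reproduces the 0.90 / 0.85 figures that classAgreement prints.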

R:
--------------------------------------------------
library(kernlab)  # for the ksvm() comparison mentioned above; not used below
library(e1071)    # provides svm() and classAgreement()

# load data
data(iris)

# run svm algorithm (e1071 library) for the given data and kernel
# note: e1071's svm() scales all numeric variables by default (scale = TRUE)
model <- svm(Species~., data=iris, kernel="radial", gamma=0.01)
print(model)
# the last column of this dataset is what we'll predict, so we'll exclude it
prediction <- predict(model, iris[,-ncol(iris)])
# the last column holds the true labels we check against
tab <- table(pred = prediction, true = iris[,ncol(iris)])
print(tab)
cls <- classAgreement(tab)
msg <- sprintf("Correctly classified: %f, kappa %f", cls$diag, cls$kappa)
print(msg)
--------------------------------------------------

            true
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46        11
  virginica       0          4        39
[1] "Correctly classified: 0.900000, kappa 0.850000"
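One plausible explanation, stated here as an assumption rather than a verified fact: e1071's svm() standardizes all numeric variables by default (scale = TRUE), whereas the Weka run above uses -N 2 (no normalization) and scikit-learn's SVC does no scaling at all. The scikit-learn sketch below shows that standardization alone moves the confusion matrix; the scaled run should land near the R numbers above (sklearn's StandardScaler divides by the population standard deviation while R's scaling uses the sample one, so tiny differences are possible):

```python
from sklearn import datasets, svm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
x, y = iris.data, iris.target

# same RBF SVM as above, trained on the raw features...
raw = svm.SVC(kernel='rbf', gamma=0.01, C=1.0).fit(x, y)
# ...and on standardized features (what e1071 does internally by default)
xs = StandardScaler().fit_transform(x)
std = svm.SVC(kernel='rbf', gamma=0.01, C=1.0).fit(xs, y)

print(confusion_matrix(y, raw.predict(x)))
print(confusion_matrix(y, std.predict(xs)))
```

If the two matrices differ the way the R output differs from Weka/Python, the discrepancy is down to scaling, not to libsvm itself; passing scale=FALSE to e1071's svm() would be the corresponding check on the R side.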

