[Rd] crossvalidation in svm regression in e1071 gives incorrect results (PR#8554)
    Liaw, Andy 
    andy_liaw at merck.com
       
    Thu Feb  2 17:28:40 CET 2006
    
    
  
1. This is _not_ a bug in R itself.  Please don't use R's bug reporting
system for contributed packages.
2. This is _not_ a bug in svm() in `e1071'.  I believe you forgot to take
the square root: the `MSE' component holds mean squared errors, not RMSEs.
3. You really should use the `tot.MSE' component rather than the mean of
the `MSE' component, but this makes only a very small difference.

So, instead of spread[i] <- mean(mysvm$MSE), you should have spread[i] <-
sqrt(mysvm$tot.MSE).  I get:
> spread <- rep(0,20)
> for (i in 1:20) {
+     spread[i] <- svm(y ~ x,data, cross=10)$tot.MSE
+ }
> summary(sqrt(spread[i]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2679  0.2679  0.2679  0.2679  0.2679  0.2679 
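
As a quick sanity check on point 2, the summary figures from the quoted report can be converted from MSE to RMSE by hand. This is only an arithmetic sketch: the numbers are copied from the report below, and the variable names `mse' and `rmse' are made up for illustration.

```r
# Summary of mean(mysvm$MSE) over 20 runs, copied from the quoted report
mse <- c(Min = 0.06789, Median = 0.07236, Mean = 0.07310, Max = 0.08434)

# RMSE is the square root of MSE, so convert before comparing with errorest()
rmse <- sqrt(mse)
round(rmse, 4)
```

The converted values fall at roughly 0.26 to 0.29, the same range that errorest() reports, which suggests the two procedures agree once both are on the RMSE scale.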
Andy 
From: no228 at cam.ac.uk
> 
> Full_Name: Noel O'Boyle
> Version: 2.1.0
> OS: Debian GNU/Linux Sarge
> Submission from: (NULL) (131.111.8.96)
> 
> 
> (1) Description of error
> 
> The 10-fold CV option for the svm function in e1071 appears to give
> incorrect results for the rmse.
> 
> The example code in (3) uses the example regression data in the svm
> documentation. The rmse for internal prediction is 0.24. It is expected
> the 10-fold CV rmse should be bigger, but the result obtained using the
> "cross=10" option is 0.07. When the 10-fold CV is conducted either 'by
> hand' (not shown below) or using the errorest function in ipred (shown
> below) the answer is closer to 0.27, a more reasonable value.
> 
> (2) Description of system
> 
> I'm using the Debian Sarge version of R:
>    R : Copyright 2005, The R Foundation for Statistical Computing
>    Version 2.1.0  (2005-04-18), ISBN 3-900051-07-0
> 
> svm is in the e1071 package from CRAN:
>    Version: 1.5-11
>    Date: 2005-09-19
> 
> (3) Example code illustrating the problem
> 
> library(e1071)
> 
> set.seed(42)
> # create data
> x <- seq(0.1, 5, by = 0.05)
> y <- log(x) + rnorm(x, sd = 0.2)
> data <- as.data.frame(cbind(y,x))
> 
> # estimate model and predict input values
> mysvm   <- svm(y ~ x,data)
> result <- predict(mysvm, data)
> (rmse <- sqrt(mean((result-data[,1])**2)))
> # 0.2390489
> 
> # built-in 10-fold CV estimate of prediction error
> spread <- rep(0,20)
> for (i in 1:20) {
>     mysvm <- svm(y ~ x,data,cross=10)
>     spread[i] <- mean(mysvm$MSE)
>     }
> summary(spread)
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> # 0.06789 0.07089 0.07236 0.07310 0.07411 0.08434 (or something similar)
> 
> # 10-fold CV using errorest
> library(ipred)
> mysvm <- function(formula,data) {
>   model <- svm(formula,data)
>   function(newdata) predict(model,newdata)
>   }
> spread <- rep(0,20)
> for (i in 1:20) {
>   spread[i] <- errorest(y ~ x, data, model=mysvm)$error
> }
> summary(spread)
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> #  0.2601  0.2649  0.2673  0.2696  0.2741  0.2927
> 
> 
> Regards,
>  Noel O'Boyle.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>
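
The report mentions a 10-fold CV done 'by hand' but does not show it. A minimal sketch of what such a check might look like follows. The fold assignment and the names `fold', `sq_err', `fold_mse' and `tot_mse' are assumptions for illustration, not the reporter's code, and the sketch assumes that `tot.MSE' pools squared errors over all held-out points while `MSE' holds one value per fold.

```r
library(e1071)

set.seed(42)
# Same simulated data as in the report
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(length(x), sd = 0.2)
data <- data.frame(y = y, x = x)

k <- 10
# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = nrow(data)))

# Squared prediction error for each row, computed while it is held out
sq_err <- numeric(nrow(data))
for (f in seq_len(k)) {
  test <- fold == f
  model <- svm(y ~ x, data = data[!test, ])
  sq_err[test] <- (predict(model, data[test, ]) - data$y[test])^2
}

fold_mse <- tapply(sq_err, fold, mean)  # per-fold MSE (assumed analogue of MSE)
tot_mse  <- mean(sq_err)                # pooled MSE (assumed analogue of tot.MSE)
c(mean_fold_mse = mean(fold_mse), tot_mse = tot_mse, rmse = sqrt(tot_mse))
```

With 99 observations the folds cannot all be the same size, so the mean of the per-fold MSEs and the pooled MSE differ slightly. Either way, the square root is what should be compared with the RMSE from errorest().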
    
    