[R] Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

varin sacha v@r|n@@ch@ @end|ng |rom y@hoo@|r
Wed Aug 7 15:09:02 CEST 2019


Dear Experts,

I have fitted MARS and GAM models to a real dataset. My goal is prediction. I have run cross-validation many times to estimate the out-of-sample prediction error, using the mean squared error (MSE) as the evaluation criterion. I have published my paper and the reviewers have asked me to run simulations.
So my goal is now to run a simulation study, since simulations may be a better way to compare the performance of these two algorithms objectively. I want to find out which method (GAM or MARS) performs better, i.e. gives the lower MSE, and under which circumstances.
I want to consider three factors: the sample size (n), the presence of Y-outliers, and the presence of missing X-data.
For each factor I want to assess its influence on the MSE: the sample size, the percentage of Y-outliers, and the percentage of missing X-data.

Sample size: n = 50, 100, 200, 300 and 500
Y-outliers: 10%, 20%, 30%, 40% and 50% of Y-outliers
Missing data: 10%, 20%, 30%, 40% and 50% of missing X-data
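
To make the three factors concrete, here is a minimal base-R sketch of how the design and the perturbations could be set up. The helper names (draw_sample, add_y_outliers, add_x_missing), the fully crossed design, the outlier mechanism (a shift of k standard deviations) and the missing-completely-at-random mechanism are all assumptions made for illustration, and the toy data frame merely stands in for the real data.

```r
# Hypothetical helpers: names and contamination schemes are assumptions.

# Draw n rows at random (without replacement) from a data frame.
draw_sample <- function(data, n) {
  data[sample(nrow(data), n), , drop = FALSE]
}

# Turn a proportion `prop` of the responses into outliers by shifting
# them up or down by k standard deviations of the response.
add_y_outliers <- function(data, yname, prop, k = 5) {
  idx   <- sample(nrow(data), floor(prop * nrow(data)))
  signs <- sample(c(-1, 1), length(idx), replace = TRUE)
  data[[yname]][idx] <- data[[yname]][idx] + signs * k * sd(data[[yname]])
  data
}

# Set a proportion `prop` of each listed predictor to NA,
# completely at random and independently per predictor.
add_x_missing <- function(data, xnames, prop) {
  for (x in xnames) {
    idx <- sample(nrow(data), floor(prop * nrow(data)))
    data[[x]][idx] <- NA
  }
  data
}

# Fully crossed design: 5 sample sizes x 5 outlier rates x 5 missing rates.
design <- expand.grid(n     = c(50, 100, 200, 300, 500),
                      p_out = c(0.1, 0.2, 0.3, 0.4, 0.5),
                      p_mis = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Toy data standing in for the real dataset.
set.seed(123)
toy <- data.frame(y = rnorm(500), x1 = runif(500), x2 = runif(500))

d0 <- draw_sample(toy, 100)                       # n = 100
d1 <- add_y_outliers(d0, "y", prop = 0.2)         # 20% Y-outliers
d2 <- add_x_missing(d1, c("x1", "x2"), prop = 0.3) # 30% missing X-data
```

Each row of design is one simulation setting; the three helpers would be applied to the training data before the model is fitted.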

Below is the reproducible R code for GAM and MARS that I use to compute the MSE by running cross-validation many times.
How can I modify my R code to vary the sample size and to introduce Y-outliers and missing X-data?

### MSE CROSSVALIDATION GAM (GAM1)
# install.packages(c("ISLR", "mgcv"))  # run once if the packages are missing
library(ISLR)   # provides the Wage data
library(mgcv)   # provides gam()

set.seed(123)
n <- nrow(Wage)   # number of observations
p <- 0.667        # proportion of rows used for training

# Numeric vector to store the test MSE of each repetition
mse <- numeric(1000)

# Repeat the random training/test split 1000 times
for (i in 1:1000) {
  sam <- sample(1:n, floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing  <- Wage[-sam, ]

  # Fit on the training rows only, so the test rows stay unseen
  GAM1 <- gam(wage ~ education + s(age, bs = "ps") + year, data = Training)

  ypred <- predict(GAM1, newdata = Testing)
  y <- Testing$wage
  mse[i] <- mean((y - ypred)^2)
}
mean(mse)
########

##### MSE CROSSVALIDATION MARS (mars1)
# install.packages(c("ISLR", "earth"))  # run once if the packages are missing
library(ISLR)    # provides the Wage data
library(earth)   # provides earth()

set.seed(123)
n <- nrow(Wage)   # number of observations
p <- 0.667        # proportion of rows used for training

# Numeric vector to store the test MSE of each repetition
mse <- numeric(1000)

# Repeat the random training/test split 1000 times
for (i in 1:1000) {
  sam <- sample(1:n, floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing  <- Wage[-sam, ]

  # Fit on the training rows only, so the test rows stay unseen
  mars1 <- earth(wage ~ age + as.factor(education) + year, data = Training)

  ypred <- predict(mars1, newdata = Testing)
  y <- Testing$wage
  mse[i] <- mean((y - ypred)^2)
}
mean(mse)
#########
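
A rough sketch of how a simulation loop over the three factors might be organised is given below. Everything in it is an assumption made for illustration: the function names (gen_data, one_rep, fit_lm), the data-generating model, the one-sided outlier shift of 5 standard deviations, missingness in a single predictor, and the use of na.omit() (complete cases) before fitting, where imputation would be another option. lm() is used only as a stand-in learner so that the sketch runs in base R; the commented lines show where gam() and earth() would be plugged in.

```r
# Sketch only: names, data-generating model and contamination scheme
# are assumptions, not a definitive simulation design.

# Simulate one data set of size n from a known (assumed) model.
gen_data <- function(n) {
  x1 <- runif(n, 0, 10)
  x2 <- runif(n, 0, 10)
  y  <- 2 + sin(x1) + 0.5 * x2 + rnorm(n)
  data.frame(y = y, x1 = x1, x2 = x2)
}

# One replicate: contaminate the training data, fit the model,
# and return the MSE on a clean, independent test set.
one_rep <- function(n, p_out, p_mis, fit_fun, n_test = 1000) {
  train <- gen_data(n)
  test  <- gen_data(n_test)

  # Y-outliers: shift a share p_out of the responses upwards by 5 SDs
  out <- sample(n, floor(p_out * n))
  train$y[out] <- train$y[out] + 5 * sd(train$y)

  # Missing X-data: set a share p_mis of x1 to NA, completely at random
  mis <- sample(n, floor(p_mis * n))
  train$x1[mis] <- NA

  # Complete-case fit (rows with NA are dropped before fitting)
  fit <- fit_fun(na.omit(train))
  mean((test$y - as.numeric(predict(fit, newdata = test)))^2)
}

# Stand-in learner; swap in the models of interest, for example:
#   fit_gam  <- function(d) mgcv::gam(y ~ s(x1) + s(x2), data = d)
#   fit_mars <- function(d) earth::earth(y ~ x1 + x2, data = d)
fit_lm <- function(d) lm(y ~ x1 + x2, data = d)

set.seed(123)
# Reduced grid to keep the example quick; extend to the full set of levels.
design <- expand.grid(n = c(50, 100), p_out = c(0.1, 0.2), p_mis = c(0.1, 0.2))

# Average test MSE over 20 replicates per setting.
design$mse <- vapply(seq_len(nrow(design)), function(j) {
  mean(replicate(20, one_rep(design$n[j], design$p_out[j],
                             design$p_mis[j], fit_lm)))
}, numeric(1))
design
```

Running the same grid once per learner, with the same seed, would give one MSE column per method and allow GAM and MARS to be compared setting by setting.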


