[R] Trouble with Caret and C5.0

Lorenzo Isella lorenzo.isella at gmail.com
Mon Aug 31 21:25:36 CEST 2015


Dear All,
I am trying to mine a small dataset.
Admittedly, it is a bit odd, since it is a multi-class classification
task with more than 300 different classes for about 600
observations.
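
To make the class sparsity concrete, here is a minimal base-R sketch. It does not use the real data: `labels` is a hypothetical stand-in for train$productid (600 labels drawn at random from 300 made-up class names), and `k` is assumed to be the number of folds used later in the script.

```r
# Hypothetical stand-in for train$productid: 600 labels from up to 300 classes.
set.seed(1)
labels <- factor(sample(paste0("fac", 1:300), size = 600, replace = TRUE))

counts <- table(labels)   # number of observations per class
k <- 5                    # number of CV folds (assumed, matches the script)

# A class with fewer than k observations cannot show up in every held-out fold.
sparse <- sum(counts < k)
cat(nlevels(labels), "classes,", sparse, "of them with fewer than", k,
    "observations\n")
```

Running the same table() on the real productid column would show how many classes are too small to be represented in every fold.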
Having said that, the problem is not the output of my script, but the
fact that it gets stuck, without an error message, when I use C5.0 and
caret.
I recycled another script of mine which never gave me any headache, so
I do not know what is going on.
The small training set can be downloaded from


https://www.dropbox.com/s/4yseukqqvssvh63/training.csv?dl=0


whereas I paste my script at the end of the email.
C5.0 without caret completes in seconds, so I must be making a
mistake in the way I call caret.
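
As a rough sanity check on how much work caret should be doing, here is a minimal sketch. It assumes that caret fits one model per resample and per row of the tuning grid, and then refits once on the full training set; the numbers are copied from the script below, and the variable names are only for illustration.

```r
# Values copied from the trainControl() call and the tuning grid in the script.
number  <- 5   # folds per repetition
repeats <- 5   # repetitions of the k-fold split
grid <- expand.grid(trials = 10, model = "tree", winnow = FALSE)

# Assumption: one fit per resample per grid row, plus one final fit.
resample_fits <- number * repeats * nrow(grid)
total_fits <- resample_fits + 1
total_fits
```

Under that assumption the whole run should cost on the order of a few dozen single C5.0 fits, which is why a run that never finishes looks like a problem rather than just a slow computation.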
Any suggestion is appreciated.

Lorenzo

####################################################

library(caret)
library(readr)
library(C50)
library(doMC)
library(digest)


train <- read_csv("training.csv")

ncores <- 2


registerDoMC(cores = ncores)


set.seed(123)


shuffle <- sample(nrow(train))

train <- train[shuffle, ]


train$productid <- as.character(train$productid)

train$productid <- paste('fac', train$productid, sep='')

train$productid <- as.factor(train$productid)

train$State <- as.factor(train$State)

train$category <- as.factor(train$category)

train$unit <- as.factor(train$unit)

## hash each name with crc32 (vectorised, also safe for zero rows)
train$myname <- vapply(train$myname, digest, character(1),
                       algo = "crc32", USE.NAMES = FALSE)


train <- subset(train, select=-c(straincategory, description))


### this completes quickly
oneTree <- C5.0(productid ~ ., data = train, trials=10)




c50Grid <- expand.grid(trials = c(10),
                       model  = c("tree"),   ## "rules" disabled
                       winnow = c(FALSE))    ## TRUE disabled




tc <- trainControl(method = "repeatedcv", summaryFunction = mnLogLoss,
                   number = 5, repeats = 5, verboseIter = TRUE,
                   classProbs = TRUE)



### but this takes forever
model <- train(productid ~ ., data = train, method = "C5.0", trControl = tc,
               metric = "logLoss",
               ## strata = train$donation,
               ## sampsize = rep(nmin, length(levels(train$donation))),
               ## control = C5.0Control(fuzzyThreshold = T),
               maximize = FALSE,
               tuneGrid = c50Grid)


