[R] Decision Tree: Am I Missing Anything?

Vik Rubenfeld vikr at mindspring.com
Fri Sep 21 18:42:50 CEST 2012


Max, I installed C50. I have a question about the syntax. Per the C50 manual:

## Default S3 method:
C5.0(x, y, trials = 1, rules = FALSE,
     weights = NULL,
     control = C5.0Control(),
     costs = NULL, ...)

## S3 method for class 'formula'
C5.0(formula, data, weights, subset,
     na.action = na.pass, ...)

I believe I need the method for class 'formula'. But I don't yet see in the manual how to tell C50 that I want to use that method. If I run:

respLevel = read.csv("Resp Level Data.csv")
respLevelTree = C5.0(BRAND_NAME ~ PRI + PROM + REVW + MODE + FORM + FAMI + DRRE + FREC + SPED, data = respLevel)

...I get an error message:

Error in gsub(":", ".", x, fixed = TRUE) : 
  input string 18 is invalid in this locale

What is the correct way to use the C5.0 method for class 'formula'?
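[Editor's note: a minimal sketch, not from the original thread; the file and column names are assumed from the call above. In S3 dispatch, passing a formula as the first argument selects the 'formula' method automatically, so no extra argument is needed. The gsub() locale error typically means some strings in the data contain bytes that are invalid in the current locale; iconv() can help locate them.]

```r
library(C50)

# Same (hypothetical) file and columns as in the call quoted above
respLevel <- read.csv("Resp Level Data.csv")

# iconv() returns NA for strings that are invalid in the current locale,
# which flags the columns the gsub() error is complaining about:
bad <- sapply(respLevel, function(col) any(is.na(iconv(as.character(col)))))
print(names(respLevel)[bad])

# A formula as the first argument dispatches to the 'formula' method
# (C5.0.formula) automatically -- nothing further is required:
respLevelTree <- C5.0(BRAND_NAME ~ PRI + PROM + REVW + MODE + FORM +
                        FAMI + DRRE + FREC + SPED, data = respLevel)
```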


-Vik

On Sep 21, 2012, at 4:18 AM, mxkuhn wrote:

> There is also C5.0 in the C50 package. It tends to produce smaller trees than C4.5, and much smaller trees than J48, when there are factor predictors. Also, it has an optional feature-selection ("winnow") step that can be used.
> 
> Max
> 
> On Sep 21, 2012, at 2:18 AM, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
> 
>> Hi,
>> 
>> just to add a few points to the discussion:
>> 
>> - rpart() is able to deal with responses with more than two classes. Setting method="class" explicitly is not necessary if the response is a factor (as in this case).
>> 
>> - If your tree on this data is so huge that it can't even be plotted, I wouldn't be surprised if it overfitted the data set. You should check for this and possibly try to avoid unnecessary splits.
>> 
>> - There are various ways to do so for J48 trees without variable reduction. One could require a larger minimal leaf size (the default is 2), or one can use "reduced error pruning"; see WOW("J48") for more options. These can be set easily, e.g. J48(..., control = Weka_control(R = TRUE, M = 10)).
>> 
>> - There are various other ways of fitting decision trees, see for example http://CRAN.R-project.org/view=MachineLearning for an overview. In particular, you might like the "partykit" package which additionally provides the ctree() method and has a unified plotting interface for ctree, rpart, and J48.
>> 
>> hth,
>> Z
>> 
>> On Thu, 20 Sep 2012, Vik Rubenfeld wrote:
>> 
>>> Bhupendrashinh, thanks very much!  I ran J48 on a respondent-level data set and got a 61.75% correct classification rate!
>>> 
>>> Correctly Classified Instances         988               61.75   %
>>> Incorrectly Classified Instances       612               38.25   %
>>> Kappa statistic                          0.5651
>>> Mean absolute error                      0.0432
>>> Root mean squared error                  0.1469
>>> Relative absolute error                 52.7086 %
>>> Root relative squared error             72.6299 %
>>> Coverage of cases (0.95 level)          99.6875 %
>>> Mean rel. region size (0.95 level)      15.4915 %
>>> Total Number of Instances             1600
>>> 
>>> When I plot it, I get an enormous chart. Running:
>>> 
>>>> respLevelTree = J48(BRAND_NAME ~ PRI + PROM + FORM + FAMI + DRRE + FREC + MODE + SPED + REVW, data = respLevel)
>>>> respLevelTree
>>> 
>>> ...reports:
>>> 
>>> J48 pruned tree
>>> ------------------
>>> 
>>> Is there a way to further prune the tree so that I can present a chart that would fit on a single page or two?
>>> 
>>> Thanks very much in advance for any thoughts.
>>> 
>>> 
>>> -Vik
>>> 




More information about the R-help mailing list