Collapsing solution to the question discussed above: Re: [R] multi-class classification using rpart

Huntsinger, Reid reid_huntsinger at merck.com
Tue Jan 25 22:45:34 CET 2005


You could break your 3-class problem into several (2 or 3) 2-class problems,
and then use Andy's suggestion (see the CART book). There are several ways
to break the problem into 2-class problems, and several ways to combine the
resulting classifiers. Tom Dietterich, Jerry Friedman, Trevor Hastie and Rob
Tibshirani, among others, have articles on the question, in places like
Annals of Statistics and Machine Learning, from the mid-to-late 1990s.
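
To make the decomposition concrete, here is a minimal one-vs-rest sketch. It
is only one of the several ways mentioned above, and it uses the built-in
iris data (3 classes) as a stand-in, since the poster's data are not
available. The helper name fit_one_vs_rest and the "pick the highest
probability" combination rule are assumptions made for illustration.

```r
library(rpart)

# One-vs-rest sketch: fit one 2-class tree per class, then predict the
# class whose tree reports the highest "yes" probability.
data(iris)
classes    <- levels(iris$Species)
predictors <- iris[, setdiff(names(iris), "Species")]

# Hypothetical helper: build a 2-class problem "cls vs everything else"
fit_one_vs_rest <- function(cls) {
  d <- predictors
  d$target <- factor(ifelse(iris$Species == cls, "yes", "no"),
                     levels = c("no", "yes"))
  rpart(target ~ ., data = d, method = "class")
}

trees <- lapply(classes, fit_one_vs_rest)
names(trees) <- classes

# One column of P(target = "yes") per 2-class tree
scores <- sapply(trees, function(tr) {
  predict(tr, newdata = predictors, type = "prob")[, "yes"]
})

predicted <- factor(classes[max.col(scores, ties.method = "first")],
                    levels = classes)
accuracy  <- mean(predicted == iris$Species)  # resubstitution accuracy only
```

Each 2-class tree can then use whatever 2-class shortcut is available for
high-cardinality factors. The accuracy computed here is on the training
data, so it is optimistic.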

Alternatively, or in addition, you could look at the simulated annealing
approach to searching for a good split for a categorical variable in
Quinlan's C4.5 book and implement that in R. 
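
As a rough idea of what such a search could look like in R, here is a
generic simulated annealing sketch over binary partitions of the levels,
scored by weighted Gini impurity. It is not taken from the C4.5 book; the
function names, the starting temperature and the cooling rate are all
assumptions chosen for illustration.

```r
# Weighted Gini impurity of the two children obtained by sending the levels
# flagged in `left` to the left child and the remaining levels to the right.
split_gini <- function(counts, left) {
  gini <- function(n) if (sum(n) == 0) 0 else 1 - sum((n / sum(n))^2)
  nl <- colSums(counts[left, , drop = FALSE])
  nr <- colSums(counts[!left, , drop = FALSE])
  (sum(nl) * gini(nl) + sum(nr) * gini(nr)) / sum(counts)
}

# Hypothetical annealing search; tuning values are guesses, not recommendations
anneal_split <- function(f, y, n_iter = 2000, temp = 0.05, cooling = 0.998) {
  counts <- unclass(table(droplevels(f), y))   # levels x classes counts
  k <- nrow(counts)
  left <- rep(c(TRUE, FALSE), length.out = k)[sample.int(k)]  # random start
  cur <- split_gini(counts, left)
  best <- left
  best_val <- cur
  for (i in seq_len(n_iter)) {
    cand <- left
    j <- sample.int(k, 1)
    cand[j] <- !cand[j]                        # move one level across
    if (all(cand) || !any(cand)) next          # skip degenerate splits
    val <- split_gini(counts, cand)
    # Always accept improvements; accept worse moves with a probability
    # that shrinks as the temperature falls
    if (val < cur || runif(1) < exp((cur - val) / temp)) {
      left <- cand
      cur <- val
      if (cur < best_val) {
        best <- left
        best_val <- cur
      }
    }
    temp <- temp * cooling
  }
  list(left_levels = rownames(counts)[best], gini = best_val)
}

# Made-up data: 88 levels, 3 classes, first half of the levels favour "a"
set.seed(1)
f <- factor(sample(sprintf("%04d", 1001:1088), 2000, replace = TRUE))
p_a <- ifelse(as.integer(f) <= 44, 0.8, 0.1)
y <- factor(ifelse(runif(2000) < p_a, "a",
                   sample(c("b", "c"), 2000, replace = TRUE)))
res <- anneal_split(f, y)
root_gini <- 1 - sum(prop.table(table(y))^2)
```

The search evaluates a few thousand partitions instead of all 2^(88-1) - 1,
at the price of no guarantee that the best split is found.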

There are also many ways to create "indices" to use in place of the
categorical variable. These often depend on some hierarchical structure,
like with SIC codes.
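
For example, if the codes are hierarchical in the way SIC codes are, the
leading digits can serve as a coarser index. The codes below are made up,
and whether the leading two digits of the poster's codes carry any meaning
is an assumption.

```r
# Hypothetical 4-digit industry codes, in the style of the V141 levels
codes <- factor(c("1001", "1002", "1011", "2041", "2043", "7372", "7373"))

# Collapse each code to its leading two digits (the "major group")
major_group <- factor(substr(as.character(codes), 1, 2))

nlevels(codes)        # 7 distinct codes
nlevels(major_group)  # 3 distinct major groups: "10", "20", "73"
```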

Reid Huntsinger



-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Tuesday, January 25, 2005 3:59 PM
To: Uwe Ligges
Cc: R-help at stat.math.ethz.ch; Liaw, Andy
Subject: Collapsing solution to the question discussed above: Re: [R]
multi-class classification using rpart


Hi, All:
The variable is used to encode industries: like computer science,
electronics and so on. Therefore, there is no order in them.

My previous efforts indicate that grouping them according to some
domain knowledge decreases the accuracy. However, my current thought is
to collapse them using some "distance" or "entropy" measure, since
it is a classification problem. I am searching for papers that
discuss this topic.
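
One way to read the "distance" idea is to describe each level by the class
proportions observed within it and then cluster levels with similar
profiles. The sketch below does this with Euclidean distance and average
linkage; both choices, the function name collapse_levels and the simulated
data are assumptions, and an entropy-based divergence could be swapped in
for dist().

```r
# Hypothetical helper: merge the levels of f into k groups according to
# how similar their class distributions in y are.
collapse_levels <- function(f, y, k) {
  f <- droplevels(f)                       # unused levels would give NaN rows
  # Row i holds the class proportions observed within level i
  props <- unclass(prop.table(table(f, y), margin = 1))
  hc <- hclust(dist(props), method = "average")
  groups <- cutree(hc, k = k)              # named by level
  # Map every observation to the group of its level
  factor(groups[as.character(f)], labels = paste0("g", seq_len(k)))
}

# Made-up data: 88 levels and a 3-class response
set.seed(42)
f <- factor(sample(sprintf("%04d", 1001:1088), 3000, replace = TRUE))
y <- factor(sample(c("a", "b", "c"), 3000, replace = TRUE))
g <- collapse_levels(f, y, k = 8)
nlevels(g)  # 8
```

Because the grouping is derived from the response, it should be built on
the training data only; otherwise the resulting accuracy estimate will be
optimistic.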

Does anyone have more ideas or pointers to papers?

Thanks.

Ed


On Tue, 25 Jan 2005 21:49:26 +0100, Uwe Ligges
<ligges at statistik.uni-dortmund.de> wrote:
> WeiWei Shi wrote:
> 
> > Hi, Andy:
> > Thanks. It works after I removed the variable. I think I got a similar
> > problem when I used randomForest. And I am not sure if they were due
> > to the same reason.
> >
> > Unfortunately, in practice that variable is very important to the
> > accuracy. I am wondering if there is another way besides collapsing
> > it. BTW, I remember you mentioned an alternative implementation of
> > randomForest (provided by the author) that avoids the upper limit
> > (32, if I am correct) on the number of factor levels in the R
> > version of randomForest.
> >
> > Thanks for further assistance!
> 
> 
> So you *really* want it to be factor?! Thought it was a mistake not to
> have it numerical....
> Amazing! Maybe computers are sometimes even too fast these days.
> 
> Uwe
> 
> 
> > Ed
> >
> > On Tue, 25 Jan 2005 14:58:04 -0500, Liaw, Andy <andy_liaw at merck.com> wrote:
> >
> >>>From: WeiWei Shi
> >>>
> >>>Hi,
> >>>I am trying to build a multi-class classification tree using rpart.
> >>>I tested it on the fgl data from the MASS package and it works well.
> >>>
> >>>However, when I use my small sample of data as below, the program seems
> >>>to take forever. I am not sure if it is just slow or if there is
> >>>something wrong with my code or data manipulation.
> >>>
> >>>Please advise!
> >>>
> >>>The data are described by the output of the str() function. The call to
> >>>rpart is:
> >>>
> >>>library(rpart)
> >>># use the bare column name so V142 is not also picked up by '.'
> >>>test_tree <- rpart(V142 ~ ., data = x,
> >>>                   parms = list(split = 'gini'), cp = 0.01)
> >>>
> >>>The response variable is V142, with 3 levels.
> >>>
> >>>Thanks for your suggestions!
> >>>
> >>>Ed.
> >>
> >>[snip]
> >>
> >>
> >>> $ V141: Factor w/ 88 levels "1001","1002",..: 59 59 59 59 59 59 55 78 7 73 ...
> >>
> >>I'd bet this is the problem.  There are 2^(88-1) - 1 possible ways to
> >>split a factor with 88 levels.  It will work on those splits til the
> >>cows come home...
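
The count quoted above is easy to check in R. The function name
n_binary_splits is made up for this sketch.

```r
# Number of distinct ways to split n unordered levels into two non-empty
# groups: 2^n assignments, halved because left/right can be swapped, minus
# the one assignment that leaves a side empty.
n_binary_splits <- function(n_levels) 2^(n_levels - 1) - 1

n_binary_splits(3)    # 3: {a}|{b,c}, {b}|{a,c}, {c}|{a,b}
n_binary_splits(32)   # roughly 2.1e9
n_binary_splits(88)   # roughly 1.5e26
```

Even at a billion splits evaluated per second, 1.5e26 candidates would take
billions of years, which is why an exhaustive search never finishes here.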
> >>
> >>I'd suggest getting rid of that variable, or collapsing the levels to
> >>something more reasonable.  The CART book describes some heuristic
> >>shortcuts for testing only n-1 splits for factors with n levels, but I
> >>believe that only works for 2-class problems, if I'm not mistaken.
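
For the 2-class case, the n-1 shortcut can be sketched as follows: order the
levels by the proportion of one class within each level, and consider only
the splits that cut this ordering at one point. The function name
ordered_splits and the toy data are assumptions; this is an illustration of
the idea, not rpart's internal code.

```r
# Hypothetical helper: return the n-1 candidate "left" level sets obtained
# by ordering levels on the within-level proportion of the positive class.
ordered_splits <- function(f, y, positive = levels(y)[2]) {
  f <- droplevels(f)
  # Proportion of the positive class within each level
  p <- sapply(split(y == positive, f), mean)
  ord <- names(sort(p))
  # Candidate i sends the i lowest-proportion levels to the left child
  lapply(seq_len(length(ord) - 1), function(i) ord[seq_len(i)])
}

f <- factor(c("A", "A", "B", "B", "C", "C", "D", "D"))
y <- factor(c("no", "no", "yes", "yes", "no", "yes", "yes", "no"))
splits <- ordered_splits(f, y)
length(splits)  # 3 candidates for 4 levels, instead of 2^(4-1) - 1 = 7
```

With 3 or more classes there is no single proportion to order the levels
by, which is why the shortcut does not carry over directly.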
> >>
> >>Andy
> >>
> >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> 
>
