[R] m x 2 contingency table or 2^(m-1)-1 2x2 contingency tables in the context of feature selection for random forest

Thu Sep 28 19:52:03 CEST 2006

Dear Listers:

I have a feature selection problem with categorical variables for
random forest.

Suppose I have a multi-level categorical variable A with m=3 levels
(red, green, and blue), and the target is a binary class label.

I want to evaluate its power to discriminate between the 2 classes. We
know rf splits a multi-level categorical variable by considering all
ways of dividing its levels into two groups. Now suppose I have 1000
such multi-level categorical variables and I need to do some feature
selection. For that I would like to try chi-square tests (or
information gain).
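
To make the count concrete: m levels can be divided into two non-empty
groups in 2^(m-1)-1 ways. Here is a minimal sketch in Python (standard
library only) that lists those groupings. The function name
binary_splits is only my own illustration, and this is not rf's actual
code.

```python
from itertools import combinations


def binary_splits(levels):
    """List every way to divide `levels` into two non-empty groups.

    The first level is always kept in the left group, so each grouping
    is produced once rather than twice (left/right swapped).
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # r < len(rest) guarantees at least one level stays on the right.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = (first,) + combo
            right = tuple(l for l in rest if l not in combo)
            splits.append((left, right))
    return splits


for left, right in binary_splits(["red", "green", "blue"]):
    print(left, "vs", right)
```

For m=3 this gives 3 groupings, but the count roughly doubles with
every extra level, which matters with 1000 variables.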

To match the splitting method used in rf, I am wondering whether I
should simply use a single m x 2 contingency table, or build the
2^(m-1)-1 2x2 contingency tables (one per two-group division of the
levels) and take the best p-value as the measure of A's power. I am
fairly sure the latter closely resembles what rf does. But is the
former good enough?
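
Here is a rough sketch of the two options side by side, again in
Python with the standard library only. The counts are made up purely
for illustration, the helper names (chi2_stat, chi2_sf, pvalue,
best_2x2_pvalue) are my own, and the test is the plain Pearson
chi-square without any continuity correction, so the numbers may
differ from what a statistics package reports for 2x2 tables.

```python
import math
from itertools import combinations


def chi2_stat(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of counts; all margins are assumed > 0.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf(stat, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)               # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))    # df = 1
    # Step up two degrees of freedom at a time (incomplete-gamma recurrence).
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def pvalue(table):
    stat, df = chi2_stat(table)
    return chi2_sf(stat, df)


def best_2x2_pvalue(counts):
    """Smallest p-value over all 2^(m-1)-1 two-group divisions of the levels."""
    levels = list(counts)
    first, rest = levels[0], levels[1:]
    best = 1.0
    for r in range(len(rest)):                 # right group never empty
        for combo in combinations(rest, r):
            left = {first, *combo}
            left_row = [sum(counts[l][k] for l in left) for k in (0, 1)]
            right_row = [sum(counts[l][k] for l in levels if l not in left)
                         for k in (0, 1)]
            best = min(best, pvalue([left_row, right_row]))
    return best


# Made-up counts per level of A: (class 0, class 1).
counts = {"red": (30, 10), "green": (20, 20), "blue": (10, 30)}

p_mx2 = pvalue([list(c) for c in counts.values()])   # one m x 2 table
p_best = best_2x2_pvalue(counts)                     # best of the 2x2 tables

print("m x 2 p-value:   ", p_mx2)
print("best 2x2 p-value:", p_best)
```

One caveat I am aware of: the best p-value over several 2x2 tables is
a minimum over multiple tests, so it is not adjusted for the number of
splits tried and will tend to favor variables with more levels.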

Thanks.
-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III