[R] Chi2 algorithm - R

Luke Skywalker mattered91 at gmail.com
Wed Nov 23 17:08:38 CET 2016


Good evening,

I'm getting a discretization that differs from the one described in Liu and
Setiono's 1997 paper when I use the Chi2 algorithm for feature selection
with discretization.

As stated in the R documentation (discretization - R (from CRAN)
<https://cran.r-project.org/web/packages/discretization/discretization.pdf>),
the R package discretization offers the function chi2, which is based on the
following papers:

Liu, H. and Setiono, R. (1995). Chi2: Feature selection and discretization
of numeric attributes, Tools with Artificial Intelligence, 388–391.

Liu, H. and Setiono, R. (1997). Feature selection and discretization, IEEE
transactions on knowledge and data engineering, Vol.9, no.4, 642–645.

I wrote the following R code, in which I set alpha and delta to the values
used in the papers above. The code prints the discretized data frame. I used
the Iris data set, as in one of the examples in the two papers. The first
paper above states that alpha = 0.5 and delta = 5%, and that "the originally
odd numbered data are selected for training (75 patterns) and rest for
testing (75 patterns)". With this setup, the Sepal attributes should be
removed.

library(discretization)
data(iris)
df1 <- iris[FALSE,]
for(i in 1:nrow(iris)){
    if(i %% 2 != 0){
        df1 <- rbind(df1, iris[i,])
    }
}
chi2(df1, alp=0.5, del=0.05)$Disc.data
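
For what it's worth, the same odd-row split can be sketched without a loop.
This uses only base R; the names odd_rows, train and test are mine and do
not come from the papers or the package:

```r
# Select the originally odd-numbered rows (1, 3, 5, ...) for training
# and keep the even-numbered rows for testing.
data(iris)
odd_rows <- seq(1, nrow(iris), by = 2)
train <- iris[odd_rows, ]   # should match df1 built by the loop above
test  <- iris[-odd_rows, ]  # the remaining rows

nrow(train)            # number of training patterns
table(train$Species)   # class balance of the training half
```

Because iris is ordered by species in blocks of 50, this split keeps the
three classes balanced in both halves.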

The point is that, in the data frame printed by the last instruction, no
attribute is removed. The discretized data frame still has 4 discretized
attributes: if I understood the papers correctly, Sepal Length and Sepal
Width should both have been discretized into a single interval by the Chi2
algorithm.
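
A quick way to check this is to count the distinct interval labels per
attribute. The helper below is only a sketch: n_intervals is a hypothetical
name of mine, it assumes the class label sits in the last column (as in
iris), and the toy data frame is made up for illustration rather than being
real chi2() output:

```r
# Hypothetical helper: number of distinct interval labels per attribute.
# Assumes the last column holds the class label.
n_intervals <- function(disc) {
  attrs <- disc[, -ncol(disc), drop = FALSE]
  sapply(attrs, function(col) length(unique(col)))
}

# Made-up discretized data: attribute a collapsed to one interval,
# attribute b kept three intervals.
toy <- data.frame(a = c(1, 1, 1, 1),
                  b = c(1, 2, 2, 3),
                  class = c("x", "x", "y", "y"))
n_intervals(toy)
```

Applied to chi2(df1, alp=0.5, del=0.05)$Disc.data, a count of 1 would mean
the attribute was merged into a single interval, which is what I expected
for the two Sepal attributes.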

I have posted a question here:
http://stats.stackexchange.com/questions/247499/why-does-not-r-chi2-algorithm-discretize-in-the-same-manner-as-in-the-paper-by-l?noredirect=1#comment470974_247499


Moreover, it's really hard to understand the cut points that the Chi2
implementation in R produces. For example:

res <- chi2(iris, 0.5, 0.05)

cut(iris$Sepal.Length, res$cutp, labels=FALSE) is different from
res$Disc.data$Sepal.Length
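
One possible source of the mismatch, though I'm not sure it is the whole
story: as far as I can tell from the package documentation, res$cutp is a
list with one element per attribute, so the cut points for Sepal.Length
would be res$cutp[[1]] rather than res$cutp. In addition, if those are only
the interior cut points (an assumption on my part), cut() returns NA for
every value outside them. The base R sketch below uses made-up values and
cut points, not the ones chi2() returns:

```r
x     <- c(4.3, 5.0, 5.5, 6.1, 7.9)  # made-up attribute values
inner <- c(5.45, 6.15)               # made-up interior cut points

# Interior cut points only: values outside (5.45, 6.15] become NA.
with_na <- cut(x, inner, labels = FALSE)

# Adding open-ended outer breaks gives every value an interval index.
full <- cut(x, c(-Inf, inner, Inf), labels = FALSE)

with_na
full
```

Note also that when breaks is a single number, cut() treats it as the number
of intervals rather than as a cut point, which could be another reason the
two results disagree.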

Please help me understand.

Best regards



