[R] visualization of KNN results in text classification

Fri May 12 15:35:00 CEST 2017

> On 12 May 2017, at 15:30, Elahe chalabi <chalabi.elahe at yahoo.de> wrote:
> 
> 
> 
> Thanks for your reply. What I have is a data frame with rows containing the words used in each speech and columns containing the frequencies of these words. I have an extra row showing the type of speech: whether it came from the control group or the Alzheimer group. I then create a training and a test set for KNN from this data frame and use KNN to classify the speeches, assigning every speech (actually the text of the speech!) to its group, control or Alzheimer.
> Now my question is: how can I visualize my KNN classifier or its results? Right now I only have an accuracy matrix from KNN!
> 
> Thanks for any help!
> Elahe 

It would be very helpful if you provided a minimal example so that we can understand your data and what you have done with it. You described your data in words, but it is still unclear. So I created a minimal example for you.

For simplicity, I have a data.frame with 3 columns. The first 2 are numeric and the last one is a factor. The Group column holds the real classes. The A and B columns are a kind of numeric representation of these classes. Let’s call them features, because they carry the hidden information that characterizes a class. I use about 33% of the data for training and 67% for testing.

This is the part you asked about. After classification, I have a factor variable, test.guess.cluster, which contains the classes (clusters) predicted by the knn method (you mentioned an accuracy matrix from KNN; I don’t know what that is). Now I want to see the clusters on a plot. That’s why I converted test.guess.cluster to numeric, so that I can use it to colour the points on the plot. I plotted the points in the test.df data.frame (A versus B) and coloured them by predicted class.

At the end, I evaluated the overall performance of the knn model. Is it good or bad? Please note that you have to choose your own _k_ value and the size of the training dataset by trial and error.

library(class)
library(gmodels)
set.seed(6)
df <- data.frame(A = c(rnorm(30, 0), rnorm(30, 3)),
                 B = c(rnorm(30, 0), rnorm(30, 3)),
                 Group = factor(c(rep("G1", 30), rep("G2", 30))))
# use 33% of data for training and 67% is for test
i <- sample(2, nrow(df), replace = TRUE, prob = c(0.67, 0.33))
train.df <- df[i == 2, -3] # do not include last column
train.cl <- df[i == 2, 3] # real clusters for training
test.df <- df[i == 1, -3] # test data.frame
test.real.cluster <- df[i == 1, 3] # real clusters for test
# predicted clusters by knn
test.guess.cluster <- knn(train = train.df, test = test.df, cl = train.cl, k = 3)
# convert them to numeric to colour the points on the plot
test.guess.cluster.num <- as.numeric(test.guess.cluster)
plot(test.df, col = test.guess.cluster.num, pch = test.guess.cluster.num)

# examine the result of CrossTable (rows: predicted, columns: real)
# In this run the model identified 2 G1 elements as G2 and 1 G2 element as G1.
# Hence, 3 elements are misclassified (you can spot them on the plot).
gm <- gmodels::CrossTable(test.guess.cluster, test.real.cluster, prop.chisq = FALSE)
sum(diag(gm$prop.tbl)) # overall accuracy of the model, (34 - 3)/34 in this run
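
The "trial and error" over _k_ mentioned above can be sketched roughly as follows. This is only a minimal illustration, assuming the same kind of toy data as in the example; the candidate list k.values and the blue circles around misclassified points are my own illustrative choices, not something the original example did. The exact accuracies depend on the random split.

```r
library(class)  # provides knn()

set.seed(6)
# Toy data of the same shape as above: two numeric features, two groups
df <- data.frame(A = c(rnorm(30, 0), rnorm(30, 3)),
                 B = c(rnorm(30, 0), rnorm(30, 3)),
                 Group = factor(c(rep("G1", 30), rep("G2", 30))))

# Random split: about one third for training, the rest for testing
i <- sample(2, nrow(df), replace = TRUE, prob = c(0.67, 0.33))
train.df <- df[i == 2, -3]
train.cl <- df[i == 2, 3]
test.df <- df[i == 1, -3]
test.real.cluster <- df[i == 1, 3]

# Candidate k values (an assumption for illustration); k cannot usefully
# exceed the number of training rows
k.values <- c(1, 3, 5, 7, 9)
k.values <- k.values[k.values <= nrow(train.df)]

# Fraction of test points classified correctly for each k
accuracy <- sapply(k.values, function(k) {
  guess <- knn(train = train.df, test = test.df, cl = train.cl, k = k)
  mean(guess == test.real.cluster)
})
names(accuracy) <- k.values
print(accuracy)

# Plot the test points for k = 3 and circle the misclassified ones
guess <- knn(train = train.df, test = test.df, cl = train.cl, k = 3)
wrong <- guess != test.real.cluster
plot(test.df, col = as.numeric(guess), pch = as.numeric(guess))
points(test.df$A[wrong], test.df$B[wrong], cex = 2.5, col = "blue")
legend("topleft", legend = levels(guess), col = 1:2, pch = 1:2)
```

Comparing accuracies on a single small test set is noisy, so treat the best-looking _k_ as a hint rather than a final answer.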

> 
> 
> On Monday, May 8, 2017 3:55 PM, Ismail SEZEN <sezenismail at gmail.com> wrote:
> 
> 
> 
> As far as I know, kNN groups by Euclidean distance, so you need numerical data as input. You said your dataset has only “speeches” and “type of people”. Are both of these inputs, or is one the input and the other the output? Type of people should be a factor variable (I guess). I don’t know how you represent “speech” in your dataset: as character data or as a numerical representation of a feature? If you send a minimal example of the problem, we can help you. Please read the posting guide.