[R] Supervised Learning for Text Classification

Thu Aug 6 09:42:26 CEST 2015

Hi All,

This is my current process:

1.       Reading unstructured data and working out what each piece of text means, e.g. a phrase like "JOHN SMITH" is a customer name, and a phrase like "CLAIM DURATION" is a reason-type field. Currently I do this manually (no issue here).

2.       Creating a data frame with the fields Reason, Name, Category, Sub-Reason and Description. There are no boolean or numeric fields (do I need to create any, given what #5 requires?).

3.       Manually filling the data frame with values read from the unstructured text. This will have approximately 50 rows, and it is my training set (no issue here).

4.       The data frame from #3 is therefore simply a supervised training data set.

Step #5 is what I need to achieve (and where I need help):

5.       Now, when I receive another file (test data), let's say to start with just a phrase "******************", can I predict from the trained model that it is a Reason-related phrase (based on the probability, let's say more than 70%), then create a new data frame (PredictedDF), add a column (e.g. Reason) and add a row for this text under the Reason field? When I then receive one more text phrase that looks like a "Name", I would add a column "Name" to PredictedDF and put this value under the Name column as the second row, and so on.
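
To make step 5 concrete, here is a rough sketch of the behaviour I am after. It is written in Python purely for illustration and uses a small hand-rolled naive Bayes classifier as a stand-in for whatever model ends up being used; it is not RTextTools or any other package's API. Only "JOHN SMITH" and "CLAIM DURATION" come from my data. The other training phrases, the function names, the "Unknown" column for low-confidence phrases, and the list of dicts standing in for PredictedDF are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training rows (phrase, field). Only the first two phrases
# come from the post; the rest are invented for illustration.
TRAIN = [
    ("JOHN SMITH", "Name"),
    ("CLAIM DURATION", "Reason"),
    ("MARY JONES", "Name"),
    ("CLAIM AMOUNT", "Reason"),
    ("PETER SMITH", "Name"),
    ("CLAIM REJECTED", "Reason"),
]


def tokens(text):
    """Split a phrase into lower-case word tokens."""
    return text.lower().split()


def train(rows):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in rows:
        label_counts[label] += 1
        for tok in tokens(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def predict_proba(model, text):
    """Return {label: probability} for one phrase."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for tok in tokens(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalise the log scores into probabilities.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def add_prediction(predicted, model, text, threshold=0.7):
    """Append one row to `predicted`, filed under the most likely field.

    Phrases whose best probability is below `threshold` go under a
    hypothetical "Unknown" column instead.
    """
    probs = predict_proba(model, text)
    label = max(probs, key=probs.get)
    if probs[label] < threshold:
        label = "Unknown"
    predicted.append({label: text})
    return label, probs


model = train(TRAIN)
predicted_df = []  # stand-in for PredictedDF: one dict per row
for phrase in ["CLAIM DELAYED", "ANNA SMITH"]:
    add_prediction(predicted_df, model, phrase)
```

Each incoming phrase becomes one row with a value only in the predicted column, so new columns appear as new fields are predicted. With only about 50 training rows the probabilities will be rough, which is why I keep the 70% cut-off as a parameter.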

I was reading about RTextTools (http://www.rtexttools.com/), but in that case it has to be told that this value belongs to this text, and so on...

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri
