[R] R Parse HTML tabular data and apply NLP

Thu Jul 30 07:23:49 CEST 2015

Hi All,

I have quite a few files that contain HTML tabular data. The files all have different formats, with numerous nested tables, different information, and completely different table structures. The only thing they have in common is that the data is in tables.

I was able to read the tables using the readHTMLTable function. For example, one file has 23 tables, and I was able to put all of the data into one data frame. The read function is not able to interpret the headers (which it is not supposed to do anyway), so it creates variables like V1, V2..
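
One possible workaround for the V1, V2.. names is to promote the first row of a table to column names after reading. The following is only a minimal base-R sketch, assuming the first row really is a header row. The names raw_tbl and promote_header are made up for illustration, and the data frame stands in for one element of the list that readHTMLTable returns. readHTMLTable also accepts a header argument, which may be enough when the header row is regular.

```r
# Hypothetical table as it might come back when no header was detected
raw_tbl <- data.frame(V1 = c("Name", "alpha", "beta"),
                      V2 = c("Value", "1", "2"),
                      stringsAsFactors = FALSE)

# Use the first row as column names and drop it from the data
promote_header <- function(df) {
  hdr <- trimws(unlist(df[1, ], use.names = FALSE))
  out <- df[-1, , drop = FALSE]
  names(out) <- make.names(hdr, unique = TRUE)  # syntactically valid, unique names
  rownames(out) <- NULL
  out
}

clean_tbl <- promote_header(raw_tbl)
print(clean_tbl)
```

With nested tables of varying layout, the header row would have to be identified per table, so this only covers the simple case.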

Now that I have all the text in a data frame (the data is scattered across different columns), how do I interpret the text using machine learning, i.e. train a model to learn that this text (sentence, word..) means one thing and that text means another? Basically, I want automatic categorization of all the text in the data frame.
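
Because the text is scattered across columns, a usual first step before any categorization is to flatten every table into one row per cell. A rough base-R sketch, assuming the tables are held as a list of data frames; the list below is a made-up stand-in for readHTMLTable output, and flatten_table is a hypothetical helper name.

```r
# Hypothetical stand-in for the list of data frames read from one file
tables <- list(
  data.frame(V1 = c("Invoice No", "Total"), V2 = c("12345", "99.50"),
             stringsAsFactors = FALSE),
  data.frame(V1 = "Customer", V2 = "ACME Ltd", V3 = NA_character_,
             stringsAsFactors = FALSE)
)

# Turn one table into long form: one row per cell, keeping its position
flatten_table <- function(df, id) {
  m <- as.matrix(df)
  data.frame(table = id,
             row   = as.vector(row(m)),
             col   = as.vector(col(m)),
             text  = trimws(as.vector(m)),
             stringsAsFactors = FALSE)
}

cells <- do.call(rbind, Map(flatten_table, tables, seq_along(tables)))

# Drop empty and missing cells
cells <- cells[!is.na(cells$text) & nzchar(cells$text), ]
rownames(cells) <- NULL
print(cells)
```

Keeping the table, row and col positions alongside the text means the position of a cell can later be used as a feature, or to map a predicted category back to the source table.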

I was reading about RTextTools (http://www.rtexttools.com/), but in that case it has to be told that this value is for this text, and so on...
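
To illustrate what "being told" means here: supervised text classification needs some hand-labelled examples to learn from. The sketch below is a tiny bag-of-words naive Bayes classifier in base R with add-one smoothing. It is not RTextTools and not its API; the training texts, labels, and function names (tokenize, train_nb, predict_nb) are all made up for illustration.

```r
# Hypothetical hand-labelled examples
train <- data.frame(
  text  = c("invoice number 12345", "invoice date 2015-07-30",
            "customer name ACME", "customer address Berlin"),
  label = c("invoice", "invoice", "customer", "customer"),
  stringsAsFactors = FALSE)

# Lower-case a string and split it into alphanumeric tokens
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  toks[nzchar(toks)]
}

# Estimate class priors and smoothed per-class word probabilities
train_nb <- function(texts, labels) {
  toks    <- lapply(texts, tokenize)
  vocab   <- unique(unlist(toks))
  classes <- unique(labels)
  counts  <- sapply(classes, function(cl)
    as.vector(table(factor(unlist(toks[labels == cl]), levels = vocab))))
  rownames(counts) <- vocab
  prior <- sapply(classes, function(cl) mean(labels == cl))
  # Add-one smoothing: (count + 1) / (tokens in class + vocabulary size)
  lik <- t(t(counts + 1) / (colSums(counts) + length(vocab)))
  list(vocab = vocab, log_prior = log(prior), log_lik = log(lik))
}

# Pick the class with the highest log score; unseen words are ignored
predict_nb <- function(model, text) {
  toks  <- tokenize(text)
  toks  <- toks[toks %in% model$vocab]
  score <- model$log_prior + colSums(model$log_lik[toks, , drop = FALSE])
  names(which.max(score))
}

model <- train_nb(train$text, train$label)
predict_nb(model, "invoice total")
predict_nb(model, "customer ACME Ltd")
```

In practice the labelled set would have to be much larger, and without any labels only unsupervised approaches such as clustering remain.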

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri
