[R] TM reader with text

Sun Mar 4 15:12:43 CET 2012

On Saturday 3 March 2012 at 16:56 -0800, Mickael R problem wrote:
> Hello everybody,
> I'm not giving up the fight, but it's hard. I have found a solution for the
> ligatures with a better converter, which translates PDF to plain text more
> precisely. But a new problem has occurred. In French particularly, though it
> should be the case in English too, I have a big problem with ' " and
> brackets, which pollute the word counts. Currently, the removePunctuation
> function is not able to handle this "punctuation".
>
> The solution should be to write a more precise removePunctuation function
> that can remove any bracket. The problem is that brackets are not separated
> from the word by a space; normally they come just before or after the word.
> So removePunctuation treats the bracket as part of the word.
Sadly, removePunctuation() only handles English punctuation correctly.
In English, this problem never happens, or only with a trailing 's, which
does not really matter.

Try this before running removePunctuation():
corpus <- tm_map(corpus, function(x) gsub("[\'\U2019«»]", " ", x))

It will replace quotation marks with a space, and that's enough to
separate them from the rest of the word.
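
To check what the replacement does outside of tm, here is a minimal sketch in
base R. The sample sentence and the helper name split_marks() are made up for
illustration, and the guillemets are written as \u escapes only so that the
snippet does not depend on the file encoding.

```r
# Hypothetical helper: replace straight and curly apostrophes and French
# guillemets with a space, using a single character class.
split_marks <- function(x) gsub("[\'\u2019\u00ab\u00bb]", " ", x)

# Made-up sample text where the marks are glued to the words
x <- "L\u2019homme dit : \u00abbonjour\u00bb et s'en va"

cleaned <- split_marks(x)

# Split on runs of whitespace to see the resulting tokens
words <- strsplit(cleaned, "[[:space:]]+")[[1]]
print(words)
```

The remaining ":" token is ordinary punctuation, which removePunctuation()
should then take care of.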

> Another problem, less important, is that words are miscounted depending on
> whether or not they end in s, and so on. Maybe the TermDocumentMatrix
> function has an option to keep only the word itself, but I can't find it.
> 
> For the moment I handle this problem by hand. I open all the texts with Word
> and eliminate all the brackets with a small macro. But that's not easy with
> several hundred texts in my corpus.
> For plurals, I take the word with and without s and account for the
> difference myself. Fortunately, I only want to keep the 40 most meaningful
> words of the corpus.
> I know what kind of improvements could be made, but I'm just a user, not an
> engineer. I think small improvements could be made by the magical engineers
> who work for the community, as I modestly try to do with my comments.
This is called stemming, and it's implemented by the Snowball package.
You can do this with:
corpus <- tm_map(corpus, stemDocument, language="french")
(after installing Snowball)
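
To see what stemming does to singular and plural forms, here is a small
sketch on a plain character vector. It assumes the SnowballC package is
installed (an assumption on my side, it is a different package from the
Snowball one named above) and uses its wordStem() function. The two words are
only examples; both forms should end up with the same stem.

```r
# Assumption: SnowballC is installed; the block does nothing otherwise.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # Example singular and plural forms of the same French word
  forms <- c("maison", "maisons")

  # Apply the French Snowball stemmer to each word
  stems <- SnowballC::wordStem(forms, language = "french")
  print(stems)
}
```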

You can also try the GUI I'm currently writing to do that easily [1]. No
warranty it will work, but it usually does quite well, though it's still
in development. To install it:
install.packages("RcmdrPlugin.TextMiningSuite",
repos="http://R-Forge.R-project.org")

Hope this helps

1: https://r-forge.r-project.org/projects/rcmdr-tms/