[R] Help retrieving only Portuguese words from a file
b.rowlingson at lancaster.ac.uk
Tue May 28 18:12:14 CEST 2013
On Tue, May 28, 2013 at 5:02 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> And some words exist in Portuguese, Spanish and English, the three
> languages of the problem. For instance, "animal". I don't think this
> problem can be solved, but a dictionary search would tell if it is a
> Portuguese word, which it is.
Is there any structure to the text? If it has complete paragraphs in
one of the three languages then you can probably make a better guess
of the language of the paragraph from the presence of key words. I
wonder if some of the code for detecting spam can help you here...
Train it on some known Portuguese, Spanish, and English text...
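As a rough illustration of the spam-filter idea above, here is a minimal sketch in Python of a word-based naive Bayes classifier with add-one smoothing. Everything in it is an assumption made for illustration: the three one-sentence training samples are invented toy text, and the names TRAINING, tokenize, train and classify are hypothetical. A usable version would need far more training text per language.

```python
import math
import re
from collections import Counter

# Toy training text, invented for illustration only.
# Real use would need a much larger sample of each language.
TRAINING = {
    "pt": "o gato não está em casa e o cão também não está na rua hoje",
    "es": "el gato no está en casa y el perro tampoco está en la calle hoy",
    "en": "the cat is not at home and the dog is not in the street today",
}


def tokenize(text):
    """Lower-case the text and return its runs of letters (accents kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(samples):
    """Count word frequencies per language and collect the shared vocabulary."""
    counts = {lang: Counter(tokenize(text)) for lang, text in samples.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(text, counts, vocab):
    """Return (best_language, scores) using naive Bayes with equal priors.

    Words never seen in any training sample are skipped, because they
    carry no evidence for one language over another.
    """
    words = [w for w in tokenize(text) if w in vocab]
    scores = {}
    for lang, c in counts.items():
        total = sum(c.values())
        # Add-one (Laplace) smoothing so an unseen word never gives log(0).
        scores[lang] = sum(
            math.log((c[w] + 1) / (total + len(vocab))) for w in words
        )
    return max(scores, key=scores.get), scores


counts, vocab = train(TRAINING)
best, scores = classify("o cão está na rua", counts, vocab)
print(best, scores)
```

With a whole paragraph there are usually enough language-specific function words to separate the scores. A single word that the training data has never seen, or one shared by all three languages, leaves the scores tied or nearly so, and then the answer is arbitrary.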
If it's just a stream of words in one of the languages in a random
order then it is difficult or impossible.