[R] TM reader with text
Mickael R problem
clevenot.mickael at gmail.com
Thu Mar 1 16:07:15 CET 2012
Clearly there is a problem with Latin ligatures, because the result of my
call to findFreqTerms contains words such as:
"<U+FB01>n" "<U+FB01>nancier" "<U+FB01>nancière" "<U+FB01>nancières"
"<U+FB01>nanciers" "<U+FB01>xe"
where U+FB01 is the code point of the Latin small ligature fi. The problem
is well identified. Now, how can I treat it? The tau package seems to offer
a solution for text but not for a corpus.
Quotation from the tau documentation: "translate: Translate Unicode Latin
Ligatures
Description: Translate Unicode “Latin ligature” characters to their
respective constituents.
Usage: translate_Unicode_latin_ligatures(x)
Arguments: x, a character vector in UTF-8 encoding.
Details: In typography, a ligature occurs where two or more graphemes are
joined as a single glyph. (See
http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:
0132 LATIN CAPITAL LIGATURE IJ
0133 LATIN SMALL LIGATURE IJ
0152 LATIN CAPITAL LIGATURE OE
0153 LATIN SMALL LIGATURE OE
FB00 LATIN SMALL LIGATURE FF
FB01 LATIN SMALL LIGATURE FI
FB02 LATIN SMALL LIGATURE FL
FB03 LATIN SMALL LIGATURE FFI
FB04 LATIN SMALL LIGATURE FFL
FB05 LATIN SMALL LIGATURE LONG S T
FB06 LATIN SMALL LIGATURE ST
translate_Unicode_latin_ligatures translates these to their respective
constituents."
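
The table above can also be written out by hand for a plain character vector. This is a minimal base R sketch, not the tau implementation: the helper name expand_ligatures is hypothetical, and replacing U+FB05 (long s + t) with a plain "st" is an assumption made for text mining purposes.

```r
# Code points from the list above and the replacements chosen for them.
ligatures  <- c("\u0132", "\u0133", "\u0152", "\u0153",
                "\ufb00", "\ufb01", "\ufb02", "\ufb03",
                "\ufb04", "\ufb05", "\ufb06")
expansions <- c("IJ", "ij", "OE", "oe",
                "ff", "fi", "fl", "ffi",
                "ffl", "st", "st")   # "st" for U+FB05 is an assumption

# Hypothetical helper: replace each ligature in a UTF-8 character vector
# with its expansion, one code point at a time.
expand_ligatures <- function(x) {
  for (i in seq_along(ligatures)) {
    x <- gsub(ligatures[i], expansions[i], x, fixed = TRUE)
  }
  x
}

expand_ligatures(c("\ufb01xe", "\ufb01nancier"))
# "fixe" "financier"
```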
I need this kind of function for a corpus, not only for text or character
vectors. Any
Thanks for your comments and answers. We are making progress!
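
One possible way to get the same translation for a whole corpus is to map the tau function over the documents with tm_map. The following is only a sketch under assumptions: it uses the content_transformer wrapper found in recent tm versions (older versions may accept the function directly in tm_map), the sample strings are made up, and the code is skipped if tm or tau is not installed.

```r
# Sketch only: runs when both tm and tau are available.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tau", quietly = TRUE)) {

  # Made-up sample documents containing the U+FB01 ligature.
  docs <- c("situation \ufb01nanci\u00e8re", "taux \ufb01xe")
  corpus <- tm::VCorpus(tm::VectorSource(docs))

  # Wrap the character-vector function so that it acts on document content.
  fix_ligatures <- tm::content_transformer(
    tau::translate_Unicode_latin_ligatures
  )
  corpus <- tm::tm_map(corpus, fix_ligatures)

  # Pull the text back out to check that the ligature is gone.
  cleaned <- vapply(
    corpus,
    function(d) paste(as.character(d), collapse = " "),
    character(1)
  )
  print(cleaned)
}
```

Applying this step before building the term-document matrix should keep findFreqTerms from returning terms that start with <U+FB01>.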
View this message in context: http://r.789695.n4.nabble.com/TM-reader-with-text-tp4433394p4435229.html
Sent from the R help mailing list archive at Nabble.com.