[Rd] [R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Wed Apr 10 20:43:27 CEST 2013

On Wednesday, 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > Thanks for the reproducible example. Indeed, it does not work here
> > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > gsub() in removeWords: the pattern "\\b" does not match anything when
> > perl=TRUE. With perl=FALSE, it works.
> 
> The \b versus perl versus UTF-8 issue seems to be known, and it is
> advised to use perl = TRUE with \b. See e.g. the warning in the gsub
> help page (?gsub):
> 
> ---8<--------------------------------------------------------------------------
> Warning:
> 
> POSIX 1003.2 mode of ‘gsub’ and ‘gregexpr’ does not work correctly with
> repeated word-boundaries (e.g. ‘pattern = "\b"’).  Use ‘perl = TRUE’ for
> such matches (but that may not work as expected with non-ASCII inputs,
> as the meaning of ‘word’ is system-dependent).
> ---8<--------------------------------------------------------------------------
Thanks for the pointer. It led me to the PCRE_UCP (Unicode Character
Properties) flag, which makes \b and \w rely on Unicode properties, so
that non-ASCII letters and digits are treated as word characters
instead of creating word boundaries.

This flag should probably be used by R when calling pcre_compile() in
gsub() and friends. At the moment, R's behavior is inconsistent across
platforms:
- on Fedora 18, R 2.15.3:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"

- on Windows 2008, R 2.15.1 and 3.0.0:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"
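
For anyone who wants to check which of the two behaviors their own
setup shows, here is a minimal sketch. The function name
accent_is_word_char is made up for this message and is not part of R
or tm. It simply reruns the call above and reports whether the "t"
survived.

```r
# Hypothetical probe, not part of R or tm.
# TRUE  -> "\\b" treats "é" as a word character, so the "t" is kept
# FALSE -> "é" acts as a word boundary, so the "t" is removed
accent_is_word_char <- function() {
  x <- "t\u00e9l\u00e9gramme"  # "télégramme", escaped to avoid file-encoding issues
  gsub("\\bt\\b", "", x, perl = TRUE) == x
}

accent_is_word_char()  # the result depends on the platform, as shown above
```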

Luckily, the bug can be worked around within tm by adding (*UCP) at the
beginning of the pattern. With it, "който" is matched as a whole word,
and the "t" in "télégramme" is correctly left alone:

gsub(sprintf("\\b(%s)\\b", "който"), "", "който", perl=TRUE)
[1] "който"
gsub(sprintf("(*UCP)\\b(%s)\\b", "който"), "", "който", perl=TRUE)
[1] ""

gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"
gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"
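
To make the suggestion concrete, here is a rough sketch of what a
word-removal helper with the (*UCP) prefix could look like. It is
modeled on the sprintf() pattern used in the examples above, not
copied from tm; the name remove_words_ucp is hypothetical. It assumes
a PCRE build with Unicode property support and that the words contain
no regex metacharacters.

```r
# Hypothetical helper, not tm's actual removeWords().
# Removes each of `words` from `x` when it occurs as a whole word.
remove_words_ucp <- function(x, words) {
  # (*UCP) has to come first in the pattern; it makes \b Unicode-aware
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Escapes are used so the example does not depend on the file encoding
koyto <- "\u043a\u043e\u0439\u0442\u043e"        # "който"
remove_words_ucp(koyto, koyto)                   # whole word, gets removed
remove_words_ucp("t\u00e9l\u00e9gramme", "t")    # "t" is not a whole word, stays
```

Putting the prefix into the pattern string keeps the change local to
the one gsub() call, so nothing else in the calling code has to know
about it.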

Regards