[R] Comparing Latin characters with and without accents?

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Dec 15 18:19:37 CET 2014


On 15/12/2014 05:33, Spencer Graves wrote:
> Hello, All:
>
>
> 	  What do people do to strip accents from latin characters, returning vanilla ASCII?

I think the devil is in the detail here: what is Latin?  Latin-1 has 
characters for which this is unclear, let alone Latin-2 or Latin-7.

What I would do is

1) convert to UTF-8 with iconv()
2) convert to Unicode code points with utf8ToInt().
3) remap the Unicode characters with an integer lookup table tab[].
4) convert back to UTF-8, then to the desired encoding (or mark as UTF-8 
with Encoding()).
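
A minimal sketch of the four steps above, not a definitive implementation. The function name strip_accents() and the small table built here are assumptions made for illustration; the table covers only a handful of lower-case Latin-1 letters and leaves every other code point unchanged.

```r
# Hypothetical lookup table: the index is a Unicode code point and the value
# is the replacement code point. It starts as the identity mapping, so any
# character that is not remapped passes through unchanged.
tab <- seq_len(511L)
tab[224:229] <- utf8ToInt("a")   # a with grave, acute, circumflex, tilde, diaeresis, ring
tab[232:235] <- utf8ToInt("e")
tab[236:239] <- utf8ToInt("i")
tab[242:246] <- utf8ToInt("o")
tab[249:252] <- utf8ToInt("u")   # includes u-acute: tab[250] is 117

strip_accents <- function(x, from = "", tab) {
  # 1) convert to UTF-8 (from = "" means the current locale's encoding)
  x <- iconv(x, from, "UTF-8")
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    # 2) convert to Unicode code points
    cp <- utf8ToInt(s)
    # 3) remap the code points that fall inside the table
    inside <- cp >= 1L & cp <= length(tab)
    cp[inside] <- tab[cp[inside]]
    # 4) convert back to UTF-8
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

strip_accents("Ra\u00fal", from = "UTF-8", tab = tab)
```

The result is UTF-8; a further iconv(y, "UTF-8", to) call would take it to some other target encoding if that is what is wanted.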

As I suspect all the characters you do want to convert are in the first 
few blocks of Unicode, the lookup table can be small, maybe less than 
512 elements.  So for example ú is Unicode code point 250 (U+00FA) and 
the value of tab[250] should be 117 (the code point of u).  iconv() with 
transliteration might give you a good start for preparing that table.
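
One way that table might be drafted, as a sketch only. The range 160:511 and the rule of accepting only single ASCII letters are my assumptions; what "ASCII//TRANSLIT" returns differs between iconv implementations, so the draft needs checking by hand on whichever platform it is built.

```r
# Start from the identity mapping over the first 511 code points.
tab <- seq_len(511L)

for (cp in 160:511) {
  # Ask the platform's iconv for a transliteration of this one character.
  # Some implementations do not support //TRANSLIT at all, hence tryCatch.
  out <- tryCatch(
    iconv(intToUtf8(cp), "UTF-8", "ASCII//TRANSLIT"),
    error = function(e) NA_character_
  )
  # Keep the result only if it is exactly one ASCII letter; results such
  # as "?", "'u", "ae" or NA leave the entry as the identity.
  if (!is.na(out) && grepl("^[A-Za-z]$", out)) {
    tab[cp] <- utf8ToInt(out)
  }
}

# Entries that changed, for manual review before the table is used.
changed <- which(tab != seq_along(tab))
```

Entries the platform could not transliterate stay as the identity and can be filled in by hand.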

(Note that transliteration to two chars is often more acceptable/widely 
applicable. E.g. å to aa and ß to ss.)
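
An integer table maps one code point to one code point, so replacements of two characters need a different structure. A sketch, assuming a character vector keyed by code point; the names translit_multi() and map are hypothetical and the two entries are just the examples above.

```r
# Hypothetical one-to-many map: names are decimal code points,
# values are the replacement strings.
map <- c("229" = "aa",   # a-ring, U+00E5
         "223" = "ss")   # sharp s, U+00DF

translit_multi <- function(x, map) {
  keys <- as.integer(names(map))
  x <- enc2utf8(x)
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)
    # One string per character, so a replacement may be longer than one char.
    chars <- intToUtf8(cp, multiple = TRUE)
    hit <- match(cp, keys)
    chars[!is.na(hit)] <- unname(map[hit[!is.na(hit)]])
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

translit_multi(c("Stra\u00dfe", "sm\u00e5"), map)
```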

>
> 	  For example, I want to convert "Raúl" to "Raul".  Milan (below) suggested iconv(x, "", "ASCII//TRANSLIT").  This worked under Windows but failed on Linux and Mac.  It's part of the "subNonStandardCharacters" function in the Ecfun package.  The development version on R-Forge uses this and returns "Raul" under Windows and NA under Mac OS X (and something different from "Raul", presumably NA, under Linux).
>
>
> 	  Thanks,
> 	  Spencer
>
>
>> On Nov 30, 2014, at 2:32 AM, Spencer Graves <spencer.graves at structuremonitoring.com> wrote:
>>
>> Wonderful.  Thanks very much.  Spencer
>>
>>
>> On 11/30/2014 2:25 AM, Milan Bouchet-Valat wrote:
>>> Le dimanche 30 novembre 2014 à 02:14 -0800, Spencer Graves a écrit :
>>>> Hello:
>>>>
>>>>
>>>>         How can one convert Latin characters with accents to the corresponding
>>>> characters without?  For example, I want to convert "ú" to "u", similar
>>>> to how tolower('U') returns "u".
>>>>
>>>>
>>>>         This can be done using chartr{base}, e.g., chartr('ú', 'u',
>>>> 'Raúl') returns "Raul".  However, I wondered if a simpler version of
>>>> this is available.
>>> This appears to work:
>>>> iconv("ù", "", "ASCII//TRANSLIT")
>>> [1] "u"
>>>
>>>
>>> Regards
>>>
>>>>         Thanks,
>>>>         Spencer
>>>>
>>>>
>>>> p.s.   findFn('convert to ascii') found 117 help pages in 70 packages.
>>>> A brief review identified two to "Convert to ASCII": ASCIIfy {gtools}
>>>> and stri_enc_toascii {stringi}.  Neither of these did what I expected.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
