[R] How to remove non-UTF-8 characters from a string

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Oct 26 16:44:48 CEST 2007


That is not a well-defined concept.  To define 'character' you need to 
know the encoding, since that determines how to split the bytes into 
characters.  So only whole strings can be UTF-8 or not.  You can say which 
bytes in a stream of bytes would be valid in UTF-8, but if not all of them 
are, then it would almost certainly be incorrect to interpret any of them 
as UTF-8.
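
To illustrate the point about encodings, here is a minimal sketch (not part
of the original reply) in which the same two bytes count as one character or
two, depending on the encoding assumed.  The variable names are made up for
the example.

```r
# Two bytes with no encoding declared yet.
bytes <- as.raw(c(0xC3, 0xA9))

# Read as UTF-8: the two bytes form a single character (e-acute).
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"

# Read as Latin-1: each byte is its own character.
# Converting to UTF-8 keeps the character count independent of the locale.
as_latin1 <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

n_bytes  <- nchar(as_utf8, type = "bytes")    # size of the original input
n_utf8   <- nchar(as_utf8, type = "chars")    # characters under UTF-8
n_latin1 <- nchar(as_latin1, type = "chars")  # characters under Latin-1
```

The byte stream is identical in both cases; only the declared encoding
changes how it is split into characters.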

In a UTF-8 locale, you can find out whether a stream of bytes is valid 
UTF-8 by calling nchar(x, "c", allowNA=TRUE) and testing for NA elements 
in the result.
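
A minimal sketch of that check follows, with two additions that are not in
the original reply: validUTF8(), which exists only in later versions of R
(3.3.0 and up) and does not depend on the locale, and iconv() for repairing
the strings.  The sample vector x is invented for the example, and latin1 as
the real source encoding is an assumption; text from an Excel file may well
be in another encoding such as CP1252.

```r
x <- c(
  "plain ascii",
  rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xC3, 0xA9))),  # "cafe" with e-acute, UTF-8 bytes
  rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))         # same word, Latin-1 bytes
)

# The check from the post: in a UTF-8 locale, invalid elements give NA.
# In other locales this result will differ.
n <- nchar(x, "chars", allowNA = TRUE)
bad_by_nchar <- is.na(n)

# Locale-independent check (assumes R >= 3.3.0).
ok <- validUTF8(x)

# Dropping the invalid bytes is possible, but it silently loses the character.
dropped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Usually closer to what is wanted: re-encode the invalid elements from
# their real encoding (latin1 is assumed here) instead of deleting bytes.
fixed <- x
fixed[!ok] <- iconv(x[!ok], from = "latin1", to = "UTF-8")
```

Here the third element fails the validity check; dropping its bad byte
leaves "caf", whereas converting from the assumed source encoding keeps the
accented letter.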

On Fri, 26 Oct 2007, Bos, Roger wrote:

> All,
>
> I am trying to post text from an XLS spreadsheet to my wiki and I need to
> remove any characters that are not UTF-8.  Is there an easy gsub command
> that can do this?
>
> (I previously sent this same email to r-sig-gui.  That was a mistake and
> I apologize for the duplication.)
>
> Thanks, Roger J. Bos

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


