[R] Eliminating 'Unprintable ASCII' characters

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Nov 25 09:26:08 CET 2009


I think you mean the control characters: there are other unprintable 
characters (del, \177, for example).  The control characters are the 
range [\001-\037] in octal, i.e. codes 1 to 31.  E.g.

> test <- intToUtf8(1:40, multiple=TRUE)
> grepl("[\001-\037]", test)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE

If you want to include del, use "[\001-\037\177]".  I have omitted nul 
(\000) which cannot occur in R character strings.
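
To remove the characters rather than just detect them, the same 
character class can be passed to gsub().  A minimal sketch, assuming 
the words have already been read into a character vector (the vector 
x below is made up for illustration):

```r
# Hypothetical input: words containing control characters and del
x <- c("abc\001def", "tab\there", "del\177eted", "clean")

# Control characters \001-\037 plus del (\177), written as octal escapes
ctrl <- "[\001-\037\177]"

# Replace every match with the empty string
cleaned <- gsub(ctrl, "", x)
print(cleaned)
```

Note that the range includes tab (\011) and newline (\012); a class 
such as "[\001-\010\013-\037\177]" would leave those two alone.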

You didn't give us the sessionInfo() output the posting guide asked 
you for, so I am presuming you are not doing this in an unusual 
locale: I wouldn't trust the regexp code in one of the stateful 
locales used for Japanese.
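
For reference, the locale in use can be inspected along these lines. 
This is only a sketch of what to look at; it does not decide which 
locales are problematic for the regexp code:

```r
# Current character-type locale, as a single string
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports whether the locale is multibyte and whether it is UTF-8
info <- l10n_info()
print(info$MBCS)
print(info[["UTF-8"]])
```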

On Wed, 25 Nov 2009, Steven Kang wrote:

> Hi all,
>
> I have a csv file containing words with *UNPRINTABLE ASCII* characters
> (described in the following table).
>
> Is there any viable method for eliminating these characters?
>
> I realise that *EXTENDED ASCII* characters (i.e. ?, ?, ?, ? etc.) can be
> removed or replaced via the *"gsub"* or *"gregexpr"* functions, but I am
> not certain about the *UNPRINTABLE ASCII* characters.
>
> Your help in resolving this problem would be highly appreciated.
>
> Thanks
>
> Steven
>
> ASCII control characters (character code 0-31)
>
> The first 32 characters in the ASCII-table are unprintable control codes
> and are used to control peripherals such as printers.
>
> DEC  OCT  HEX  BIN       Symbol  Description
>   0  000  00   00000000  NUL     Null char
>   1  001  01   00000001  SOH     Start of Heading
>   2  002  02   00000010  STX     Start of Text
>   3  003  03   00000011  ETX     End of Text
>   4  004  04   00000100  EOT     End of Transmission
>   5  005  05   00000101  ENQ     Enquiry
>   6  006  06   00000110  ACK     Acknowledgment
>   7  007  07   00000111  BEL     Bell
>   8  010  08   00001000  BS      Back Space
>   9  011  09   00001001  HT      Horizontal Tab
>  10  012  0A   00001010  LF      Line Feed
>  11  013  0B   00001011  VT      Vertical Tab
>  12  014  0C   00001100  FF      Form Feed
>  13  015  0D   00001101  CR      Carriage Return
>  14  016  0E   00001110  SO      Shift Out / X-On
>  15  017  0F   00001111  SI      Shift In / X-Off
>  16  020  10   00010000  DLE     Data Line Escape
>  17  021  11   00010001  DC1     Device Control 1 (oft. XON)
>  18  022  12   00010010  DC2     Device Control 2
>  19  023  13   00010011  DC3     Device Control 3 (oft. XOFF)
>  20  024  14   00010100  DC4     Device Control 4
>  21  025  15   00010101  NAK     Negative Acknowledgement
>  22  026  16   00010110  SYN     Synchronous Idle
>  23  027  17   00010111  ETB     End of Transmit Block
>  24  030  18   00011000  CAN     Cancel
>  25  031  19   00011001  EM      End of Medium
>  26  032  1A   00011010  SUB     Substitute
>  27  033  1B   00011011  ESC     Escape
>  28  034  1C   00011100  FS      File Separator
>  29  035  1D   00011101  GS      Group Separator
>  30  036  1E   00011110  RS      Record Separator
>  31  037  1F   00011111  US      Unit Separator

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



