[Rd] Embedded nuls in strings

Herve Pages hpages at fhcrc.org
Wed Aug 8 00:29:16 CEST 2007


Duncan Murdoch wrote:
> On 07/08/2007 5:06 PM, Herve Pages wrote:
>> Hi,
>>
>> ?rawToChar
>>      'rawToChar' converts raw bytes either to a single character string
>>      or a character vector of single bytes.  (Note that a single
>>      character string could contain embedded nuls.)
>>
>> Allowing embedded nuls in a string might be an interesting experiment but it
>> seems to cause some troubles to most of the string manipulation functions.
>>
>> A string with an embedded 0:
>>
>>   raw0 <- as.raw(c(65:68, 0 , 70))
>>   string0 <- rawToChar(raw0)
>>
>> > string0
>> [1] "ABCD\0F"
>>
>> nchar() should return 6:
>> > nchar(string0)
>> [1] 4
> 
> You don't state your R version.  The default type of counting in nchar()
> has recently changed from "bytes" (where 6 is correct) to "chars" (where
> 4 is correct).


Oops, sorry:

> sessionInfo()
R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] rcompgen_0.1-15


And indeed:
  raw0 <- as.raw(c(65:68, 0 , 70))
  string0 <- rawToChar(raw0)

> nchar(string0, type="chars")
[1] 4
> nchar(string0, type="bytes")
[1] 6
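
For what it's worth, the byte count and the position of the nul can also be
read off the raw vector directly, without going through a string at all. This
is only a minimal sketch; the variable names n_bytes, nul_pos, raw_clean and
string_clean are mine and purely illustrative:

```r
# Sketch only: inspect the raw vector instead of the string built from it.
raw0 <- as.raw(c(65:68, 0, 70))

n_bytes <- length(raw0)                # total number of bytes, nul included
nul_pos <- which(raw0 == as.raw(0))    # positions of any embedded nuls
raw_clean <- raw0[raw0 != as.raw(0)]   # same bytes with the nuls dropped
string_clean <- rawToChar(raw_clean)   # no embedded nul left to trip over
```

Here n_bytes is 6, nul_pos is 5 and string_clean is "ABCDF", so none of the
string functions has to deal with the nul.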


In addition to the string functions already mentioned, it's worth noting that
'paste' doesn't seem to be "embedded nul aware" either:

> paste(string0, "G", sep="")
[1] "ABCDG"

Same for serialization:

> save(string0, file="string0.rda")
> load("string0.rda")
> string0
[1] "ABCD"
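
By contrast, I would expect the raw vector itself to survive a save/load round
trip, since no string is involved. A minimal sketch, assuming a scratch file
from tempfile() (tmp and raw0_saved are illustrative names only):

```r
# Sketch only: save and reload the raw vector rather than the string.
raw0 <- as.raw(c(65:68, 0, 70))
raw0_saved <- raw0                  # keep a copy to compare against

tmp <- tempfile()                   # scratch file, removed at the end
save(raw0, file = tmp)
rm(raw0)                            # make sure load() really restores it
load(tmp)

round_trip_ok <- identical(raw0, raw0_saved)  # TRUE if all 6 bytes came back
unlink(tmp)
```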

One comment about the nchar man page:
  'chars' The number of human-readable characters.

"Human-readable" seems to be used here to mean "everything but a nul", which can
be confusing. For example, one would generally not think of ASCII codes 1 to 31
as "human-readable", but nchar() seems to disagree:

> string1 <- rawToChar(as.raw(1:31))
> string1
[1]
"\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
> nchar(string1, type="chars")
[1] 31
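
To illustrate what I would call "human-readable" for plain ASCII, here is a
minimal sketch that counts only the printable codes 32 to 126 in a raw vector.
count_printable is a hypothetical helper of mine, not something nchar() offers,
and it deliberately ignores multibyte encodings:

```r
# Sketch only: count printable ASCII bytes (codes 32 to 126) in a raw vector.
# count_printable is a hypothetical helper, not part of base R.
count_printable <- function(bytes) {
  codes <- as.integer(bytes)
  sum(codes >= 32L & codes <= 126L)
}

n_ctrl <- count_printable(as.raw(1:31))               # control codes only
n_mixed <- count_printable(as.raw(c(65:68, 0, 70)))   # "ABCD", a nul, "F"
```

With this definition n_ctrl is 0 and n_mixed is 5, whereas
nchar(string1, type="chars") reports 31 above.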


Cheers,
H.


> 
> Duncan Murdoch
>


