[Rd] Embedded nuls in strings

Wed Aug 8 02:10:27 CEST 2007

On 07/08/2007 6:29 PM, Herve Pages wrote:
> Duncan Murdoch wrote:
>> On 07/08/2007 5:06 PM, Herve Pages wrote:
>>> Hi,
>>>
>>> ?rawToChar
>>>      'rawToChar' converts raw bytes either to a single character string
>>>      or a character vector of single bytes.  (Note that a single
>>>      character string could contain embedded nuls.)
>>>
>>> Allowing embedded nuls in a string might be an interesting experiment,
>>> but it seems to cause trouble for most of the string manipulation
>>> functions.
>>>
>>> A string with an embedded 0:
>>>
>>>   raw0 <- as.raw(c(65:68, 0 , 70))
>>>   string0 <- rawToChar(raw0)
>>>
>>>> string0
>>> [1] "ABCD\0F"
>>>
>>> nchar() should return 6:
>>>> nchar(string0)
>>> [1] 4
>> You don't state your R version.  The default type of counting in nchar()
>> has recently changed from "bytes" (where 6 is correct) to "chars" (where
>> 4 is correct).
> 
> 
> Oops, sorry:
> 
>> sessionInfo()
> R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
> x86_64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-15
> 
> 
> And indeed:
>   raw0 <- as.raw(c(65:68, 0 , 70))
>   string0 <- rawToChar(raw0)
> 
>> nchar(string0, type="chars")
> [1] 4
>> nchar(string0, type="bytes")
> [1] 6
> 
> 
> In addition to the string functions already mentioned, it's worth noting that
> 'paste' doesn't seem to be "embedded nul aware" either:
> 
>> paste(string0, "G", sep="")
> [1] "ABCDG"
> 
> Same for serialization:
> 
>> save(string0, file="string0.rda")
>> load("string0.rda")
>> string0
> [1] "ABCD"

Of these, I'd say serialization is the only case where it would be
reasonable to fix the behaviour.  R depends on C run-time functions for
most string operations, and they stop at a nul.  So if this
isn't documented behaviour, it should be, but it's not reasonable to
rewrite the C run-time string functions just to handle such weird
objects.  Functions like "grep" require thousands of lines of code, not
written by us, and in my opinion maintaining changes to that code is not
something the R project should take on.
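
To illustrate the point about nul-terminated C routines, here is a rough sketch in R. It is not R's actual internals. It only mimics what such a routine effectively sees, namely the bytes before the first nul. The raw vector is the one from the original report; `nul_pos` and `c_view` are names made up for this example.

```r
# Bytes from the original example: "ABCD", a nul, then "F".
raw0 <- as.raw(c(65:68, 0, 70))

# Position of the first nul byte (hypothetical variable name).
nul_pos <- which(raw0 == as.raw(0))[1]

# What a nul-terminated C routine would treat as the whole string:
# only the bytes before the first nul.
c_view <- rawToChar(raw0[seq_len(nul_pos - 1)])

length(raw0)                   # all 6 bytes are present in the raw vector
nchar(c_view, type = "bytes")  # bytes visible before the terminator
paste(c_view, "G", sep = "")   # same truncated result as in the report
```

Working on the raw vector keeps every byte available, because a raw vector carries an explicit length instead of relying on a terminator.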

As to serialization:  there's a comment in the source saying that embedded
nuls are handled by it, and that's true up to R-patched, but not in
R-devel.  It looks like someone has introduced a bug.
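
One possible workaround, offered only as a sketch and not as a statement about what the serialization code guarantees for character strings, is to keep such data as a raw vector. The example below round-trips the bytes from the original report through serialize() and unserialize(); `bytes` and `roundtrip` are illustrative names.

```r
# Bytes from the original example, kept as a raw vector rather than a string.
raw0 <- as.raw(c(65:68, 0, 70))

# Serialize to an in-memory raw vector (connection = NULL) and read it back.
bytes <- serialize(raw0, connection = NULL)
roundtrip <- unserialize(bytes)

# Check that the nul and the trailing byte both survive the round trip.
identical(roundtrip, raw0)
```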

Duncan Murdoch
> 
> One comment about the nchar man page:
>   'chars' The number of human-readable characters.
> 
> "human-readable" seems to mean "everything but a nul" here, which can be confusing.
> For example, one would generally not think of ASCII codes 1 to 31 as "human-readable", but
> nchar() seems to disagree:
> 
>> string1 <- rawToChar(as.raw(1:31))
>> string1
> [1]
> "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>> nchar(string1, type="chars")
> [1] 31

No, "human-readable" also has other meanings in multi-byte encodings. 
If an e-acute is encoded in two bytes in your locale, it still only 
counts as one human-readable character.
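
A minimal sketch of the e-acute case, assuming the UTF-8 encoding (where e-acute is the two bytes c3 a9). The string is built from its bytes and marked as UTF-8 explicitly so that the example does not depend on the session's locale; `x` is just an illustrative name.

```r
# Build an e-acute from its two UTF-8 bytes and declare the encoding.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

nchar(x, type = "bytes")  # counts the two bytes of the encoding
nchar(x, type = "chars")  # counts the single character
```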