[Rd] Embedded nuls in strings

Wed Aug 8 03:13:20 CEST 2007

Duncan Murdoch wrote:
> On 07/08/2007 6:29 PM, Herve Pages wrote:
[...]
>> Same for serialization:
>>
>> > save(string0, file="string0.rda")
>> > load("string0.rda")
>> > string0
>> [1] "ABCD"
> 
> Of these, I'd say the serialization is the only case where it would be
> reasonable to fix the behaviour.  R depends on C run-time functions for
> most of the string operations, and they'll stop at a null.  So if this
> isn't documented behaviour, it should be, but it's not reasonable to
> rewrite the C run-time string functions just to handle such weird
> objects.  Functions like "grep" require thousands of lines of code, not
> written by us, and in my opinion maintaining changes to it is not
> something the R project should take on.

I was not (of course) suggesting that all the string manipulation functions
be fixed. I'm just wondering why R would try to support embedded nuls in the
first place, given that they can only be a source of trouble.

What about this:

  > string0
  [1] "ABCD\0F"
  > string0 == "ABCD"
  [1] TRUE

string0 is obviously different from "ABCD"!

Maybe it's easier to change the semantics of rawToChar() so that it doesn't
return a string with embedded nuls. More generally speaking, base functions
should always return "clean" strings.
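
To make the idea concrete, here is a rough sketch of one possible "clean"
conversion. The name rawToCleanChar() is hypothetical (it is not a base R
function), and cutting the input at the first nul is only one of several
possible choices; it is simply what the C string functions would see anyway.

```r
# Hypothetical helper, not part of base R: convert a raw vector to a
# string, keeping only the bytes that come before the first nul.
rawToCleanChar <- function(x) {
    stopifnot(is.raw(x))
    nul_pos <- which(x == as.raw(0))
    if (length(nul_pos) > 0) {
        # Keep the bytes preceding the first nul (possibly none at all).
        x <- x[seq_len(nul_pos[1] - 1)]
    }
    rawToChar(x)
}

raw0 <- as.raw(c(65, 66, 67, 68, 0, 70))  # bytes of "ABCD", a nul, then "F"
clean0 <- rawToCleanChar(raw0)
clean0 == "ABCD"                # TRUE, and now legitimately so
nchar(clean0, type = "bytes")   # 4
```

With something like this, the result really is "ABCD", so the comparison
above no longer hides a difference. Signalling an error on the first nul
would be another reasonable choice.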

> 
> As to serialization:  there's a comment in the source that embedded
> nulls are handled by it, and that's true up to R-patched, but not in
> R-devel.  Looks like someone has introduced a bug.
> 
> Duncan Murdoch
>>
>> One comment about the nchar man page:
>>   'chars' The number of human-readable characters.
>>
>> "human-readable" seems to be used for "everything but a nul" here
>> which can be confusing.
>> For example one would generally think of ascii codes 1 to 31 as non
>> "human-readable" but
>> nchar() seems to disagree:
>>
>> > string1 <- rawToChar(as.raw(1:31))
>> > string1
>> [1]
>> "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>>
>> > nchar(string1, type="chars")
>> [1] 31
> 
> No, "human-readable" also has other meanings in multi-byte encodings. If
> an e-acute is encoded in two bytes in your locale, it still only counts
> as one human-readable character.
>
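
For reference, the multi-byte point can be checked with a small sketch. It
assumes the two bytes C3 A9 (the UTF-8 encoding of e-acute) and marks the
resulting string as UTF-8 explicitly so that it does not depend on the
locale; the variable names are mine.

```r
# Build e-acute from its two UTF-8 bytes and declare the encoding.
e_acute <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(e_acute) <- "UTF-8"

nchar(e_acute, type = "chars")   # 1: one character
nchar(e_acute, type = "bytes")   # 2: stored in two bytes

# Control characters are single-byte, so each one counts as one character.
ctrl <- rawToChar(as.raw(1:31))
nchar(ctrl, type = "chars")      # 31
```

So "chars" counts characters rather than bytes, whether or not a human
would call them readable.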