[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Paul Murrell p.murrell at auckland.ac.nz
Thu Mar 8 20:30:16 CET 2007


Hi


Luke Tierney wrote:
> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
> 
>> To follow up, I went ahead and generated "random" objects to scan for a
>> common header for a given R version, and it seems that at most
>> the first 18 bytes are not data-specific, which could be the length of
>> the serialization header.
>>
>> Here is my code for this:
>>
>> scanSerialize <- function(object, hdr=NULL, ...) {
>>  # Serialize object
>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>>
>>  # First run?
>>  if (is.null(hdr))
>>    return(raw);
>>
>>  # Find differences between current longest header and new raw vector
>>  n <- length(hdr);
>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>>
>>  # No differences?
>>  if (!any(diffs))
>>    return(hdr);
>>
>>  # Position of first difference
>>  idx <- which(diffs)[1];
>>
>>  # Keep common header
>>  hdr <- hdr[seq_len(idx-1)];
>>
>>  hdr;
>> };
>>
>> # Serialize a first "random" object
>> hdr <- scanSerialize(NA);
>> for (kk in 1:100)
>>  hdr <- scanSerialize(kk, hdr=hdr);
>> for (kk in 1:100) {
>>  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>>  hdr <- scanSerialize(x, hdr=hdr);
>> }
>> for (kk in 1:100) {
>>  hdr <- scanSerialize(kk, hdr=hdr);
>>  hdr <- scanSerialize(hdr, hdr=hdr);
>> }
>>
>> cat("Length:", length(hdr), "\n");
>> print(hdr);
>> print(rawToChar(hdr));
>>
>> On R v2.5.0 devel, this gives:
>> Length: 18
>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>> [1] "A\n2\n132352\n131840\n"
>>
>> However, it would still be good to get an "official" statement from
>> one in the R-code team about the serialization header and where the
>> data section starts.  Again, I want to cut out as much as possible for
>> consistency between R versions without losing data-dependent bytes.
> 
> An official, and definitive, statement from the _R-core_ team has been
> available to you all along at
> 
>  	https://svn.r-project.org/R/trunk/src/main/serialize.c


There's also a bit of info on this in Section 1.7 of the "R Internals"
Manual.

Paul


> My unofficial and non-definitive interpretation of that statement is
> that there is a header of four items,
> 
>      A format code 'A' or 'X' ('B' also possible in older formats)
>      version number of the format
>      Packed integer containing the R version that did the serializing
>      Packed integer containing the oldest R version that can read the format
> 
> You can see this if you look at the ascii version as text:
> 
>      > serialize(1, stdout(), ascii=TRUE)
>      A
>      2
>      132097
>      131840
>      14
>      1
>      1
>      NULL
>      > serialize(as.integer(1), stdout(), ascii=TRUE)
>      A
>      2
>      132097
>      131840
>      13
>      1
>      1
>      NULL
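
A minimal sketch of decoding the two packed integers shown above,
assuming the packing is major * 65536 + minor * 256 + patch.  That
assumption is consistent with the values Henrik reports further down
(132096, 132097 and 132352 for R 2.4.0, 2.4.1 and 2.5.0).  The function
name unpackRVersion is made up for illustration; it is not part of R.

```r
# Decode a packed R version integer.
# Assumed layout: major * 65536 + minor * 256 + patch.
unpackRVersion <- function(packed) {
  major <- packed %/% 65536L
  minor <- (packed %% 65536L) %/% 256L
  patch <- packed %% 256L
  paste(major, minor, patch, sep = ".")
}

unpackRVersion(132097L)  # third header field above:  "2.4.1"
unpackRVersion(131840L)  # fourth header field above: "2.3.0"
```
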
> 
> In the non-ascii 'X' (as in xdr) format this will constitute 14 bytes.
> In ascii format I believe it is currently 18 bytes but this could
> change with the version number of R -- I'd have to read the official
> and definitive statement to see how the integer packing is done and
> work out whether that could change the number of bytes. The number of
> bytes would also change if we reached format version 10, but something
> about the format would also change of course.  A safer way to look at
> the header in the ascii version is as the first four lines.
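
The "first four lines" view of the ascii header can be sketched as
below.  This is only an illustration under stated assumptions: the
helper name splitSerialized is hypothetical, and version = 2 is passed
explicitly on the assumption that the installed R supports that
argument and that format version 2 is the four-field layout described
above (newer format versions may carry additional header fields).

```r
# Split an ascii serialization into header and data by treating the
# first four newline-terminated fields as the header, rather than
# relying on a fixed byte count.
splitSerialized <- function(object) {
  bytes <- serialize(object, connection = NULL, ascii = TRUE, version = 2)
  newlines <- which(bytes == as.raw(0x0a))
  headerEnd <- newlines[4]            # end of the fourth header field
  list(header = bytes[seq_len(headerEnd)],
       data   = bytes[-seq_len(headerEnd)])
}

parts <- splitSerialized(1)
cat(rawToChar(parts$header))   # format code, format version, two packed versions
cat(rawToChar(parts$data))     # the part that depends only on the object
```

Only parts$data would then be compared or hashed across R versions.
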
> 
> Best,
> 
> luke
> 
>> Thanks
>>
>> /Henrik
>>
>> On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
>>> Hi,
>>>
>>> I noticed that serialize() gives different results depending on R
>>> version, which has implications for the digest() function in the digest
>>> package.  Note, it does give the same output across platforms.  I know
>>> that serialize() is under development, but is this expected, e.g. is
>>> there some kind of header in the result that specifies "who" generated
>>> the stream, and if so, exactly what bytes are they?
>>>
>>> SETUP:
>>>
>>> R versions:
>>> A) R v2.4.0 (2006-10-03)
>>> B) R v2.4.1pat (2007-01-13 r40470)
>>> C) R v2.5.0dev (2006-12-12 r40167)
>>>
>>> This is on WinXP and I start R with Rterm --vanilla.
>>>
>>> Example: Identical serialize() calls using the different R versions.
>>>
>>>> raw <- serialize(1, connection=NULL, ascii=TRUE)
>>>> print(raw)
>>> gives:
>>>
>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>>
>>> Note the difference in raw bytes 8 to 10, i.e.
>>>
>>>> raw[7:11]
>>> (A): [1] 32 30 39 36 0a
>>> (B): [1] 32 30 39 37 0a
>>> (C): [1] 32 33 35 32 0a
>>>
>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>>> about the R version or similar?  The following poor man's test says
>>> that is the only difference:
>>>
>>> On all R versions, the following gives identical results:
>>>
>>>> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>>>> raw <- as.integer(raw[-c(8:10)])
>>>> sum(raw)
>>> [1] 2147884
>>>> sum(log(raw))
>>> [1] 177201.2
>>>
>>> If it is true that there is an R-version-specific header in serialized
>>> objects, then the digest() function should exclude such a header in
>>> order to produce consistent results across R versions, because now
>>> digest(1) gives different results.
>>>
>>> Thank you
>>>
>>> Henrik
>>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> 

-- 
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
paul at stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/


