[Rd] Request: tools::md5sum should accept connections and finally in-memory objects

Dénes Tóth
Fri May 1 23:00:30 CEST 2020

AFAIK there is no hashing utility in base R which can create hash 
digests of arbitrary R objects. However, as also described by Henrik 
Bengtsson in [1], we have tools::md5sum() which calculates MD5 hashes of 
files. Calculating hashes of in-memory objects is a very common task in 
several areas, as demonstrated by the popularity of the 'digest' package 
(~850,000 downloads/month).

Upon inspecting the relevant files in the R source (e.g., [2] and 
[3]), it seems that all building blocks have already been implemented, 
so hashing need not be restricted to files. I would like to ask:

1) Why is md5_buffer unused?
In src/library/tools/src/md5.c [see 2], md5_buffer is implemented, 
which seems to be the counterpart of md5_stream for non-file inputs:

#ifdef UNUSED
/* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
    result is always in little endian byte order, so that a byte-wise
    output yields to the wanted ASCII representation of the message
    digest.  */
static void *
md5_buffer (const char *buffer, size_t len, void *resblock)
{
   struct md5_ctx ctx;

   /* Initialize the computation context.  */
   md5_init_ctx (&ctx);

   /* Process whole buffer but last len % 64 bytes.  */
   md5_process_bytes (buffer, len, &ctx);

   /* Put result in desired memory area.  */
   return md5_finish_ctx (&ctx, resblock);
}
#endif

2) How can the R-community help so that this feature becomes available 
in package 'tools'?

As a first step, it would be great if tools::md5sum supported 
connections (credit goes to Henrik for the idea). E.g., instead of the 
signature tools::md5sum(files), we could have 
tools::md5sum(files, conn = NULL), which would allow:

x <- runif(10)
tools::md5sum(conn = rawConnection(serialize(x, NULL)))
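
For comparison, here is a minimal sketch of how a digest of the same 
serialized bytes can be obtained today with the file-only interface, by 
going through a temporary file. The variable names 'tmp' and 'digest_x' 
are only illustrative:

```r
x <- runif(10)

# Write the serialized bytes of 'x' unchanged to a temporary file
tmp <- tempfile()
writeBin(serialize(x, NULL), tmp)

# tools::md5sum() returns a character vector named by file path;
# drop the name to keep only the digest
digest_x <- unname(tools::md5sum(tmp))
unlink(tmp)

digest_x
```

The digest covers the whole serialization, header included, so it is 
only comparable between sessions that serialize the object identically.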

To avoid the inconsistency between 'files' (which computes the hash 
digests in a vectorized manner, that is, one for each file) and 'conn' 
(which expects a single connection), and to make it easier to extend 
hashing to other algorithms without changing the main R interface, a 
more involved solution would be to introduce tools::hash and 
tools::hashes, in a similar vein to digest::digest and digest::getVDigest.
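
To make the scalar/vectorized split concrete, here is a rough sketch of 
what such a pair could look like at the R level. Both functions and the 
'algo' argument are hypothetical (nothing like them exists in 'tools'), 
and the temporary-file detour only stands in for a proper in-memory 
implementation:

```r
# Hypothetical sketch; neither function exists in 'tools'.
# hash(): digest of a single in-memory R object.
hash <- function(x, algo = c("md5")) {
  algo <- match.arg(algo)  # only MD5 is reachable via tools::md5sum()
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Hash the serialized representation of 'x' through a temporary file
  writeBin(serialize(x, NULL), tmp)
  unname(tools::md5sum(tmp))
}

# hashes(): vectorized counterpart, one digest per element of a list.
hashes <- function(xs, algo = c("md5")) {
  algo <- match.arg(algo)
  vapply(xs, hash, character(1), algo = algo)
}

hash(1:3)
hashes(list(a = 1:3, b = "text"))
```

Adding another algorithm would then only mean extending the choices of 
'algo', leaving the two signatures unchanged.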


[1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
