[Rd] Request: tools::md5sum should accept connections and finally in-memory objects

Dénes Tóth toth.denes using kogentum.hu
Fri May 1 23:35:04 CEST 2020

On 5/1/20 11:09 PM, John Mount wrote:
> Perhaps use the digest package? Isn't "R the R packages?"

I think it is clear that I am aware of the existence of the digest 
package and also of other packages with similar functionality, e.g. the 
fastdigest package. (And I actually do use digest, as I guess 99% of 
R developers do, at least as an indirect dependency.) The point is that
a) digest is a wonderful and very stable package, but still, it is a 
user-contributed package, whereas
b) 'tools' is a base package which is included by default in all R 
installations, and
c) tools::md5sum already exists, with almost all building blocks to 
allow its extension to calculate MD5 hashes of R objects, and
d) there is high demand in the R community for being able to calculate 
hashes of in-memory objects.

So yes, if one wants to use all the utilities or the various algos that 
the digest package provides, one should install and load it. But if one 
can live with MD5 hashes, why not use the built-in R function? (Well, 
without serializing an object to a file, calling tools::md5sum, and then 
cleaning up the file.)
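
For reference, the file-based workaround mentioned above might look 
roughly like the sketch below. The helper name md5_object is 
hypothetical (it is not part of 'tools'), and the result is the MD5 of 
the serialized bytes, so it depends on the serialization format and is 
not expected to match what other hashing packages return by default.

```r
# Hypothetical helper, not part of 'tools': MD5 hash of an R object,
# computed by writing its serialized form to a temporary file.
md5_object <- function(x) {
  tmp <- tempfile()
  # Remove the temporary file even if an error occurs below.
  on.exit(unlink(tmp), add = TRUE)
  # serialize(x, NULL) returns a raw vector; write it to disk as-is.
  writeBin(serialize(x, NULL), tmp)
  # tools::md5sum() returns a character vector named by file path;
  # drop the name so only the hash string is returned.
  unname(tools::md5sum(tmp))
}

x <- runif(10)
md5_object(x)
```

A connection-aware tools::md5sum would make the temporary file and the 
cleanup step unnecessary.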

>> On May 1, 2020, at 2:00 PM, Dénes Tóth <toth.denes using kogentum.hu> 
>> wrote:
>> AFAIK there is no hashing utility in base R which can create hash 
>> digests of arbitrary R objects. However, as also described by Henrik 
>> Bengtsson in [1], we have tools::md5sum() which calculates MD5 hashes 
>> of files. Calculating hashes of in-memory objects is a very common 
>> task in several areas, as demonstrated by the popularity of the 
>> 'digest' package (~850,000 downloads/month).
>> Upon inspection of the relevant files in the R source (e.g., [2] 
>> and [3]), it seems all building blocks have already been implemented 
>> so that hashing should not be restricted to files. I would like to ask:
>> 1) Why is md5_buffer unused?
>> In src/library/tools/src/md5.c [see 2], md5_buffer is implemented, 
>> which seems to be the counterpart of md5_stream for non-file inputs:
>> ---
>> #ifdef UNUSED
>> /* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
>>   result is always in little endian byte order, so that a byte-wise
>>   output yields to the wanted ASCII representation of the message
>>   digest.  */
>> static void *
>> md5_buffer (const char *buffer, size_t len, void *resblock)
>> {
>>  struct md5_ctx ctx;
>>  /* Initialize the computation context.  */
>>  md5_init_ctx (&ctx);
>>  /* Process whole buffer but last len % 64 bytes.  */
>>  md5_process_bytes (buffer, len, &ctx);
>>  /* Put result in desired memory area.  */
>>  return md5_finish_ctx (&ctx, resblock);
>> }
>> #endif
>> ---
>> 2) How can the R-community help so that this feature becomes available 
>> in package 'tools'?
>> Suggestions:
>> As a first step, it would be great if tools::md5sum supported 
>> connections (credit goes to Henrik for the idea). E.g., instead of the 
>> signature tools::md5sum(files), we could have tools::md5sum(files, 
>> conn = NULL), which would allow:
>> x <- runif(10)
>> tools::md5sum(conn = rawConnection(serialize(x, NULL)))
>> To avoid the inconsistency between 'files' (which computes the hash 
>> digests in a vectorized manner, that is, one for each file) and 'conn' 
>> (which expects a single connection), and to make it easier to extend 
>> the hashing for other algorithms without changing the main R 
>> interface, a more involved solution would be to introduce tools::hash 
>> and tools::hashes, in a similar vein to digest::digest and 
>> digest::getVDigest.
>> Regards,
>> Denes
>> [1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
>> [2]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/md5.c#L172
>> [3]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/Rmd5.c#L27
> ---------------
> John Mount
> http://www.win-vector.com/
> Our book: Practical Data Science with R
> http://practicaldatascience.com
