[Rd] Comments requested on "changedFiles" function

Thu Sep 5 23:13:57 CEST 2013

On Thu, Sep 5, 2013 at 6:48 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 13-09-04 11:36 PM, Scott Kostyshak wrote:
>>
>> On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com>
>> wrote:
>>>
>>> In a number of places internal to R, we need to know which files have
>>> changed (e.g. after building a vignette).  I've just written a general
>>> purpose function "changedFiles" that I'll probably commit to R-devel.
>>> Comments on the design (or bug reports) would be appreciated.
>>>
>>> The source for the function and the Rd page for it are inline below.
>>
>>
>> This looks like a useful function. Thanks for writing it. I have only
>> one (picky) comment below.
>>
>>> ----- changedFiles.R:
>>> changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
>>> file.info = NULL,
>>>               md5sum = FALSE, full.names = FALSE, ...) {
>>>      dosnapshot <- function(args) {
>>>          fullnames <- do.call(list.files, c(full.names = TRUE, args))
>>>          names <- do.call(list.files, c(full.names = full.names, args))
>>>          if (isTRUE(file.info) || (is.character(file.info) &&
>>> length(file.info))) {
>>>              info <- file.info(fullnames)
>>>          rownames(info) <- names
>>>              if (isTRUE(file.info))
>>>                  file.info <- c("size", "isdir", "mode", "mtime")
>>>          } else
>>>              info <- data.frame(row.names=names)
>>>      if (md5sum)
>>>          info <- data.frame(info, md5sum = tools::md5sum(fullnames))
>>>      list(info = info, timestamp = timestamp, file.info = file.info,
>>>           md5sum = md5sum, full.names = full.names, args = args)
>>>      }
>>>      if (missing(snapshot) || !inherits(snapshot,
>>> "changedFilesSnapshot")) {
>>>          if (length(timestamp) == 1)
>>>              file.create(timestamp)
>>>          if (missing(snapshot)) snapshot <- "."
>>>          pre <- dosnapshot(list(path = snapshot, ...))
>>>          pre$pre <- pre$info
>>>          pre$info <- NULL
>>>          pre$wd <- getwd()
>>>          class(pre) <- "changedFilesSnapshot"
>>>          return(pre)
>>>      }
>>>
>>>      if (missing(timestamp)) timestamp <- snapshot$timestamp
>>>      if (missing(file.info) || isTRUE(file.info)) file.info <-
>>> snapshot$file.info
>>>      if (identical(file.info, FALSE)) file.info <- NULL
>>>      if (missing(md5sum))    md5sum <- snapshot$md5sum
>>>      if (missing(full.names)) full.names <- snapshot$full.names
>>>
>>>      pre <- snapshot$pre
>>>      savewd <- getwd()
>>>      on.exit(setwd(savewd))
>>>      setwd(snapshot$wd)
>>>
>>>      args <- snapshot$args
>>>      newargs <- list(...)
>>>      args[names(newargs)] <- newargs
>>>      post <- dosnapshot(args)$info
>>>      prenames <- rownames(pre)
>>>      postnames <- rownames(post)
>>>
>>>      added <- setdiff(postnames, prenames)
>>>      deleted <- setdiff(prenames, postnames)
>>>      common <- intersect(prenames, postnames)
>>>
>>>      if (length(file.info)) {
>>>          preinfo <- pre[common, file.info]
>>>          postinfo <- post[common, file.info]
>>>          changes <- preinfo != postinfo
>>>      }
>>>      else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
>>>                             dimnames = list(common, character(0)))
>>>      if (length(timestamp))
>>>          changes <- cbind(changes, Newer = file_test("-nt", common,
>>> timestamp))
>>>      if (md5sum) {
>>>          premd5 <- pre[common, "md5sum"]
>>>          postmd5 <- post[common, "md5sum"]
>>>      changes <- cbind(changes, md5sum = premd5 != postmd5)
>>>      }
>>>      changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop =
>>> FALSE]
>>>      changed <- rownames(changes1)
>>>      structure(list(added = added, deleted = deleted, changed = changed,
>>>          unchanged = setdiff(common, changed), changes = changes), class
>>> =
>>> "changedFiles")
>>> }
>>>
>>> print.changedFilesSnapshot <- function(x, ...) {
>>>      cat("changedFiles snapshot:\n timestamp = \"", x$timestamp, "\"\n
>>> file.info = ",
>>>          if (length(x$file.info)) paste(paste0('"', x$file.info, '"'),
>>> collapse=","),
>>>          "\n md5sum = ", x$md5sum, "\n args = ", deparse(x$args, control
>>> =
>>> NULL), "\n", sep="")
>>>      x
>>> }
>>>
>>> print.changedFiles <- function(x, ...) {
>>>      if (length(x$added)) cat("Files added:\n",  paste0("  ", x$added,
>>> collapse="\n"), "\n", sep="")
>>>      if (length(x$deleted)) cat("Files deleted:\n",  paste0("  ",
>>> x$deleted,
>>> collapse="\n"), "\n", sep="")
>>>      changes <- x$changes
>>>      changes <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop=FALSE]
>>>      changes <- changes[, colSums(changes, na.rm = TRUE) > 0, drop=FALSE]
>>>      if (nrow(changes)) {
>>>          cat("Files changed:\n")
>>>          print(changes)
>>>      }
>>>      x
>>> }
>>> ----------------------
>>>
>>> --- changedFiles.Rd:
>>> \name{changedFiles}
>>> \alias{changedFiles}
>>> \alias{print.changedFiles}
>>> \alias{print.changedFilesSnapshot}
>>> \title{
>>> Detect which files have changed
>>> }
>>> \description{
>>> On the first call, \code{changedFiles} takes a snapshot of a selection of
>>> files.  In subsequent
>>> calls, it takes another snapshot, and returns an object containing data
>>> on
>>> the
>>> differences between the two snapshots.  The snapshots need not be the
>>> same
>>> directory;
>>> this could be used to compare two directories.
>>> }
>>> \usage{
>>> changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info =
>>> NULL,
>>>               md5sum = FALSE, full.names = FALSE, ...)
>>> }
>>> \arguments{
>>>    \item{snapshot}{
>>> The path to record, or a previous snapshot.  See the Details.
>>> }
>>>    \item{timestamp}{
>>> The name of a file to write at the time the initial snapshot
>>> is taken.  In subsequent calls, modification times of files will be
>>> compared
>>> to
>>> this file, and newer files will be reported as changed.  Set to
>>> \code{NULL}
>>> to skip this test.
>>> }
>>>    \item{file.info}{
>>> A vector of columns from the result of the \code{file.info} function, or
>>> a
>>> logical value.  If
>>> \code{TRUE}, columns \code{c("size", "isdir", "mode", "mtime")} will be
>>> used.  Set to
>>> \code{FALSE} or \code{NULL} to skip this test.  See the Details.
>>> }
>>>    \item{md5sum}{
>>> A logical value indicating whether MD5 summaries should be taken as part
>>> of
>>> the snapshot.
>>> }
>>>    \item{full.names}{
>>> A logical value indicating whether full names (as in
>>> \code{\link{list.files}}) should be
>>> recorded.
>>> }
>>>    \item{\dots}{
>>> Additional parameters to pass to \code{\link{list.files}} to control the
>>> set
>>> of files
>>> in the snapshots.
>>> }
>>> }
>>> \details{
>>> This function works in two modes.  If the \code{snapshot} argument is
>>> missing or is
>>> not of S3 class \code{"changedFilesSnapshot"}, it is used as the
>>> \code{path}
>>> argument
>>> to \code{\link{list.files}} to obtain a list of files.  If it is of class
>>> \code{"changedFilesSnapshot"}, then it is taken to be the baseline file
>>> and a new snapshot is taken and compared with it.  In the latter case,
>>> missing
>>> arguments default to match those from the initial snapshot.
>>>
>>> If the \code{timestamp} argument is length 1, a file with that name is
>>> created
>>> in the current directory during the initial snapshot, and
>>> \code{\link{file_test}}
>>> is used to compare the age of all files to it during subsequent calls.
>>>
>>> If the \code{file.info} argument is \code{TRUE} or it contains a
>>> non-empty
>>> character vector, the indicated columns from the result of a call to
>>> \code{\link{file.info}} will be recorded and compared.
>>>
>>> If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}} function
>>> will be called to record the 32 byte MD5 checksum for each file, and
>>> these
>>> values
>>> will be compared.
>>> }
>>> \value{
>>> In the initial snapshot phase, an object of class
>>> \code{"changedFilesSnapshot"} is returned.  This
>>> is a list containing the fields
>>> \item{pre}{a dataframe whose rownames are the filenames, and whose
>>> columns
>>> contain the
>>> requested snapshot data}
>>> \item{timestamp, file.info, md5sum, full.names}{a record of the arguments
>>> in
>>> the initial call}
>>> \item{args}{other arguments passed via \code{...} to
>>> \code{\link{list.files}}.}
>>>
>>> In the comparison phase, an object of class \code{"changedFiles"}. This
>>> is a
>>> list containing
>>> \item{added, deleted, changed, unchanged}{character vectors of filenames
>>> from the before
>>> and after snapshots, with obvious meanings}
>>> \item{changes}{a logical matrix with a row for each common file, and a
>>> column for each
>>> comparison test.  \code{TRUE} indicates a change in that test.}
>>>
>>> \code{\link{print}} methods are defined for each of these types. The
>>> \code{\link{print}} method for \code{"changedFilesSnapshot"} objects
>>> displays the arguments used to produce it, while the one for
>>> \code{"changedFiles"} displays the \code{added}, \code{deleted}
>>> and \code{changed} fields if non-empty, and a submatrix of the
>>> \code{changes}
>>> matrix containing all of the \code{TRUE} values.
>>> }
>>> \author{
>>> Duncan Murdoch
>>> }
>>> \seealso{
>>> \code{\link{file.info}}, \code{\link{file_test}}, \code{\link{md5sum}}.
>>> }
>>> \examples{
>>> # Create some files in a temporary directory
>>> dir <- tempfile()
>>> dir.create(dir)
>>
>>
>> Should a different name than 'dir' be used since 'dir' is a base function?
>
>
> Such as?

'dir_', 'dir1', 'temp_dir', none of which is a base function. I
thought that it was not recommended to create objects with the same
name as functions, but perhaps this recommended practice is not agreed
on.

>> Further, if someone is not very familiar with R (or just not in "R
>> mode" at the time of reading), they might think that 'dir.create' is
>> calling the create member of the object named 'dir' that you just
>> made.
>
>
> dir.create is an existing function.  I wouldn't have named it that, but
> that's its name.

I meant that if the object is called, e.g. 'temp_dir', one will not
think that 'dir.create' is a call to the 'create' member of 'dir'
because there is no 'dir' object apart from the base function. But
anyone with experience in R would know that this is not how R parses
'dir.create'.

In any case, I shouldn't waste your time on such a minor and subjective thing.

Scott

> Duncan Murdoch
>
>
>>
>> Scott
>>
>>> writeBin(1, file.path(dir, "file1"))
>>> writeBin(2, file.path(dir, "file2"))
>>> dir.create(file.path(dir, "dir"))
>>>
>>> # Take a snapshot
>>> snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
>>>
>>> # Change one of the files
>>> writeBin(3, file.path(dir, "file2"))
>>>
>>> # Display the detected changes
>>> changedFiles(snapshot)
>>> changedFiles(snapshot)$changes
>>> }
>>> \keyword{utilities}
>>> \keyword{file}
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>>
>> --
>> Scott Kostyshak
>> Economics PhD Candidate
>> Princeton University
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>

--
Scott Kostyshak
Economics PhD Candidate
Princeton University