[Rd] Comments requested on "changedFiles" function

Duncan Murdoch murdoch.duncan at gmail.com
Sun Sep 8 16:55:41 CEST 2013


On 13-09-06 11:07 PM, Karl Millar wrote:
> On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> On 13-09-06 9:21 PM, Karl Millar wrote:
>>>
>>> Hi Duncan,
>>>
>>> I like the interface of this version a lot better, but there's still a
>>> bunch of implementation details that need fixing:
>>>
>>> * As previously mentioned, there are important cases where the mtime
>>> values change in ways that this code doesn't detect.
>>> * If the timestamp file (which is usually in the temp directory) gets
>>> deleted (which can happen after a moderate amount of time of
>>> inactivity on some systems), then the file_test('-nt', ...) will
>>> always return false, even if the file has changed.
>>
>>
>> If that happened without user intervention, I think it would break other
>> things in R -- the temp directory is supposed to last for the whole session.
>> But I should be checking anyway.
>
> Yes, it does break other things in R -- my experience has been that
> the help system seems to be the one that is impacted the most by this.
>   FWIW, I've never seen the entire R temp directory deleted, just
> individual files and subdirectories in it, but even that probably
> depends on how the machine is configured.  I suspect only a few users
> ever notice this, but my R use is probably somewhat anomalous and I
> think it only happens to R sessions that I haven't used for a few
> days.

I use Windows and never see this; deleting temp files is up to me, not 
to the system.  But my understanding was the *nix systems should only 
clean up /tmp on restart, and I don't think an R session will survive a 
restart.

However, you have convinced me that the use of the timestamp file is not 
beneficial enough to be the default.  I'll leave it as an option, but 
add warnings that it might be unreliable.

>
>>> * If files get added or deleted between the two calls to list.files in
>>> fileSnapshot, it will fail with an error.
>>
>>
>> Yours won't work if path contains more than one directory.  This is probably
>> a reasonable restriction, but it's inconsistent with list.files, so I'd like
>> to avoid it if I can find a way.
>
> I'm currently unsure what the behaviour when comparing snapshots with
> multiple directories should be.
>
> Presumably we should have the property that (horribly abusing notation
> for succinctness):
>        compareSnapshots(c(a1, a2),  c(a1, a2))
> is the same as concatenating (in some form)
>        compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
> and there's a bunch of ways we could concatenate -- we could return a
> list of results, or a single result where each of the 'added, deleted,
> modified' fields are a list, or where we concatenate the 'added,
> deleted, modified' fields together into three simple vectors.
> Concatenating the vectors together like this is appealing, but unless
> you're using the full names, it doesn't include the information of
> which directory the changes are in, and using the full names doesn't
> work in the case where you're comparing different sets of directories,
> e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
> sensible choice for a full name.  The list options don't have this
> problem, but are harder to work with, particularly for the common case
> where there's only a single directory.  You'd also have to be somewhat
> careful with filenames that occur in both directories.
>
> Maybe I'm just being dense, but I don't see a way to do this thats
> clear, easy to use and wouldn't confuse users at the moment.

The way I've done this is to require full.names when multiple dirs are 
on the path.  I've reduced it to one list.files() call per dir, by 
iterating over the path variable and using your approach of calling it 
with full.names = FALSE, then adding the dir if necessary.

I haven't adopted your change that forces comparison of only size and 
mtime from file.info.  I don't see a big cost in storing whatever 
file.info returns (which is system dependent; on Windows I don't see the 
user and group related columns; on Unix I don't see the exe column).
Users might want to detect changes to anything there, and I shouldn't 
make it harder for them.

I've also kept the special-casing of md5sum; it really needs to be 
wrapped in suppressWarnings() (on Unix only).  And I've kept the options 
to specify what changedFiles checks among the file.info columns; I can 
see that you might want a snapshot with everything, but sometimes only 
want to be told about changes in a subset of the attributes.

I've uploaded 
<http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.1.tar.gz> if 
anyone is interested.

Duncan Murdoch

>
> Karl
>
>> Duncan Murdoch
>>
>>
>>> * If the path is on a remote file system, tempdir is local, and
>>> there's significant clock skew, then you can get incorrect results.
>>>
>>> Unfortunately, these aren't just theoretical scenarios -- I've had the
>>> misfortune to run up against all of them in the past.
>>>
>>> I've attached code that's loosely based on your implementation that
>>> solves these problems AFAICT.  Alternatively, Hadley's code handles
>>> all of these correctly, with the exception that compare_state doesn't
>>> handle the case where safe_digest returns NA very well.
>>>
>>> Regards,
>>>
>>> Karl
>>>
>>> On Fri, Sep 6, 2013 at 5:40 PM, Duncan Murdoch <murdoch.duncan at gmail.com>
>>> wrote:
>>>>
>>>> On 13-09-06 7:40 PM, Scott Kostyshak wrote:
>>>>>
>>>>>
>>>>> On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch
>>>>> <murdoch.duncan at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On 06/09/2013 2:20 PM, Duncan Murdoch wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have now put the code into a temporary package for testing; if
>>>>>>> anyone
>>>>>>> is interested, for a few days it will be downloadable from
>>>>>>>
>>>>>>> fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry, error in the URL.  It should be
>>>>>>
>>>>>> http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
>>>>>
>>>>>
>>>>>
>>>>> Works well. A couple of things I noticed:
>>>>>
>>>>> (1)
>>>>> md5sum is being called on directories, which causes warnings. (If this
>>>>> is not viewed as undesirable, please ignore the rest of this comment.)
>>>>> Should this be the responsibility of the user (by passing arguments to
>>>>> list.files)? In the example, changing
>>>>> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
>>>>> to
>>>>> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE,
>>>>> recursive=TRUE")
>>>>>
>>>>> gets rid of the warnings. But perhaps the user just wants to exclude
>>>>> directories for the md5sum calculations. This can't be controlled from
>>>>> fileSnapshot.
>>>>
>>>>
>>>>
>>>> I don't see the warnings, I just get NA values.  I'll try to see why
>>>> there's
>>>> a difference.  (One possibility is my platform (Windows); another is that
>>>> I'm generally testing in R-patched and R-devel rather than the 3.0.1
>>>> release
>>>> version.)  I would rather suppress the warnings than make the user avoid
>>>> them.
>>>>
>>>>
>>>>>
>>>>> Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
>>>>> or file.info to exclude directories (and then fill in the directories
>>>>> with NA)?
>>>>>
>>>>> (2)
>>>>> If I run example(changedFiles) several times, sometimes I get:
>>>>>
>>>>> chngdF> changedFiles(snapshot)
>>>>> File changes:
>>>>>          mtime md5sum
>>>>> file2  TRUE   TRUE
>>>>>
>>>>> and other times I get:
>>>>>
>>>>> chngdF> changedFiles(snapshot)
>>>>> File changes:
>>>>>          md5sum
>>>>> file2   TRUE
>>>>>
>>>>> I wonder why.
>>>>
>>>>
>>>>
>>>> Sometimes the example runs so quickly that the new version has exactly
>>>> the
>>>> same modification time as the original.  That's the risk of the mtime
>>>> check.
>>>> If you put a delay between, you'll get consistent results.
>>>>
>>>> Duncan Murdoch
>>>>
>>>>
>>>>>
>>>>> Scott
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>>
>>>>> R Under development (unstable) (2013-08-31 r63780)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] testpkg_1.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_3.1.0
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Scott Kostyshak
>>>>> Economics PhD Candidate
>>>>> Princeton University
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>



More information about the R-devel mailing list