[Rd] Very hard to reproduce bug (?) in R-devel

Mon Apr 3 17:34:48 CEST 2017

When running R CMD check on a package, we are encountering an error on
R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
reproduce, and almost any change to the code makes the error go away.
I believe that this is due to a bug in R-devel, which has been present
since at least commit 72128 (on 2017-02-06).

The test error occurs when R CMD check is run on Travis CI (on Ubuntu
12.04.05), and locally on R-devel (on Debian testing). If we make any
of the following changes, the error does not occur, and the test
passes:
- Run the test scripts from an interactive R session
- Run on a Mac instead of Linux
- Use R 3.3.3 or 3.2.5
- Add code right before the tests. For example, just running gc() or
format(1) right before the test causes it to not fail. However, adding
very simple code like 1+1 does not affect the test (it still fails).
- Run R CMD check with -d gdb or -d valgrind

Because the test is so sensitive to any changes, we've been unable to
track down the source of the problem. We can't be sure, but the error
message suggests that a promise is getting evaluated with the wrong
environment.

Here's the relevant output from the tests:
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
       scheduleTask(1000, function() {
           results <<- c(results, 3L)
       })
       scheduleTask(100, function() {
           results <<- c(results, 2L)
       })
       scheduleTask(10, function() {
           results <<- c(results, 1L)
       })
   }) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()
4: self$protect({
       task()
   })
5: private$refCount$release()
6: private$callback()
7: self$protect({
       task()
   })
8: force(expr)

You can see complete logs on Travis here:
  https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771

In the test, a re-entrant function is called (self$protect, at levels
4 and 7 on the call stack), and in the inner call (7), a variable,
`task`, does not have the correct value. It should be a function, but
it isn't; it's either NULL or some other non-function value.

That code is here:
  https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
Note that there's a `force(task)` a couple lines above, which is
commented out. If that is uncommented, the test passes, even though
calling force() there shouldn't make a difference. Also, calling
`format(1)` at that location also causes the test to pass.

I've written instructions to reproduce the problem in a Docker container here:
  https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f

If you want to try reproducing the error without using Docker, I
believe if you download pool_0.1.0.tar.gz file from the
gist.github.com URL above, and run R CMD check on it, that will do it.
Some upstream dependencies will likely need to be installed; they are
listed in the code in the gist.

We also used git bisect to try to find the first commit in R where the
test would fail. The instructions for doing this are also in the gist
linked to above. It found that the first commit where the test failed
was 72128:
  https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d

Taking a look at the changes in that commit, however, it seems very
unlikely that it caused the error. Instead, it probably just made a
change which allowed an already-existing bug to manifest.

We'd appreciate any help finding the cause of this bug!
-Winston