[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()

Henrik Bengtsson henrik.bengtsson using gmail.com
Thu Mar 28 05:20:36 CET 2019


Thank you Tomas.

For the record, I'm confirming that the stray background R worker
process now times out properly after 'setup_timeout' (= 120) seconds:

{0s}$ Rscript -e 'parallel::makeCluster(1L, port=80)'
Error in socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE,  :
  cannot open the connection
Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  port 80 cannot be opened
Execution halted
{1s}$ ps aux | grep -E "exec[/]R"
hb       17645  2.0  0.3 259104 55144 pts/5    S    20:58   0:00
/home/hb/software/R-devel/trunk/lib/R/bin/exec/R --slave --no-restore
-e parallel:::.slaveRSOCK() --args MASTER=localhost PORT=80
OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE
{2s}$ sleep 120
{122s}$ ps aux | grep -E "exec[/]R"
{122s}$

Good spotting of the bug:

-  if (Sys.time() - t0 > setup_timeout) break
+  if (difftime(Sys.time(), t0, units="secs") > setup_timeout) break
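
For context, that check sits inside the worker-side setup loop that
keeps retrying the connection back to the master until 'setup_timeout'
is exceeded; roughly like this (a simplified sketch, not the exact
snow.R code; names and the retry interval are illustrative):

connect_to_master <- function(master, port, setup_timeout, timeout) {
  t0 <- Sys.time()
  repeat {
    ## try to reach the master; on failure 'con' is the error condition
    con <- tryCatch(
      socketConnection(master, port = port, blocking = TRUE,
                       open = "a+b", timeout = timeout),
      error = identity)
    if (inherits(con, "connection")) return(con)
    ## the fixed check: compare the elapsed time explicitly in seconds
    if (difftime(Sys.time(), t0, units = "secs") > setup_timeout) break
    Sys.sleep(0.1)
  }
  stop("timed out setting up the connection to the master")
}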

For those who find this thread, I think what's going on here is that
'setup_timeout = 120' is a numeric that is compared to a 'difftime'
that keeps changing unit as time goes by.  When compared as
'Sys.time() - t0 > setup_timeout', the LHS is in units of seconds as
long as less than 60 seconds have passed:

> Sys.time() - t0
Time difference of 59 secs
> as.numeric(Sys.time() - t0)
[1] 59

However, as soon as more than 60 seconds has passed, the unit turns
into minutes and we're comparing minutes to seconds:

> Sys.time() - t0
Time difference of 1.016667 mins
> as.numeric(Sys.time() - t0)
[1] 1.016667

which is now compared to 'setup_timeout'.  If the unit remained
minutes, it would time out after 120 minutes.  However, after 120
minutes the unit of Sys.time() - t0 is hours, so we're comparing
hours to seconds, and so on.  The check would only ever trigger if
'setup_timeout' were less than 60 seconds.
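
To see the whole thing in a standalone snippet (faking a start time
't0' one hour in the past, i.e. well past a 120-second setup timeout):

> t0 <- Sys.time() - 3600   # pretend the setup loop started an hour ago
> setup_timeout <- 120      # seconds
> Sys.time() - t0 > setup_timeout
[1] FALSE
> difftime(Sys.time(), t0, units = "secs") > setup_timeout
[1] TRUE

The first comparison stays FALSE because the difftime prints as roughly
'1 hours', so the value 1 is compared to 120; the second compares
~3600 seconds to 120, as intended.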

/Henrik

On Wed, Mar 27, 2019 at 12:52 PM Tomas Kalibera
<tomas.kalibera using gmail.com> wrote:
>
>
> The problem causing the stray worker processes when the master fails to
> open a server socket to listen for connections from workers is not
> related to the timeout in socketConnection(), because socketConnection()
> will fail right away.  It is caused by a bug in checking the setup
> timeout (PR 17391).
>
> Fixed in 76275.
>
> Best
> Tomas
>
> On 3/18/19 2:23 AM, Henrik Bengtsson wrote:
> > (Bcc: CRAN)
> >
> > This is a proposal to help CRAN and similar services, as well as
> > individual developers, avoid the stray R processes that may be left
> > behind when an example or a package test fails to set up a
> > parallel::makeCluster().
> >
> >
> > ISSUE
> >
> > If a package test sets up a PSOCK cluster and then the master process
> > dies for one reason or another, the PSOCK worker processes will keep
> > running for 30 days (the 'timeout') before they time out and terminate
> > on their own.  When this happens on CRAN servers, where many packages
> > are checked all the time, it results in a lot of stray R processes.
> >
> > Here is an example illustrating how R leaves behind stray R processes
> > if it fails to establish a connection to one or more background R
> > processes launched by 'parallel::makeCluster()'.  First, let's make
> > sure there are no other R processes running:
> >
> >    $ ps aux | grep -E "exec[/]R"
> >
> > Then, let's create a PSOCK cluster for which the connection will fail
> > (because port 80 is reserved):
> >
> >    $ Rscript -e 'parallel::makeCluster(1L, port=80)'
> >    Error in socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE,  :
> >      cannot open the connection
> >    Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
> >    In addition: Warning message:
> >    In socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE,  :
> >      port 80 cannot be opened
> >
> > The launched R worker is still running:
> >
> >    $ ps aux | grep -E "exec[/]R"
> >    hb       20778 37.0  0.4 283092 70624 pts/0    S    17:50   0:00
> > /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
> > --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
> > TIMEOUT=2592000 XDR=TRUE
> >
> > This process will keep running for 'TIMEOUT=2592000' seconds (= 30
> > days).  The reason for this is that it is currently in the state where
> > it attempts to set up a connection to the main R process:
> >
> >    > parallel:::.slaveRSOCK
> >    function ()
> >    {
> >        makeSOCKmaster <- function(master, port, setup_timeout, timeout,
> >            useXDR) {
> >     ...
> >            repeat {
> >                con <- tryCatch({
> >                    socketConnection(master, port = port, blocking = TRUE,
> >                      open = "a+b", timeout = timeout)
> >                }, error = identity)
> >        ...
> >    }
> >
> > In other words, it is stuck in 'socketConnection()' and won't time out
> > until 'timeout' seconds have passed.
> >
> >
> > SUGGESTION
> >
> > To mitigate the problem of stray processes left behind by 'R CMD
> > check', we could shorten the 'timeout', which is currently hardcoded to
> > 30 days (src/library/parallel/R/snow.R).  By making it possible to
> > control the defaults via environment variables, e.g.
> >
> >    setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)),      # 2 minutes
> >    timeout       = as.numeric(Sys.getenv("R_PARALLEL_TIMEOUT", 60 * 60 * 24 * 30)), # 30 days
> >
> > it would be straightforward to adjust `R CMD check` to use, say,
> >
> >    R_PARALLEL_SETUP_TIMEOUT=60
> >
> > by default.  This would cause any stray processes to time out after 60
> > seconds (instead of 30 days as now).
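> >
> > As a usage sketch (hypothetical: it assumes a version of R in which
> > the proposed variables are honored by makeCluster()), an individual
> > developer could opt in the same way from within R:
> >
> >    > Sys.setenv(R_PARALLEL_SETUP_TIMEOUT = "60")  # proposed variable, not in R today
> >    > cl <- parallel::makeCluster(2L)
> >    > parallel::stopCluster(cl)
> >
> > If the cluster setup failed, any launched workers would then give up
> > after 60 seconds instead of lingering for 30 days.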
> >
> > /Henrik
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>


