[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson henrik.bengtsson at gmail.com
Fri Mar 9 06:58:53 CET 2018


A solution is to have parallel:::.slaveRSOCK() attempt to connect
multiple times before failing, e.g.

    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
= 10L, interval = 1.0) {
        port <- as.integer(port)
        for (i in seq_len(maxTries)) {
          con <- tryCatch({
            socketConnection(master, port = port, blocking = TRUE,
                                    open = "a+b", timeout = timeout)
          }, error = identity)
          if (inherits(con, "connection")) break
          Sys.sleep(interval)
        }
        if (inherits(con, "error")) stop(con)
        structure(list(con = con), class = if (useXDR)
            "SOCKnode"
        else "SOCK0node")
    }

One could set 'maxTries' and 'interval' via commandArgs() like what is
done for the other arguments.

I'm happy to submit an SVN patch if R core thinks this is an
acceptable solution.

/Henrik

On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
> socketConnection() as a character, i.e. there's a missing timeout <-
> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
> as.logical(value):
>
>> parallel:::.slaveRSOCK
> function ()
> {
>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>         port <- as.integer(port)
>         con <- socketConnection(master, port = port, blocking = TRUE,
>             open = "a+b", timeout = timeout)
>         structure(list(con = con), class = if (useXDR)
>             "SOCKnode"
>         else "SOCK0node")
>     }
>     master <- "localhost"
>     port <- NA_integer_
>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>     methods <- TRUE
>     useXDR <- TRUE
>     for (a in commandArgs(TRUE)) {
>         pos <- regexpr("=", a)
>         name <- substr(a, 1L, pos - 1L)
>         value <- substr(a, pos + 1L, nchar(a))
>         switch(name, MASTER = {
>             master <- value
>         }, PORT = {
>             port <- value
>         }, OUT = {
>             outfile <- value
>         }, TIMEOUT = {
>             timeout <- value
>         }, XDR = {
>             useXDR <- as.logical(value)
>         })
>     }
>     if (is.na(port))
>         stop("PORT must be specified")
>     sinkWorkerOutput(outfile)
>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>     cat(msg)
>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
> }
> <bytecode: 0x4bd4b58>
> <environment: namespace:parallel>
>
> Yet, fix that does *not* seem to change anything.
>
> /Henrik
>
> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
> <henrik.bengtsson at gmail.com> wrote:
>> BACKGROUND:
>>
>> While troubleshooting random, occasionally occurring, errors from
>> parallel::makePSOCKcluster("localhost", port = 11000);
>>
>> Error in socketConnection("localhost", port = port, server = TRUE,
>> blocking = TRUE, :
>>     cannot open the connection
>>
>> I had another look at parallel:::newPSOCKnode(), which is used
>> internally to set up each background worker.  It is designed to, first
>> launch the background worker as:
>>
>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>> wait = FALSE)
>>
>> which immediately tries to connect to a socket on localhost:11000 with
>> timeout.  Immediately after the master launched the above (without
>> waiting), it will set up the connection waiting for the connect from
>> the background worker:
>>
>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>
>>
>> ISSUE:
>>
>> If we emulate the above process, and remove the OUT=/dev/null such
>> that we can see the output produces by the worker, as:
>>
>> setup <- function(delay = 0) {
>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>   Sys.sleep(delay)
>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>> = TRUE, open = "a+b", timeout = 20)
>> }
>>
>> doing:
>>
>>> con <- setup(0)
>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>
>> will most likely work, but adding a delay:
>>
>>> con <- setup(5)
>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>> = "a+b",  :
>>   localhost:11000 cannot be opened
>> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
>>   cannot open the connection
>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>> socketConnection
>>
>> will produce an *instant* error on the worker, and before master opens
>> the server socket.  Eventually, master will produce the originally
>> observed error:
>>
>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>> blocking = TRUE,  :
>>   cannot open the connection
>>
>> In other words, if the master fails to setup socketConnection()
>> *before* the background workers attempts to connect, it all fails.
>> Such a delay may happen for instance when there is a large CPU load on
>> the test machine.
>>
>> Is this the above bug?
>>
>> /Henrik
>>
>> PS. The background is that I, very occasionally, observe R CMD check
>> error on the above (on CRAN and elsewhere) when testing my future
>> package. The error always go away when retested. This far I've though
>> this is due to port clashes (since the port is random selected in
>> [11000:11999]) and accepted that it happens.  However, after
>> discovering the above, it could be due to the worker launching "too
>> soon".
>>
>>> sessionInfo()
>> R version 3.4.3 (2017-11-30)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 16.04.4 LTS
>>
>> Matrix products: default
>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.3



More information about the R-devel mailing list