[Rd] mclapply returns NULLs on MacOS when running GAM

Jan Gorecki j.gorecki using wit.edu.pl
Wed Apr 29 13:03:31 CEST 2020


> PS. Simon, I think your explicit comment on mcparallel() & friends is
very helpful for many people and developers. It clearly tells
developers to never use mclapply() as the only path through their
code. I'm quite sure not everyone has been or is aware of this. Now
it's clear. Thank you.

I second that; IMO this should land somewhere in the manual.

On Wed, Apr 29, 2020 at 6:40 AM Henrik Bengtsson
<henrik.bengtsson using gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 9:00 PM Shian Su <su.s using wehi.edu.au> wrote:
> >
> > Thanks Simon,
> >
> > I will take note of the sensible default for core usage. I’m trying to achieve small-scale parallelism, where tasks take 1-5 seconds, and to make fuller use of consumer hardware. It’s not an HPC-worthy computation, but even laptops these days come with 4 cores and I don’t see a reason not to make use of them.
> >
> > The goal of the current piece of code I’m working on is to bootstrap many smoothing fits to generate prediction intervals; this is quite easy to write using mclapply. When you say native, with threads, OpenMP, etc…, are you referring to the C/C++ level? From my understanding, most parallel packages in R end up calling multicore or snow deep down.
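> >
> > For concreteness, a minimal sketch of that pattern (the data and helper names here are made up for illustration):
> >
> > library(mgcv)
> > library(parallel)
> >
> > ## toy data standing in for the real data set
> > df <- data.frame(x = 1:200, y = sin((1:200) / 20) + rnorm(200, sd = 0.1))
> >
> > ## one bootstrap replicate: resample rows, refit the smooth, predict
> > boot_fit <- function(i, data) {
> >   idx <- sample(nrow(data), replace = TRUE)
> >   fit <- gam(y ~ s(x, bs = "cs"), data = data[idx, ])
> >   predict(fit, newdata = data)
> > }
> >
> > ## 100 bootstrap fits across 2 cores; row-wise quantiles give rough bands
> > preds <- mclapply(1:100, boot_fit, data = df, mc.cores = 2L)
> > bands <- apply(do.call(cbind, preds), 1, quantile, probs = c(0.025, 0.975))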
> >
> > I think one of the great advantages of mclapply is that it defaults to lapply when running on a single thread, which makes it much easier to maintain code with optional parallelism. I’m already running into trouble with the fact that PSOCK doesn’t seem to retain loaded packages in spawned processes. I would love to know whether there are reliable options in R that offer a similar interface to mclapply but use a different, more RStudio-stable mode of parallelisation.
>
> If you use parLapply(cl, ...) and give the end-users control over
> the cluster 'cl' object (e.g. via an argument), then they have the
> option to choose from the different types of clusters that cl <-
> parallel::makeCluster(...) can create, notably PSOCK, FORK and MPI
> clusters, though the framework supports others.
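>
> For example, a minimal sketch of what that separation could look like (the function and argument names are only illustrative):
>
> library(parallel)
>
> ## package code: the user supplies 'cl'; with cl = NULL we fall back to lapply()
> fit_all <- function(xs, cl = NULL) {
>   if (is.null(cl)) {
>     lapply(xs, sqrt)
>   } else {
>     ## PSOCK workers start as fresh R sessions, so load needed packages first
>     clusterEvalQ(cl, library(stats))
>     parLapply(cl, xs, sqrt)
>   }
> }
>
> ## end-user code: the user decides how (and whether) to parallelize
> cl <- makeCluster(2L)    ## PSOCK by default; type = "FORK" is possible on Unix
> res <- fit_all(1:10, cl = cl)
> stopCluster(cl)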
>
> The 'foreach' framework takes this separation of *what* to parallelize
> (which you decide as a developer) and *how* to parallelize (which the
> end-user decides) further via so-called foreach adaptors, aka parallel
> backends.  With foreach, users have plenty of doNnn packages to pick
> from: doMC, doParallel, doMPI, doSNOW, doRedis, and doFuture.  Several
> of these parallel backends build on top of the core functions provided
> by the 'parallel' package.  So, with foreach, your users can use forked
> parallel processing if they want, or something else (selected at
> the top of their script).
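>
> A minimal sketch of that pattern using doParallel as the backend (any of the doNnn packages could be registered instead):
>
> library(foreach)
> library(doParallel)
>
> ## end user: pick and register the backend at the top of the script ...
> cl <- parallel::makeCluster(2L)
> registerDoParallel(cl)
>
> ## ... developer code only states *what* to run in parallel
> res <- foreach(i = 1:4, .combine = c) %dopar% {
>   i^2
> }
>
> parallel::stopCluster(cl)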
>
> (Disclaimer: I'm the author.) The 'future' framework tries to take this
> developer/end-user separation one step further, with a lower-level
> API - future(), value(), resolved() - for which different parallel
> backends have been implemented, e.g. multicore, multisession
> ("PSOCK"), cluster (any parallel::makeCluster() cluster), callr,
> batchtools (HPC job schedulers), etc.  All of these have been tested to
> conform to the Future API specs, so we know our parallel code works
> regardless of which of these backends the user picks.  On top of
> these basic low-level future functions, other higher-level APIs have
> been implemented.  For instance, the future.apply package provides
> futurized versions of all base R apply functions, e.g. future_lapply(),
> future_vapply(), future_Map(), etc.  You can basically take your
> lapply(...) code and replace it with future_lapply(...) and things
> will just work.  So, try replacing your current mclapply() with
> future_lapply().  If you/the user uses the 'multicore' backend - set
> by plan(multicore) at the top of the script - you'll get basically what
> mclapply() provides.  If plan(multisession) is used, then you basically
> get what parLapply() does.  The difference is that you don't have to
> worry about globals and packages.  If you like the foreach style of
> map-reduce, you can use futures via the doFuture backend.  If you like
> the purrr style of map-reduce, you can use the 'furrr' package.  So,
> and I'm obviously biased, if you pick the future framework, you'll
> leave yourself and your end-users with more options going forward.
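>
> A minimal sketch of that drop-in replacement (assuming the future and future.apply packages are installed):
>
> library(future)
> library(future.apply)
>
> ## the end user picks the backend once; plan(multicore) forks (not on
> ## Windows or in RStudio), plan(multisession) uses background R sessions,
> ## and plan(sequential) behaves like plain lapply()
> plan(multisession, workers = 2L)
>
> ## developer code: same shape as lapply()/mclapply(); globals and packages
> ## are identified and exported to the workers automatically
> res <- future_lapply(1:4, function(i) i^2)
>
> plan(sequential)  ## shut the background workers down again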
>
> Clear as mud?
>
> /Henrik
>
> PS. Simon, I think your explicit comment on mcparallel() & friends is
> very helpful for many people and developers. It clearly tells
> developers to never use mclapply() as the only path through their
> code. I'm quite sure not everyone has been or is aware of this. Now
> it's clear. Thank you.
>
> >
> > Thanks,
> > Shian
> >
> > > On 29 Apr 2020, at 1:33 pm, Simon Urbanek <simon.urbanek using R-project.org> wrote:
> > >
> > > Do NOT use mcparallel() in packages except as a non-default option that the user can set, for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and, more importantly, you don't know the resources available, so only the user can tell you when it's safe to use them. Multi-core machines are often shared, so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
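> > >
> > > One way to follow that advice in package code, sketched below (the function name is made up; the "mc.cores" option is the one parallel itself consults):
> > >
> > > ## default to sequential; the user opts in via options(mc.cores = 4)
> > > ## or by passing 'cores' explicitly -- never default to detectCores()
> > > run_fits <- function(xs, cores = getOption("mc.cores", 1L)) {
> > >   if (cores > 1L && .Platform$OS.type == "unix") {
> > >     parallel::mclapply(xs, sqrt, mc.cores = cores)
> > >   } else {
> > >     lapply(xs, sqrt)
> > >   }
> > > }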
> > >
> > > As for parallelism, it depends heavily on your use case. Native parallelism (threads, OpenMP, ...) is preferred, and I assume you're not talking about that as it is always the first option. Multicore works well in cases where there is no easy native solution and you need to share a lot of data for small results. If the data is small, or you need to read it first, then other methods like PSOCK may be preferable. In any case, parallelization only makes sense for code that you know will take a long time to run.
> > >
> > > Cheers,
> > > Simon
> > >
> > >
> > >> On 29/04/2020, at 11:54 AM, Shian Su <su.s using wehi.edu.au> wrote:
> > >>
> > >> Thanks Henrik,
> > >>
> > >> That clears things up significantly. I did see the warning but failed to include it in my initial email. It sounds like an RStudio issue, and it seems that it’s quite intrinsic to how forks interact with RStudio. Given this code is eventually going to be part of a package, should I expect it to fail mysteriously in RStudio for my users? Is the best solution here to migrate all my parallelism to PSOCK for the foreseeable future?
> > >>
> > >> Thanks,
> > >> Shian
> > >>
> > >>> On 29 Apr 2020, at 2:08 am, Henrik Bengtsson <henrik.bengtsson using gmail.com> wrote:
> > >>>
> > >>> Hi, a few comments below.
> > >>>
> > >>> First, from my experience and troubleshooting similar reports from
> > >>> others, a returned NULL from parallel::mclapply() is often because the
> > >>> corresponding child process crashed/died. However, when this happens
> > >>> you should see a warning, e.g.
> > >>>
> > >>>> y <- parallel::mclapply(1:2, FUN = function(x) if (x == 2) quit("no") else x)
> > >>> Warning message:
> > >>> In parallel::mclapply(1:2, FUN = function(x) if (x == 2) quit("no") else x) :
> > >>> scheduled core 2 did not deliver a result, all values of the job
> > >>> will be affected
> > >>>> str(y)
> > >>> List of 2
> > >>> $ : int 1
> > >>> $ : NULL
> > >>>
> > >>> This warning is produced on R 4.0.0 and R 3.6.2 on Linux, but I would
> > >>> assume that warning is also produced on macOS.  It's not clear from
> > >>> your message whether you also got that warning or not.
> > >>>
> > >>> Second, forked processing, as used by parallel::mclapply(), is advised
> > >>> against when using the RStudio Console [0].  Unfortunately, there's no
> > >>> way to disable forked processing in R [1].  You could add the
> > >>> following to your ~/.Rprofile startup file:
> > >>>
> > >>> ## Warn when forked processing is used in the RStudio Console
> > >>> if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM"))) {
> > >>>   invisible(trace(parallel:::mcfork, tracer = quote(warning(
> > >>>     "parallel::mcfork() was used. Note that forked processes, e.g.
> > >>>      parallel::mclapply(), may be unstable when used from the RStudio Console
> > >>>      [https://github.com/rstudio/rstudio/issues/2597#issuecomment-482187011]",
> > >>>     call. = FALSE))))
> > >>> }
> > >>>
> > >>> to detect when forked processing is used in the RStudio Console -
> > >>> either by you or by some package code that you use directly or
> > >>> indirectly.  You could even use stop() here if you want to be
> > >>> conservative.
> > >>>
> > >>> [0] https://github.com/rstudio/rstudio/issues/2597#issuecomment-482187011
> > >>> [1] https://stat.ethz.ch/pipermail/r-devel/2020-January/078896.html
> > >>>
> > >>> /Henrik
> > >>>
> > >>> On Tue, Apr 28, 2020 at 2:39 AM Shian Su <su.s using wehi.edu.au> wrote:
> > >>>>
> > >>>> Yes, I am running RStudio 1.2.5033. I was also running this code without error on Ubuntu in RStudio. Checking again in the terminal, it does indeed work fine even with large data.frames.
> > >>>>
> > >>>> Any idea as to what interaction between RStudio and mclapply causes this?
> > >>>>
> > >>>> Thanks,
> > >>>> Shian
> > >>>>
> > >>>> On 28 Apr 2020, at 7:29 pm, Simon Urbanek <simon.urbanek using R-project.org<mailto:simon.urbanek using R-project.org>> wrote:
> > >>>>
> > >>>> Sorry, the code works perfectly fine for me in R even for 1e6 observations (but I was testing with R 4.0.0). Are you using some kind of GUI?
> > >>>>
> > >>>> Cheers,
> > >>>> Simon
> > >>>>
> > >>>>
> > >>>> On 28/04/2020, at 8:11 PM, Shian Su <su.s using wehi.edu.au<mailto:su.s using wehi.edu.au>> wrote:
> > >>>>
> > >>>> Dear R-devel,
> > >>>>
> > >>>> I am experiencing issues running GAM models with mclapply: it fails to return any values if the data input becomes large. For example, the code below runs fine with a df of 100 rows, but fails at 1000.
> > >>>>
> > >>>> library(mgcv)
> > >>>> library(parallel)
> > >>>>
> > >>>> df <- data.frame(
> > >>>> +     x = 1:100,
> > >>>> +     y = 1:100
> > >>>> + )
> > >>>>
> > >>>> mclapply(1:2, function(i, df) {
> > >>>> +         fit <- gam(y ~ s(x, bs = "cs"), data = df)
> > >>>> +     },
> > >>>> +     df = df,
> > >>>> +     mc.cores = 2L
> > >>>> + )
> > >>>> [[1]]
> > >>>>
> > >>>> Family: gaussian
> > >>>> Link function: identity
> > >>>>
> > >>>> Formula:
> > >>>> y ~ s(x, bs = "cs")
> > >>>>
> > >>>> Estimated degrees of freedom:
> > >>>> 9  total = 10
> > >>>>
> > >>>> GCV score: 0
> > >>>>
> > >>>> [[2]]
> > >>>>
> > >>>> Family: gaussian
> > >>>> Link function: identity
> > >>>>
> > >>>> Formula:
> > >>>> y ~ s(x, bs = "cs")
> > >>>>
> > >>>> Estimated degrees of freedom:
> > >>>> 9  total = 10
> > >>>>
> > >>>> GCV score: 0
> > >>>>
> > >>>>
> > >>>>
> > >>>> df <- data.frame(
> > >>>> +     x = 1:1000,
> > >>>> +     y = 1:1000
> > >>>> + )
> > >>>>
> > >>>> mclapply(1:2, function(i, df) {
> > >>>> +         fit <- gam(y ~ s(x, bs = "cs"), data = df)
> > >>>> +     },
> > >>>> +     df = df,
> > >>>> +     mc.cores = 2L
> > >>>> + )
> > >>>> [[1]]
> > >>>> NULL
> > >>>>
> > >>>> [[2]]
> > >>>> NULL
> > >>>>
> > >>>> There is no error message returned, and the code runs perfectly fine with lapply.
> > >>>>
> > >>>> I am on a MacBook 15 (2016) running macOS 10.14.6 (Mojave) and R version 3.6.2. This bug could not be reproduced on my Ubuntu 19.10 machine running R 3.6.1.
> > >>>>
> > >>>> Kind regards,
> > >>>> Shian Su
> > >>>> ----
> > >>>> Shian Su
> > >>>> PhD Student, Ritchie Lab 6W, Epigenetics and Development
> > >>>> Walter & Eliza Hall Institute of Medical Research
> > >>>> 1G Royal Parade, Parkville VIC 3052, Australia
> > >>>>
> > >>>>
> > >>
> > >
> >
>


