[Rd] mclapply memory leak?

Toby Hocking tdhock5 at gmail.com
Fri Sep 4 15:14:36 CEST 2015


Thanks for the detailed analysis Simon. I figured out a workaround that
seems to be working in my real application. By limiting the length of the
first argument to mclapply (to the number of cores), I get speedups while
limiting the memory overhead.

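### Run mclapply inside of a for loop, ensuring that it never receives
### a first argument with a length more than maxjobs. This avoids some
### memory problems (swapping, or getting jobs killed on the cluster)
### when using mclapply(1:N, FUN) where N is large.
library(parallel)
### The default of 2L matches mclapply's own default for mc.cores, so
### maxjobs is never NULL when the mc.cores option is unset.
maxjobs.mclapply <- function(X, FUN, maxjobs=getOption("mc.cores", 2L)){
  N <- length(X)
  ## splitIndices takes the number of chunks as its second argument;
  ## ceiling makes that a whole number, with at most maxjobs indices
  ## per chunk.
  i.list <- splitIndices(N, ceiling(N/maxjobs))
  result.list <- vector("list", N)
  for(i in seq_along(i.list)){
    i.vec <- i.list[[i]]
    result.list[i.vec] <- mclapply(X[i.vec], FUN)
  }
  result.list
}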
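
As a quick sanity check of the chunking idea, here is a small self-contained sketch. The values of N and maxjobs, the squaring function, and the use of 2 cores are assumptions chosen only for illustration. It prints the chunks produced by splitIndices and confirms that processing one chunk at a time gives the same list as plain lapply.

```r
library(parallel)

N <- 10
maxjobs <- 4  # hypothetical value; in practice e.g. getOption("mc.cores")

## The second argument of splitIndices is the number of chunks, so
## ceiling(N/maxjobs) chunks keeps each chunk at maxjobs indices or fewer.
i.list <- splitIndices(N, ceiling(N/maxjobs))
print(i.list)

## Forking is unavailable on Windows, where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

## Process one chunk at a time and compare with plain lapply.
X <- 1:N
FUN <- function(i) i^2
result.list <- vector("list", N)
for (i.vec in i.list) {
  result.list[i.vec] <- mclapply(X[i.vec], FUN, mc.cores = cores)
}
identical(result.list, lapply(X, FUN))
```

Only one chunk of results is in flight at any time, which is what should keep the bookkeeping overhead bounded by maxjobs rather than by N.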


On Thu, Sep 3, 2015 at 5:27 PM, Simon Urbanek <simon.urbanek at r-project.org>
wrote:

> Toby,
>
> > On Sep 2, 2015, at 1:12 PM, Toby Hocking <tdhock5 at gmail.com> wrote:
> >
> > Dear R-devel,
> >
> > I am running mclapply with many iterations over a function that modifies
> > nothing and makes no copies of anything. It is taking up a lot of memory,
> > so it seems to me like this is a bug. Should I post this to
> > bugs.r-project.org?
> >
> > A minimal reproducible example can be obtained by first starting a memory
> > monitoring program such as htop, and then executing the following code
> > while looking at how much memory is being used by the system
> >
> > library(parallel)
> > seconds <- 5
> > N <- 100000
> > result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))
> >
> > On my system, memory usage goes up about 60MB on this example. But it does
> > not go up at all if I change mclapply to lapply. Is this a bug?
> >
> > For a more detailed discussion with a figure that shows that the memory
> > overhead is linear in N, please see
> > https://github.com/tdhock/mclapply-memory
> >
>
>
> I'm not quite sure what is supposed to be the issue here. One would expect
> the memory used to be linear in the number of elements you process - by
> definition of the task, since you'll be creating linearly many more objects.
>
> Also using top doesn't actually measure the memory used by R itself (see
> FAQ 7.42).
>
> That said, I re-ran your script and it didn't look anything like what you
> have on your webpage.  For the NULL result you end up dealing with all the
> objects you create in your test, which overshadow any memory usage, and the
> usage stabilizes after garbage collection. As you would expect, any output
> of top is essentially bogus up to a gc. How much memory R will use is
> essentially governed by the level at which you set the gc trigger. In the
> real world you actually want that to be fairly high if you can afford it (in
> gigabytes, not megabytes), because you often get much higher performance by
> delaying gcs if you don't have low total memory (essentially using the
> memory as a buffer). Given that the usage is so negligible, it won't trigger
> any gc on its own, so you're just measuring accumulated objects - which will
> always be higher for mclapply because of the bookkeeping and serialization
> involved in the communication.
>
> The real difference is only in the df case. The reason for it is that your
> lapply() there is simply a no-op, because R is smart enough to realize that
> you are always returning the same object so it won't actually create
> anything and just return a reference back to df - thus using no memory at
> all. However, once you split the inputs, your main session can no longer
> perform this optimization because the processing is now in a separate
> process, so it has no way of knowing that you are returning the object
> unmodified. So what you are measuring is a special case that is arguably
> not really relevant in real applications.
>
> Cheers,
> Simon
>
>
>
> >> sessionInfo()
> > R version 3.2.2 (2015-08-14)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu precise (12.04.5 LTS)
> >
> > locale:
> > [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
> > [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_CA.UTF-8
> > [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_CA.UTF-8
> > [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> > [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] parallel  graphics  utils     datasets  stats     grDevices methods
> > [8] base
> >
> > other attached packages:
> > [1] ggplot2_1.0.1      RColorBrewer_1.0-5 lattice_0.20-33
> >
> > loaded via a namespace (and not attached):
> > [1] Rcpp_0.11.6             digest_0.6.4            MASS_7.3-43
> > [4] grid_3.2.2              plyr_1.8.1              gtable_0.1.2
> > [7] scales_0.2.3            reshape2_1.2.2          proto_1.0.0
> > [10] labeling_0.2            tools_3.2.2             stringr_0.6.2
> > [13] dichromat_2.0-0         munsell_0.4.2           PeakSegJoint_2015.08.06
> > [16] compiler_3.2.2          colorspace_1.2-4
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
>
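
Simon's point about top versus gc can be checked from inside R. What follows is only a rough sketch under stated assumptions: used.mb is a helper name made up here, N is smaller than in the original example to keep it fast, and 2 cores are assumed. It reports memory used by R objects after a garbage collection, before and after each call, instead of reading the numbers from top or htop.

```r
library(parallel)

## Hypothetical helper: Mb currently used by R objects after a
## garbage collection (second column of the matrix returned by gc()).
used.mb <- function() sum(gc()[, 2])

## Forking is unavailable on Windows, where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

N <- 10000  # smaller than the N in the original example
f <- function(i) NULL

before <- used.mb()
lapply.result <- lapply(1:N, f)
after.lapply <- used.mb()
mclapply.result <- mclapply(1:N, f, mc.cores = cores)
after.mclapply <- used.mb()

## Growth in Mb attributable to each result, measured after gc.
c(lapply = after.lapply - before,
  mclapply = after.mclapply - after.lapply)
```

Because each measurement is taken right after a gc, it counts only objects that are still reachable, not garbage that has yet to be collected.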
