[Rd] parallel::pvec FUN types differ when v is a list; code simplifications?

Fri Oct 26 06:35:59 CEST 2012

In pvec(list(1, 2), FUN, mc.cores=2) FUN sees integer() arguments whereas 
pvec(list(1, 2, 3), FUN, mc.cores=2) FUN sees list() arguments; the latter seems 
consistent with pvec's description.

This came up in a complicated Bioconductor thread about generics and parallel 
evaluation

   https://stat.ethz.ch/pipermail/bioc-devel/2012-October/003745.html

One relevant point is that a light-weight re-write of parallel/R/unix/pvec.R 
(below) appears to avoid the need for an object v to satisfy

   is.vector
   as.list
   [

and instead only requires

   [

This can be important when as.list() forces an inefficient representation of an 
object that can nonetheless be subset with [. It also seems like the code below 
will be more space efficient, since v is not split into a second object in the 
parent but only subset in the children?

Index: pvec.R
===================================================================

--- pvec.R	(revision 61012)
+++ pvec.R	(working copy)
@@ -21,8 +21,6 @@
   pvec <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
                    mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
   {
-    if (!is.vector(v)) stop("'v' must be a vector")
-
       env <- parent.frame()
       cores <- as.integer(mc.cores)
       if(cores < 1L) stop("'mc.cores' must be >= 1")
@@ -31,16 +29,7 @@
       if(mc.set.seed) mc.reset.stream()

       n <- length(v)
-    l <- if (n <= cores) as.list(v) else {
-        ## compute the scheduling, making it as fair as possible
-        il <- as.integer(n / cores)
-        xc <- n - il * cores
-        sl <- rep(il, cores)
-        if (xc) sl[1:xc] <- il + 1L
-        si <- cumsum(c(1L, sl))
-        se <- si + c(sl, 0L) - 1L
-        lapply(seq_len(cores), function(ix) v[si[ix]:se[ix]])
-    }
+    si <- splitIndices(n, cores)
       jobs <- NULL
       cleanup <- function() {
           ## kill children if cleanup is requested
@@ -59,8 +48,8 @@
       on.exit(cleanup())
       FUN <- match.fun(FUN)
       ## may have more cores than tasks ....
-    jobs <- lapply(seq_len(min(n, cores)),
-                   function(i) mcparallel(FUN(l[[i]], ...), name = i,
+    jobs <- lapply(si,
+                   function(i) mcparallel(FUN(v[i], ...),
                                             mc.set.seed = mc.set.seed,
                                             silent = mc.silent))
       res <- mccollect(jobs)


On 10/24/2012 05:07 PM, Cook, Malcolm wrote:
> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
>
>> I agree that it would fruitful to have parLapply in BiocGenerics. It looks
>> to be a flexible abstraction and its presence in the parallel package
>> makes
>> it ubiquitous. If it hasn't been done already, mclapply (and mcmapply)
>> would be good candidates, as well. The fork-based parallelism is
>> substantively different in terms of the API from the more general
>> parallelism of parLapply.
>>
>> Someone was working on some more robust and convenient wrappers around
>> mclapply. Did that ever see the light of day?
>
>
> If you are referring to
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>
> in which I had offered some small changes to parallel::pvec
>
> 	https://gist.github.com/3757873/
>
> and after which Martin had provided me with a route I have not (yet?)
> followed toward submitting a patch to R for consideration by R-devel /
> Simon Urbanek in
>
> http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-oth
> er-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>
>
>
>
>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <
>>>> mailinglist.honeypot at gmail.com**> wrote:
>>>>
>>>>   In response to a question from yesterday, I pointed someone to the
>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>>>>> necessarily by "burried" in the ShortRead package (aside from it
>>>>> having a `sr` prefix).
>>>>>
>>>>
>>> I don't know that srapply necessarily 'got it right'...
>
>
> One thing I like about srapply is its support for a reduce argument.
>
>>>>> I had thought it might be a good idea to move that (or something like
>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
>>>>> but also realized that it would add more dependencies where someone
>>>>> might not necessarily need them.
>
>
>>>>>
>>>>> But, almost surely, a large majority of the people will be happy to do
>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
>>>>> to add -- on the other hand, this large majority is probably enriched
>>>>> for people who are doing NGS analysis, in which case, keeping it in
>>>>> ShortRead can make some sense.
>
> I remain confused about the need for putting any of this into BiocGenerics
> at all.  It seems to me that properly construed parallization primitives
> ought to 'just work' with any object which supports indexing and length.
>
> I would appreciate hearing arguments to the contrary.
>
> Florian, in a similar vein, could we not seek to change
> parallel::makeCluster to be extensible to, say, support SGE cluster?  THis
> seems like the 'right thing to do'.  ???
>
>
> Regardless, I think we have raised some considerations that might inform
> improvements to parallel, including points about error handling, reducing
> results, block-level parallization over List/Vector (in addition to
> vector), etc.
>
> I think perhaps having a google doc that we can collectively edit to
> contain the requirements we are trying to achieve might move us forward
> effectively. Would this help? Or perhaps a page under
> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>
>
>>>>> Taking one step back, I recall some chatter last week (or two) about
>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
>>>>> and there being ideas to wrap different types of ||-ization behind an
>>>>> easy to use interface (I think this was the convo), and then I took a
>>>>> further step back and often wonder why we just don't bite the bullet
>>>>> and take advantage of the `foreach` infrastructure that is already out
>>>>> there -- in which case, I could imagne a "doSGE" package that might
>>>>> handle the particulars of what Florain is referring to. You could then
>>>>> configure it externally via some `registerDoSGE(some.config.**object)`
>>>>> and just have the package code happily run it through `foreach(...)
>>>>> %dopar%` and be done w/ it.
>>>>>
>>>>>
>>>>>   IMHO it is relevant.  I have not looked for other abstractions, and
>>>>> this
>>>> one seems
>>>> to work.  Florian's objectives might be a good test case for adequacy.
>>>>
>>>
>>> The registerDoDah does seem to be a useful abstraction.
>
> Is this not more-or-less the intention of parallel::setDefaultCluster?
>
> --Malcolm
>
>
>
>>>
>>> I think there's a lot of work to do for some sort of coordinated
>>> parallelization that putting parLapply into BiocGenerics might
>>> encourage;
>>> not good things will happen when everyone in a call stack tries to
>>> parallelize independently. But I'm in favor of parLapply in
>>> BiocGenerics at
>>> least for the moment.
>>>
>>> Martin
>>>
>>>
>>>
>>>>
>>>>   ... at least, I thought this is what was being talked about here (and
>>>>> popped up a week or two ago) -- sorry if I completely missed the mark
>>>>> ...
>>>>>
>>>>> -steve
>>>>>
>>>>>
>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>>>>> <florian.hahne at novartis.com> wrote:
>>>>>
>>>>>> Hi Martin,
>>>>>> I could define the generics in my own package, but that would mean
>>>>>> that
>>>>>> those will only be available there, or in the global environment
>>>>>> assuming
>>>>>> that I also export them, or in all additional packages that
>>>>>> explicitly
>>>>>> import them from my name space. Now there already are a whole bunch
>>>>>> of
>>>>>> packages around that all allow for parallelization via a cluster
>>>>>> object.
>>>>>> Obviously those all import the parLapply function from the parallel
>>>>>> package. That means that I can't simply supply my own modified
>>>>>> cluster
>>>>>> object, because the code that calls parLapply will not know about the
>>>>>> generic in my package, even if it is attached. Ideally parLapply
>>>>>> would
>>>>>> be
>>>>>> a generic function already in the parallel package. Not sure who
>>>>>> needs
>>>>>> to
>>>>>> be convinced in order for this to happen, but my gut feeling was
>>>>>> that it
>>>>>> could be easier to have the generic in BiocGenerics.
>>>>>> Maybe I am missing something obvious here, but imo there is no way to
>>>>>> overwrite parLapply globally for my own class unless the generic is
>>>>>> imported by everyone who wants to make use of the special method.
>>>>>> Florian
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>>>>>>
>>>>>>   On 10/17/2012 05:45 AM, Hahne, Florian wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I was wondering whether it would be possible to have proper
>>>>>>>> generics
>>>>>>>>
>>>>>>> for
>>>>>
>>>>>> some of the functions in the parallel package, e.g. parLapply and
>>>>>>>> clusterCall. The reason I am asking is because I want to build an
>>>>>>>> S4
>>>>>>>> class
>>>>>>>> that essentially looks like an S3 cluster object but knows how to
>>>>>>>> deal
>>>>>>>> with the SGE. That way I can abstract away all the overhead
>>>>>>>> regarding
>>>>>>>> job
>>>>>>>> submission, job status and reducing the results in the parLapply
>>>>>>>> method
>>>>>>>> of
>>>>>>>> that class, and would be able to supply this new cluster object to
>>>>>>>> all
>>>>>>>> of
>>>>>>>> my existing functions that can be processed in parallel using a
>>>>>>>> cluster
>>>>>>>> object as input. I have played around with the BatchJobs package
>>>>>>>> as an
>>>>>>>> abstraction layer to SGE and that work nicely. As a test case I
>>>>>>>> have
>>>>>>>> created the necessary generics myself in order to supply my own
>>>>>>>> SGEcluster
>>>>>>>> object to a function that normally deals with the "regular"
>>>>>>>> parallel
>>>>>>>> package S3 cluster objects and everything just worked out of the
>>>>>>>> box,
>>>>>>>> but
>>>>>>>> obviously this fails once I am in a name space and my generic is
>>>>>>>> not
>>>>>>>> found
>>>>>>>> anymore. Of course what we would really want is some proper
>>>>>>>> abstraction
>>>>>>>> of
>>>>>>>> parallelization in R, but for now this seem to be at least a cheap
>>>>>>>> compromise. Any thoughts on this?
>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't
>>>>>>> actually send
>>>>>>> any email!
>>>>>>>
>>>>>>> Is there an obstacle to promoting these to generics in your own
>>>>>>> package?
>>>>>>> The
>>>>>>> usual motivation for inclusion in BiocGenerics has been to avoid
>>>>>>> conflicts
>>>>>>> between packages, but I'm not sure whether this is the case (yet)?
>>>>>>> This
>>>>>>> would
>>>>>>> also add a dependency fairly deep in the hierarchy.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>   Florian
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793
>>>>>>>
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.c
>>>>>> h/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>    | Memorial Sloan-Kettering Cancer Center
>>>>>    | Weill Medical College of Cornell University
>>>>> Contact Info:
>>>>> http://cbio.mskcc.org/~lianos/**contact<http://cbio.mskcc.org/%7Elianos
>>>>> /contact>
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch
>>>>> /mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioc-devel at r-project.org mailing list
>>>>
>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/
>>>> mailman/listinfo/bioc-devel>
>>>>
>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> ______________________________**_________________
>>> Bioc-devel at r-project.org mailing list
>>>
>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/m
>>> ailman/listinfo/bioc-devel>
>>>
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793