[R] Rserve and R to R communication

Mon Apr 9 23:02:58 CEST 2007

On 4/9/07, Simon Urbanek <Simon.Urbanek at r-project.org> wrote:
>
> On Apr 7, 2007, at 10:56 AM, Ramon Diaz-Uriarte wrote:
>
> > Dear All,
> >
> > The "clients.txt" file of the latest Rserve package, by Simon
> > Urbanek, says, regarding its R client,
> >
> > "(...) a simple R client, i.e. it allows you to connect to Rserve
> > from R itself. It is very simple and limited, because Rserve was
> > not primarily meant for R-to-R communication (there are better ways
> > to do that), but it is useful for quick interactive connection to
> > an Rserve farm."
> >
> > What are those better ways to do it? I am thinking about using
> > Rserve to have an R process send jobs to a bunch of Rserves in
> > different machines. It is like what we could do with Rmpi (or pvm),
> > but without the MPI layer. Therefore, presumably it'd be easier to
> > deal with network problems and machine failures, to use checkpoints,
> > etc. (i.e., to try to get better fault tolerance).
> >
> > It seems that Rserve would provide the basic infrastructure for
> > doing that and would save me from reinventing the wheel of using
> > sockets, etc., directly from R.
> >
> > However, Simon's comment about better ways of R-to-R communication
> > made me wonder if this idea really makes sense. What is the catch?
> > Have other people tried similar approaches?
> >
>
> I was commenting on direct R-to-R communication using sockets +
> 'serialize' in R or the 'snow' package for parallel processing. The
> latter could be useful for what you have in mind, because it includes
> a socket-based implementation which allows you to spawn multiple
> children (across multiple machines) and collect their results. It
> uses regular rsh or ssh to start the jobs, so if you can use that, it
> should work for you. 'snow' also has PVM and MPI implementations; the
> PVM one is really easy to set up (on Unix), and that was what I was
> using for parallel computing in R on a cluster.
>
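
[A minimal sketch of the socket-based approach described above: spawn
several workers, send them jobs, and collect the results. It assumes the
'parallel' package bundled with current R, which provides snow-style
functions such as clusterApply; the thread itself refers to 'snow'. The
host list is hypothetical, and remote host names would additionally need
working ssh or rsh access.]

```r
# Sketch only, under the assumptions stated above.
library(parallel)

# Hypothetical worker list: two R processes on the local machine.
# Remote host names could be listed here instead.
hosts <- c("localhost", "localhost")

# Start one socket-connected worker per entry in 'hosts'.
cl <- makePSOCKcluster(hosts)

# Hand one job per element to the workers; results come back in job order.
squares <- clusterApply(cl, 1:4, function(x) x^2)

# Shut the workers down again.
stopCluster(cl)

print(unlist(squares))
```

[Note that a plain socket cluster like this does not by itself recover
from a worker that dies mid-job; any retry or checkpoint logic would have
to be added on top.]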

I think I now understand your comments. I've used snow and Rmpi quite
a bit. But the problem with Rmpi (or, rather, MPI) is the lack of
fault tolerance: if a node goes down, the whole MPI universe breaks,
and thus the complete set of slaves. Setting up some kind of
fault-tolerant scheme with Rserve seemed possible/simpler (as it does
not depend on the MPI layer).

(Yes, maybe I should check snowFT, but it uses PVM, and I recall a
while back there was a reason why we decided to go with MPI instead of
PVM).

> Rserve is sort of comparable, but in addition it provides the
> spawning infrastructure due to its client/server concept. What it
> doesn't have is the convenience functions that snow provides like
> clusterApply etc. Thinking about it, it would actually be possible to
> add them, although I admit that the original goal of Rserve was not
> parallel computing :). The idea was to have one Rserve server and
> multiple clients

Aha. I should have seen that. I think I understand the differences better now.

> whereas in 'snow' you sort of have one client and
> multiple servers. You could spawn multiple Rserves on multiple
> machines, but Rserve itself doesn't provide any load-balancing out of
> the box, so you'd have to do that yourself.
>

Yes, sure. I think that should be doable, though, if I decide to try
to go down this route.

> I don't know if that helps... :)
>

It does help! Thanks a lot.

Best,

R.
> Cheers,
> Simon

-- 
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz