[Rd] Cell or PS3 Port

Ed Knutson ed at sixfoursystems.com
Fri Nov 2 17:51:41 CET 2007


The main core of the Cell (the PPE) uses IBM's version of 
hyperthreading to expose two logical CPUs to the OS, so code that is 
"simply" multi-threaded should still see an advantage.  In addition, 
IBM provides an SDK that includes workflow management as well as 
libraries supporting common linear algebra and other math functions on 
the sub-processors (called SPEs).  They also provide an interface to a 
hardware RNG as well as 3 software generators (2 pseudo, 1 quasi) that 
are coded for the SPE.

Each SPE has its own small, local memory store and communicates with 
main memory using a DMA queue.  It seems to be a question of breaking up 
each task into units that are small enough to offload to an SPE.  My 
initial direction will be to set up a rudimentary workflow manager.  As 
an optimized function is encountered, a sufficient number of SPE threads 
will be spawned and execution of the main thread will wait for all 
results.  As for the optimized functions, I intend to start with the 
ones that already have an analogous implementation in the IBM math 
libraries.

MPI has been employed by some Cell developers to allow multiple SPEs 
working on sections of the same task to communicate with each other.  I 
like the idea of this approach, since it lays the groundwork to allow 
multiple Cell (or really any) processors to be clustered.


Luke Tierney wrote:
> I have been experimenting with ways of parallelizing many of the
> functions in the math library.  There are two experimental packages
> available in http://www.stat.uiowa.edu/~luke/R/experimental: pnmath,
> based on OpenMP, and pnmath0, based on basic pthreads.  I'm not sure
> to what degree the approach there would carry over to GPUs or Cell
> where the additional processors are different from the main processor
> and may not share memory (I forget how that works on Cell).
> 
> The first issue is that you need some modifications to some
> functions to ensure they are thread-safe.  For the most part these are
> minor; a few functions would require major changes and I have not
> tackled them for now (Bessel functions, wilcox, signrank I believe).
> RNG functions are also not suitable for parallelization given the
> dependence on the sequential underlying RNG.
> 
> It is not too hard to get parallel versions to use all available
> processor cores. The challenge is to make sure that the parallel
> versions don't run slower than the serial versions. They may if the
> amount of data is too small.  What is too small for each function
> depends on the OS and the processor/memory architecture; if memory is
> not shared this gets more complicated still.  For some very simple
> functions (floor, ceiling, sign) I could not see any reliable benefit
> of parallelization for reasonable data sizes on the systems I was
> using so I left those alone for now.



More information about the R-devel mailing list