[R] Preparing for multi-core CPUs and parallel processing applications

Martin Morgan mtmorgan at fhcrc.org
Fri Jul 31 16:00:19 CEST 2009


Hi Steve --

Steve_Friedman at nps.gov wrote:
> Hello
> 
> I am fortunate (or in really big trouble) in that the research group I work
> with will soon be receiving several high end dual quad core machines. We
> will use the Ubuntu OS on these.  We intend to use this cluster for some
> extensive modeling applications. Our programming guru has demonstrated the
> ability to link much simpler machines to share CPUs and we purchased the
> new ones to take advantage of this option.  We have also begun exploration
> of the R CUDA and J CUDA functionality to push the processing to the
> graphics processor (GPU), which greatly speeds up the numerical processing.
> 
> My question(s) to this group:

Last question first, the R-sig-hpc group might be more appropriate for
an extended discussion.

  https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

See also the HighPerformanceComputing task view

  http://cran.fhcrc.org/web/views/HighPerformanceComputing.html


> 1)  Which packages are suitable for parallel processing applications in R?
> 2)  Are these packages ready for prime time applications or are they
> developmental at this time?

I use Rmpi for all my parallel computing, but if I had more time I'd
explore multicore for more efficient use of the several CPUs on a single
machine, and the new offerings from Revolution Computing. If there were
significant portions of C code I'd look into using OpenMP (as done in
the pnmath package). I'd also use a parallel BLAS / LAPACK library if
that were where significant computation was occurring.
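As a small illustration of the single-machine route (a sketch only,
assuming the multicore package is installed -- it is not part of base R):

```r
## mclapply() from the 'multicore' package is a parallel drop-in for
## lapply(): it forks one worker per core on a single machine.
library(multicore)

## square the numbers 1..8, spread across up to 4 cores
res <- mclapply(1:8, function(x) x^2, mc.cores = 4)
unlist(res)
```

Note that multicore relies on fork() and so is unavailable on Windows;
there Rmpi or snow would be the alternative.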

> 3)  Are we better off working in Java or C++ for the majority of this
> simulation work and linking to R for statistical analysis?
> 4)  What are the pitfalls, if any, that I need to be aware of?

With multiple cores, it's important to remember that total memory is
divided amongst the cores, so that huge-sounding 32 GB 8-core machine has
'only' 4 GB per core when independent R processes are allocated to each
core (as is the style with Rmpi).

> 5)  Can we take advantage of sharing the GPU, via R CUDA, in a
> parallel distributed shared cluster of dedicated machines?
> 
> 6)  Our statistical analysis and modeling applications address very large
> geographic issues.  We generally work with 30-40 year daily time step data
> in a gridded format. The grid is approximately 250 x 400 cells in extent,
> each representing approximately 500 meters x 500 meters.  To this we add a
> very large suite of ancillary information, both spatial and non-spatial,
> to simulate a variety of ecological state conditions.  My question is - is
> this too large for R, given its use of memory?

Depending on the application, large data sets can often be managed
effectively on disk, e.g., by using the ncdf package (for large numeric
data) or a database (SQLite via the RSQLite package, for instance), and
analyzing independent 'slices'. This fits well with common parallel
computing paradigms.
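A minimal sketch of the 'slice' idea with SQLite (assuming the RSQLite
package is installed; the table and column names here are invented for
illustration):

```r
library(RSQLite)

## an in-memory database for the sketch; use a file path for real data
con <- dbConnect(SQLite(), dbname = ":memory:")

## toy table standing in for decades of daily gridded values
dbWriteTable(con, "daily",
             data.frame(year  = rep(1980:1981, each = 3),
                        cell  = rep(1:3, times = 2),
                        value = rnorm(6)))

## pull one year's slice at a time; each slice could go to its own worker
slice <- dbGetQuery(con, "SELECT cell, value FROM daily WHERE year = 1980")
nrow(slice)

dbDisconnect(con)
```

Only one slice is ever in memory at a time, so the full data set never
needs to fit in RAM.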

> 
> 7)  I currently have a laptop with Ubuntu with R Version 2.6.2
> (2008-02-08). What is the most recent R version for Ubuntu and what is the
> installation procedure ?
> 
> These are just the initial questions that I'm sure to have.  If these are
> being directed to the wrong help pages, I'm sorry to have taken your time.
> If you would be so kind as to direct me to the more appropriate help site
> I'd appreciate your assistance.
> 
> Thanks in advance,
> Steve
> 
> 
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
> 
> Steve_Friedman at nps.gov
> Office (305) 224 - 4282
> Fax     (305) 224 - 4147
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
