[R] R and Hadoop Integrated Processing Environment - RHIPE

Saptarshi Guha saptarshi.guha at gmail.com
Sat Jan 24 19:08:55 CET 2009


Hello,
We have created an interface between R and Hadoop so that the user
can, after a fashion, interact with very large datasets using the
Map Reduce programming model. We also use IBM's TSpaces to provide
shared memory that can be accessed from R (somewhat like
networkspaces). RHIPE uses Rserve to execute R code.

Some of the functions implemented are:
mrlapply - run lapply across a Hadoop cluster
mrsubsetf - subset a file according to an R function
mtapplyf - run a tapply on a file
mrmapreduce - run a map reduce algorithm on a file or group of files.  
The user provides a mapper and reducer.
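
For readers new to the Map Reduce model that mrmapreduce is built on,
here is a minimal sketch of the mapper/reducer idea in plain base R,
run locally in a single process. It does not use RHIPE or Hadoop, and
the names mapper, reducer and local_mapreduce are hypothetical, chosen
only for this illustration; the actual interface is described on the
project page.

```r
# Local, single-process illustration of the map reduce pattern in base R.
# This is NOT the RHIPE API; all names below are hypothetical.

# Mapper: takes one input record (a line of text) and returns a list of
# key/value pairs, here one (word, 1) pair per word.
mapper <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings from the split
  lapply(words, function(w) list(key = w, value = 1))
}

# Reducer: takes a key and all values emitted for it, returns one value.
reducer <- function(key, values) sum(values)

local_mapreduce <- function(records, mapper, reducer) {
  # Map phase: apply the mapper to every record and flatten the results.
  pairs <- unlist(lapply(records, mapper), recursive = FALSE)
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))
  # Shuffle phase: group the emitted values by key.
  grouped <- split(values, keys)
  # Reduce phase: apply the reducer to each key and its values.
  mapply(reducer, names(grouped), grouped)
}

records <- c("R and Hadoop", "Hadoop and Map Reduce", "R and Rserve")
counts <- local_mapreduce(records, mapper, reducer)
print(counts)
```

On a cluster the map and reduce phases run in parallel on different
machines; the sketch only shows what the user-supplied mapper and
reducer are each responsible for.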

There are also some shared memory operations such as mrread, mrtake
and mrput.
Currently, it is at a proof of concept stage and much work is required  
before it is production ready. However, for the adventurous, it is  
possible to use it to process large data.
For more information and examples please visit this page:
http://www.stat.purdue.edu/~sguha/rhipe

If anyone would like to contribute to this project, please email me  
directly - any help is welcome.

Regards
Saptarshi Guha



