[R] Large data sets with R (binding to hadoop available?)

Avram Aelony aavram at mac.com
Fri Aug 29 22:36:02 CEST 2008


Hi Martin,

Sorry for the late reply.  I realize this might now be straying too
far from r-help; if there is a better forum for this topic (R use
with Hadoop), please let me know.

I agree it would indeed be great to leverage Hadoop via R syntax or R
itself.  A first step is figuring out how computations can be
translated into map and reduce steps (see the small sketch after the
links below).  I am beginning to see efforts in this direction:

http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal
http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
http://cwiki.apache.org/MAHOUT/
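
A computation such as a global mean already decomposes into map and
reduce steps with base R alone, which hints at how work would split
across Hadoop workers.  A minimal sketch (the data and chunking are
made up for illustration):

  # Pretend the chunks live on different nodes of a distributed FS.
  x <- runif(1e6)
  chunks <- split(x, rep(1:10, length.out = length(x)))

  # Map: each "worker" emits a partial (sum, count) for its chunk.
  mapped <- Map(function(chunk) c(sum = sum(chunk), n = length(chunk)),
                chunks)

  # Reduce: combine the partial results, then finish the mean.
  totals <- Reduce(`+`, mapped)
  totals["sum"] / totals["n"]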

Per Wikipedia, "A mahout is a person who drives an elephant".  It  
would be nice if PIG and R either played well together or adopted  
each other's strengths (in driving the Hadoop elephant)!


Avram

On Aug 22, 2008, at 9:24 AM, Martin Morgan wrote:

> Hi Avram --
>
> My understanding is that Google-like map / reduce achieves throughput
> by coordinating distributed calculation with distributed data.
>
> snow, Rmpi, nws, etc. provide a way of distributing calculations, but
> don't help with coordinating distributed calculation with distributed
> data.
>
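
(For concreteness, a minimal sketch of the model Martin describes,
assuming the snow package and a local socket cluster: the calculation
is distributed, but every chunk of data still travels from the master
to the workers.)

  library(snow)
  cl <- makeCluster(4, type = "SOCK")

  # Data lives on the master; parLapply ships each chunk to a worker.
  chunks <- split(runif(1e6), rep(1:4, length.out = 1e6))
  partial <- parLapply(cl, chunks, sum)

  total <- sum(unlist(partial))
  stopCluster(cl)
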
> SQL (at least naively implemented as a single database server) doesn't
> help with distributed data and the overhead of data movement from the
> server to compute nodes might be devastating. A shared file system
> across compute nodes (the implicit approach usually taken in parallel R
> applications) offloads data distribution to the file system, which may
> be effective for not-too-large (10's of GB?) data.
>
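
(A sketch of the shared-file-system approach: each worker reads only
its own slice of a file that all nodes can see, so no data flows
through the master.  The helper and its arguments are hypothetical.)

  # Worker k of n reads rows [first, first + rows) of a shared file.
  read_slice <- function(file, k, n, total_rows) {
    rows  <- ceiling(total_rows / n)
    first <- (k - 1) * rows
    read.table(file, skip = first, nrows = min(rows, total_rows - first))
  }
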
> Many non-trivial R algorithms are not directly usable in distributed
> map, because they expect to operate on 'all of the data' rather than
> on data chunks. Out-of-the-box 'reduce' in R is really limited to
> collation (the parallel lapply-like functions) or sapply-like
> simplification; one would rather have more talented reducers (e.g., to
> aggregate bootstrap results).
>
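
(One reading of a "more talented reducer", sketched with made-up data:
each map task returns a vector of bootstrap replicates, and the reduce
step concatenates them before summarising.)

  x <- rnorm(1000)

  # Map: each of 8 tasks produces 250 bootstrap medians.
  mapped <- lapply(1:8, function(i)
    replicate(250, median(sample(x, replace = TRUE))))

  # Reduce: concatenate the replicates, then aggregate.
  reps <- Reduce(c, mapped)
  quantile(reps, c(0.025, 0.975))
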
> The list of talents required to exploit Hadoop starts to become
> intimidating (R, Java, Hadoop, PIG, + cluster management, etc), so it
> would certainly be useful to have that encapsulated in a way that
> requires only R skills!
>
> Martin
>
> <Rory.WINSTON at rbs.com> writes:
>
>> Hi
>>
>> Apart from database interfaces such as sqldf which Gabor has
>> mentioned, there are also packages specifically for handling large
>> data: see the "ff" package, for instance.
>>
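
(A taste of ff, as a sketch: the vector lives on disk and only the
slice currently being indexed is pulled into RAM; the chunk loop here
is hand-rolled.)

  library(ff)

  # A disk-backed vector of 1e8 doubles; memory holds only a window.
  x <- ff(vmode = "double", length = 1e8)

  # Accumulate a sum one million elements at a time.
  total <- 0
  for (i in seq(1, length(x), by = 1e6)) {
    j <- min(i + 1e6 - 1, length(x))
    total <- total + sum(x[i:j])
  }
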
>> I am currently playing with parallelizing R computations via Hadoop.
>> I haven't looked at PIG yet though.
>>
>> Rory
>>
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Roland Rau
>> Sent: 21 August 2008 20:04
>> To: Avram Aelony
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Large data sets with R (binding to hadoop available?)
>>
>> Hi
>>
>> Avram Aelony wrote:
>>> Dear R community,
>>> I find R fantastic and use R whenever I can for my data analytic
>>> needs.  Certain data sets, however, are so large that other tools
>>> seem to be needed to pre-process data such that it can be brought
>>> into R for further analysis.
>>> Questions I have for the many expert contributors on this list are:
>>> 1. How do others handle situations of large data sets (gigabytes,
>>> terabytes) for analysis in R?
>>>
>> I usually try to store the data in an SQLite database and interface
>> via functions from the packages RSQLite (and DBI).
>>
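
(A minimal sketch of that workflow; the file, table, and column names
are invented for illustration.)

  library(RSQLite)
  con <- dbConnect(SQLite(), dbname = "big_data.sqlite")

  # Load the raw file once, then pull only the subsets needed.
  dbWriteTable(con, "events", read.csv("events.csv"))
  d <- dbGetQuery(con,
    "SELECT user, SUM(value) AS total FROM events GROUP BY user")
  dbDisconnect(con)
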
>> No idea about Question No. 2, though.
>>
>> Hope this helps,
>> Roland
>>
>>
>> P.S. When I am sure that I only need a certain subset of large data
>> sets, I still prefer to do some pre-processing in awk (gawk).
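
(That pre-processing can even be piped straight into R; the gawk
filter and file name here are made up.)

  # Keep only rows whose third column exceeds 100 before R sees them.
  d <- read.table(pipe("gawk '$3 > 100' big_file.txt"))
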
>> 2.P.S. The size of my data sets is in the gigabyte range (not
>> terabyte range). This might be important if your data sets are
>> *really large* and you want to use sqlite:
>> http://www.sqlite.org/whentouse.html
>>
>
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793


