[R] Large data sets with R (binding to hadoop available?)

Martin Morgan mtmorgan at fhcrc.org
Fri Aug 22 18:24:29 CEST 2008


Hi Avram --

My understanding is that Google-like map/reduce achieves throughput
by coordinating distributed calculation with distributed data.

snow, Rmpi, nws, etc. provide a way of distributing calculations, but
they don't help with coordinating those distributed calculations with
distributed data.
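
For example, with snow (a minimal sketch, assuming a local socket
cluster):

    library(snow)
    cl <- makeCluster(4, type = "SOCK")
    ## the data live on the master ...
    chunks <- split(rnorm(1e6), rep(1:4, length.out = 1e6))
    ## ... and parLapply ships each chunk to a worker over the socket:
    ## the calculation is distributed, but data movement is not avoided
    res <- parLapply(cl, chunks, mean)
    stopCluster(cl)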

SQL (at least as naively implemented on a single database server)
doesn't help with distributed data, and the overhead of moving data
from the server to the compute nodes might be devastating. A shared
file system across compute nodes (the implicit approach usually taken
in parallel R applications) offloads data distribution to the file
system, which may be effective for not-too-large (tens of GB?) data.
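
A sketch of that shared-file-system idiom (the '/shared/data' paths
and chunk files are hypothetical): only file names travel to the
workers, and each worker reads its own chunk.

    library(snow)
    cl <- makeCluster(4, type = "SOCK")
    ## hypothetical chunk files on a file system mounted on every node
    files <- sprintf("/shared/data/chunk-%02d.csv", 1:4)
    ## only file names and small summaries cross the wire; each worker
    ## reads its own chunk from the shared file system
    res <- parLapply(cl, files, function(f) colMeans(read.csv(f)))
    stopCluster(cl)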

Many non-trivial R algorithms are not directly usable in a distributed
map, because they expect to operate on 'all of the data' rather than
on data chunks. Out-of-the-box 'reduce' in R is really limited to
collation (the parallel lapply-like functions) or sapply-like
simplification; one would rather have more talented reducers (e.g., to
aggregate bootstrap results).
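
A hand-rolled reducer for bootstrap results might look something like
this (a toy sketch, not a recipe; pooling per-chunk replicates is only
illustrative):

    library(snow)
    cl <- makeCluster(4, type = "SOCK")
    ## each worker draws bootstrap replicates of the mean on its chunk
    bootChunk <- function(x, R = 250)
        replicate(R, mean(sample(x, replace = TRUE)))
    chunks <- split(rnorm(1e5), rep(1:4, length.out = 1e5))
    res <- parLapply(cl, chunks, bootChunk)
    stopCluster(cl)
    ## the 'reduce' step: concatenate rather than merely collate,
    ## then summarize the pooled replicates
    pooled <- Reduce(c, res)
    quantile(pooled, c(0.025, 0.975))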

The list of talents required to exploit Hadoop starts to become
intimidating (R, Java, Hadoop, Pig, cluster management, etc.), so it
would certainly be useful to have all of that encapsulated in a way
that requires only R skills!

Martin

<Rory.WINSTON at rbs.com> writes:

> Hi
>
> Apart from database interfaces such as sqldf which Gabor has
> mentioned, there are also packages specifically for handling large
> data: see the "ff" package, for instance.
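>
> With ff, something along these lines keeps the data on disk (an
> untested sketch; the file and column names are made up):
>
>     library(ff)
>     ## rows live in memory-mapped files on disk, not in RAM
>     big <- read.csv.ffdf(file = "big.csv")
>     ## '[]' materializes a single column only when it is needed
>     mean(big$value[])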
>
> I am currently playing with parallelizing R computations via Hadoop. I
> haven't looked at PIG yet though.
>
> Rory
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Roland Rau
> Sent: 21 August 2008 20:04
> To: Avram Aelony
> Cc: r-help at r-project.org
> Subject: Re: [R] Large data sets with R (binding to hadoop available?)
>
> Hi
>
> Avram Aelony wrote:
>> Dear R community,
>> I find R fantastic and use R whenever I can for my data analytic
>> needs.  Certain data sets, however, are so large that other tools
>> seem to be needed to pre-process data such that it can be brought
>> into R for further analysis.
>> Questions I have for the many expert contributors on this list are:
>> 1. How do others handle situations of large data sets (gigabytes,
>> terabytes) for analysis in R ?
>>
> I usually try to store the data in an SQLite database and interface
> via functions from the packages RSQLite (and DBI).
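>
> For example (an untested sketch; the table and column names are
> invented):
>
>     library(RSQLite)
>     con <- dbConnect(SQLite(), dbname = "big.sqlite")
>     ## load the raw data once, in chunks if necessary
>     dbWriteTable(con, "obs", read.csv("chunk1.csv"), append = TRUE)
>     ## let the database do the subsetting; only the result enters R
>     sub <- dbGetQuery(con, "SELECT x, y FROM obs WHERE year = 2007")
>     dbDisconnect(con)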
>
> No idea about Question No. 2, though.
>
> Hope this helps,
> Roland
>
>
> P.S. When I am sure that I only need a certain subset of large data
> sets, I still prefer to do some pre-processing in awk (gawk).
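>
> For instance, a filter can stream through gawk on the way in (a
> sketch; 'big.dat' and the field positions are invented):
>
>     ## keep two fields where the third is positive; the full file
>     ## never enters R
>     d <- read.table(pipe("gawk '$3 > 0 { print $1, $3 }' big.dat"))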
> 2.P.S. The sizes of my data sets are in the gigabyte range (not the
> terabyte range). This might be important if your data sets are
> *really large* and you want to use SQLite:
> http://www.sqlite.org/whentouse.html
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
