[BioC] bioconductor on EMR / mapreduce

Dan Tenenbaum dtenenba at fhcrc.org
Wed Sep 26 02:52:31 CEST 2012


On Tue, Sep 25, 2012 at 5:48 PM, Dan Tenenbaum <dtenenba at fhcrc.org> wrote:
> On Mon, Sep 24, 2012 at 11:28 PM, seth redmond <seth.redmond at pasteur.fr> wrote:
>> I'm not too worried about set-up times; once I'd bootstrapped libraries into
>> place I could control the proportion of setup time per node by increasing
>> the granularity - besides, with the volume of data I'm looking at I don't
>> expect it to be a major issue, but having to switch between MPI and hadoop
>> for my clusters will be.
>>
>> This particular case seems to me to be a simple package dependency (though
>> I'm not sure recompiling R on the EMR image is something I want to get
>> into), however it's not likely to be the last one I run into. So I'm
>> wondering how complex it would be to, for instance, compile an R library on
>> the machine image and then transfer it into place for each run? I guess this
>> would be a factor of how many dependencies bioC has outside of the packages?
>> (for running, that is, not compiling) - obviously samtools, and similar, but
>> I'm thinking more of library dependencies that will be harder to debug.
>
> Can't you accomplish these things with bootstrap scripts?
> Here is an example of bootstrapping recent R into EMR:
> http://www.r-bloggers.com/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/


I meant to add: You can use the same technique to install BioC package
dependencies, e.g.:

R -e "source('http://bioconductor.org/biocLite.R'); biocLite(c('dep1', 'dep2', 'dep3'))"
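
As a rough sketch of how that could look inside a bootstrap script: the
helper names (r_vector, bioc_install_expr, install_bioc) and the DRY_RUN
switch are made up for illustration, and dep1..dep3 are placeholders for
whatever packages you need. It only prints the command unless DRY_RUN=0,
so the quoting can be checked before it runs on every node.

```sh
#!/bin/sh
# Hypothetical bootstrap helper (names are illustrative, not part of EMR
# or Bioconductor). Builds the R one-liner that installs Bioconductor
# packages through biocLite().

# Turn the arguments into an R character vector literal,
# e.g. "dep1 dep2" becomes c('dep1', 'dep2').
r_vector() {
    out=""
    for pkg in "$@"; do
        if [ -z "$out" ]; then
            out="'$pkg'"
        else
            out="$out, '$pkg'"
        fi
    done
    printf "c(%s)" "$out"
}

# Compose the full R expression. Only single quotes are used inside it,
# so it can be wrapped in double quotes on the shell command line.
bioc_install_expr() {
    printf "source('http://bioconductor.org/biocLite.R'); biocLite(%s)" \
        "$(r_vector "$@")"
}

# Print the command by default; set DRY_RUN=0 to actually invoke R.
install_bioc() {
    rexpr=$(bioc_install_expr "$@")
    if [ "${DRY_RUN:-1}" = "0" ]; then
        R -e "$rexpr"
    else
        printf 'R -e "%s"\n' "$rexpr"
    fi
}

install_bioc dep1 dep2 dep3
```

Package names containing a single quote would break the generated R
expression; this sketch assumes plain package names.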

Dan

>
>>
>> I take it the EMR portion of the bioC-in-the-cloud project has been dropped?
>>
>
> There never really was an EMR portion.
> We are interested in hearing about compelling use cases, though.
> Dan
>
>
>> -s
>>
>>
>> --
>> Seth Redmond
>>   Unité Génetique et Génomique des Insectes Vecteurs
>>   Institut Pasteur
>>   28,rue du Dr Roux
>>   75724 PARIS
>> seth.redmond at pasteur.fr
>>
>> On 24 Sep 2012, at 19:50, Dan Tenenbaum wrote:
>>
>> On Mon, Sep 24, 2012 at 9:42 AM, seth redmond <seth.redmond at pasteur.fr>
>> wrote:
>>
>> I'm trying to install some bioC modules on EC2 / Elastic Mapreduce but I'm
>> running into some library errors when installing (error below). Whilst I
>> could install them locally on each machine, if possible I'd rather avoid the
>> overhead both in terms of bootstrapping the machines, and having to check
>> for library errors whenever I write a new method.
>>
>>
>> Does anyone have experience of running bioC in the cloud in this manner,
>> and has anyone tried, for instance, building a library in an S3 bucket and
>> running directly from there, or porting the R lib wholesale when starting
>> up the nodes? Or is it possible to use the BioC AWS image in EMR somehow?
>>
>>
>>
>> From what I have been able to tell, AWS EMR is not very usable with R.
>> It takes longer to load packages on each mapper/reducer than it does
>> to run the calculation I am trying to parallelize.
>>
>> I've looked at other strategies like RHIPE, or good old MPI.
>> Dan
>>
>>
>>
>> thanks
>>
>>
>> -s
>>
>>
>>
>> * Installing *source* package 'DNAcopy' ...
>> ** libs
>> gfortran   -fpic  -g -O2 -c changepoints.f -o changepoints.o
>> gcc -std=gnu99 -I/usr/share/R/include      -fpic  -g -O2 -c flchoose.c -o flchoose.o
>> gcc -std=gnu99 -I/usr/share/R/include      -fpic  -g -O2 -c fphyper.c -o fphyper.o
>> gcc -std=gnu99 -I/usr/share/R/include      -fpic  -g -O2 -c fpnorm.c -o fpnorm.o
>> gfortran   -fpic  -g -O2 -c getbdry.f -o getbdry.o
>> gfortran   -fpic  -g -O2 -c hybcpt.f -o hybcpt.o
>> gfortran   -fpic  -g -O2 -c prune.f -o prune.o
>> gcc -std=gnu99 -I/usr/share/R/include      -fpic  -g -O2 -c rshared.c -o rshared.o
>> gfortran   -fpic  -g -O2 -c segmentp.f -o segmentp.o
>> gcc -std=gnu99 -shared  -o DNAcopy.so changepoints.o flchoose.o fphyper.o fpnorm.o getbdry.o hybcpt.o prune.o rshared.o segmentp.o  -lgfortran -lm -L/usr/lib64/R/lib -lR
>> /usr/bin/ld: cannot find -lgfortran
>> collect2: ld returned 1 exit status
>> make: *** [DNAcopy.so] Error 1
>> ERROR: compilation failed for package 'DNAcopy'
>> ** Removing '/home/hadoop/R/x86_64-pc-linux-gnu-library/2.7/DNAcopy'
>>
>> The downloaded packages are in
>>        /tmp/RtmpxSeilp/downloaded_packages
>>
>>
>>
>> _______________________________________________
>>
>> Bioconductor mailing list
>>
>> Bioconductor at r-project.org
>>
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
