[R] Amazon AWS, RGenoud, Parallel Computing

Mike Marchywka marchywka at hotmail.com
Sat Jun 11 22:09:29 CEST 2011

----------------------------------------
> Date: Sat, 11 Jun 2011 19:57:47 +0200
> Subject: Re: [R] Amazon AWS, RGenoud, Parallel Computing
> From: lui.r.project at googlemail.com
> To: marchywka at hotmail.com
> CC: r-help at r-project.org
>
> Hello Mike,
>
> To the best of my knowledge, the sort algorithm implemented in R is already
> "backed by C++" code and not natively written in R. Writing the code
> in C++ is not really an option either (I think rGenoud is also written
> in C++). I am not sure whether there really is a "bottleneck" with
> respect to the computer - I/O is pretty low, plenty of RAM left etc.
> It really seems to me as if parallelizing is not easily possible, or
> only at high cost, so that the benefits are eaten up by all the
> coordination and handling needed...
> Did anybody use rGenoud in "cluster mode" and experience something similar?
> Are quicksort packages available that use multiple processors efficiently
> (I didn't find any... :-( )?

I'm no expert, but these don't seem to be terribly subtle problems
in most cases. Sure, if the task is not suited to parallelism and
you force it to be parallel, it can spend all its time syncing
up. Just making more tasks to fight over the bottleneck (memory,
CPU, locks) can easily make things worse. I think I posted a link
earlier to an IEEE blurb showing how easy it is for many cores to
make things worse on non-contrived benchmarks.
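
To put rough numbers on the syncing point, here is a back-of-envelope
sketch in Python (not R, and nothing taken from rGenoud). Every figure
is a made-up assumption for illustration: 50 microseconds to sort a few
hundred entries, 1 ms of fixed overhead per message sent to a worker,
and 8 workers. The function name and its parameters are hypothetical,
and the model ignores contention for memory or locks, so real results
could be worse.

```python
def parallel_speedup(task_seconds, overhead_seconds, workers, batch=1):
    """Estimated speedup over a serial run under a toy cost model.

    Assumptions (illustrative only): each message to a worker carries
    `batch` tasks and pays one fixed `overhead_seconds`; work divides
    evenly across `workers`; nothing else is contended.
    """
    serial = batch * task_seconds
    parallel = (batch * task_seconds + overhead_seconds) / workers
    return serial / parallel


if __name__ == "__main__":
    sort_time = 50e-6   # assumed: one small sort
    overhead = 1e-3     # assumed: dispatch/serialization cost per message
    workers = 8

    # One tiny sort per message: overhead dominates.
    print(parallel_speedup(sort_time, overhead, workers))

    # 250 sorts per message: the same overhead is amortized.
    print(parallel_speedup(sort_time, overhead, workers, batch=250))
```

With these assumed figures the first case comes out around 0.38 (slower
than serial) and the second around 7.4. In this model the break-even
point is where the overhead equals (workers - 1) times the work sent per
message, so sending larger chunks of work per message is the obvious
thing to try before adding more cores.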

>
> I am by no means an expert on parallel processing, but is it possible
> that the benefits of parallelizing a process greatly diminish if a large
> set of variables/functions needs to be made available and the actual
> function (in this case sorting a few hundred entries) is quite short,
> whereas the number of times the function is called is very high? It
> was quite striking that the "first run" usually took several hours
> (instead of half an hour) and the subsequent runs were much, much
> faster.
>
> There is so much happening "behind the scenes" that it is a little
> hard for me to tell what might help - and what will not...
>
> Help appreciated :-)
> Thank you
>
> Lui
>
> On Sat, Jun 11, 2011 at 4:42 PM, Mike Marchywka  wrote:
> >
> >
> >
> >
> > ----------------------------------------
> >> Date: Sat, 11 Jun 2011 13:03:10 +0200
> >> From: lui.r.project at googlemail.com
> >> To: r-help at r-project.org
> >> Subject: [R] Amazon AWS, RGenoud, Parallel Computing
> >>
> >> Dear R group,
> >>
> >>
> > [...]
> >
> >> I am a little bit puzzled now about what I could do... It seems like
> >> there are only very limited options for me to increase the
> >> performance. Does anybody have experience with parallel computations
> >> with rGenoud or parallelized sorting algorithms? I think one major
> >> problem is that the sorting happens rather quickly (only a few hundred
> >> entries to sort), but needs to be done very frequently (population
> >> size >2000, iterations >500), so I guess the problem with the
> >> "housekeeping" of the parallel computation diminishes all benefits.
> >>
> > Is your sort part of the algorithm, or do you have to sort results after
> > getting them back out of order from async processes? One of
> > my favorite anecdotes is how I used a bash sort on a huge data
> > file to make a program run faster (from impractical zero percent CPU
> > to very fast with full CPU usage, and you complain about exactly
> > a lack of CPU saturation). I guess a couple of comments. First,
> > if you have specialized apps you need optimized, you may want
> > to write dedicated C++ code. However, this won't help if
> > you don't find the bottleneck. Lack of CPU saturation could
> > easily be due to "waiting for stuff" like disk IO or VM
> > swap. You really ought to find the bottleneck first; it
> > could be anything (except the CPU maybe LOL). The sort
> > that I used prevented VM thrashing with no change to the app
> > code: the app got sorted data and so VM paging became infrequent.
> > If you can specify the problem precisely you may be able to find
> > a simple solution.
> >
> >
 		 	   		  

