[BioC] parallel mods to affy package

Warnes, Gregory R gregory_r_warnes@groton.pfizer.com
Fri, 4 Oct 2002 12:05:06 -0400

> -----Original Message-----
> From: Vincent Carey 525-2265 [mailto:stvjc@channing.harvard.edu]
> >
> > I'm just starting to look at integrating Luke Tierney's 'snow' package with
> > the 'affy' package in order to parallelize the work.
> >
> > Initially, I'm planning on modifying 'express' by adding a new parameter
> > "cl" for cluster.  Next I'll probably tackle ReadAffy and friends.
> >
> > 1) Comments on the plan?
> seems a worthy endeavor, but
> does the source of the affy package really need to be modified
> for this?  can't wrappers be written that break up the
> problem and reassemble the results?  keep the package distinct
> from the various modes of execution

Actually, it does look simplest to modify the source of the affy package.
'apply' and friends are already being used in the right places, and the
changes are simple substitutions like:

	if( missing(cl) ) {
	  # do the normal apply thing
	} else {
	  # do the parallel apply thing
	}
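
A minimal sketch of what such a substitution could look like. Everything here is illustrative: 'col_summary' is a made-up helper, not an affy function, and the data are invented. The sketch loads the 'parallel' package that ships with R, which exports makeCluster(), parApply() and stopCluster(); snow provides functions of the same names, so the same pattern is assumed to carry over.

```r
# Hypothetical helper (not part of affy): apply 'fun' to each column of 'x',
# serially by default, or on a cluster when 'cl' is supplied.
library(parallel)  # assumption: used here in place of snow for a self-contained example

col_summary <- function(x, fun, cl) {
  if (missing(cl)) {
    apply(x, 2, fun)           # the normal apply thing
  } else {
    parApply(cl, x, 2, fun)    # the parallel apply thing
  }
}

# Made-up data: 4 probes (rows) on 3 chips (columns).
x <- matrix(as.numeric(1:12), nrow = 4)

serial_res <- col_summary(x, median)

cl <- makeCluster(2)           # two local worker processes
parallel_res <- col_summary(x, median, cl)
stopCluster(cl)
```

Callers that never pass 'cl' see no change in behaviour, which is the appeal of making the change inside the function rather than wrapping it from outside.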

Writing wrappers would take quite a bit more thought, and probably
synchronization, to properly split up the data beforehand, run the affy
functions on the subsets, and then reassemble the results.  The basic problem
is knowing which functions have data dependencies that prevent
parallelization and which don't.

It would be painful, from the outside, to do

	split data for fun1
	run fun1 in parallel
	join the data for un-parallelizable fun2
	run fun2 on all the data
	split the data for fun3
	run fun3 in parallel

especially since some of the alternative approaches to a particular step allow
easy parallelization, and some don't.  So, for instance, quantile
normalization isn't trivially parallelizable by splitting along chips, while
globally scaling the trimmed mean to, say, 300 is trivially parallelizable.
Knowing when to split and join will be a problem that requires examining the
code for each potential function.  Once you hit that level, it's easier to
just modify the functions themselves.
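
To make the data-dependency point concrete, here is a small sketch in plain R. The function names ('scale_chip', 'quantile_normalize'), the trim fraction, and the data are all made up for illustration; these are simplified stand-ins, not affy's implementations. It splits the chips into two groups, processes each group separately, and joins the results.

```r
# Scale one chip so its trimmed mean equals 'target'. Needs only that chip's data.
scale_chip <- function(v, target = 300, trim = 0.1) {
  v * target / mean(v, trim = trim)
}

# Simplified quantile normalization (no special handling of ties):
# every chip is mapped onto the mean of the sorted columns.
quantile_normalize <- function(x) {
  ref <- rowMeans(apply(x, 2, sort))   # reference distribution uses ALL chips in x
  apply(x, 2, function(v) ref[rank(v, ties.method = "first")])
}

set.seed(1)
x <- matrix(rexp(40, rate = 1 / 200), nrow = 10)  # invented intensities: 10 probes, 4 chips
halves <- list(x[, 1:2], x[, 3:4])                # split along chips

# Scaling: split, process, join matches processing everything at once.
scaled_whole <- apply(x, 2, scale_chip)
scaled_split <- do.call(cbind, lapply(halves, function(h) apply(h, 2, scale_chip)))

# Quantile normalization: the same split and join gives a different answer,
# because each half builds its own reference distribution.
qn_whole <- quantile_normalize(x)
qn_split <- do.call(cbind, lapply(halves, quantile_normalize))
```

Comparing 'scaled_whole' with 'scaled_split' and 'qn_whole' with 'qn_split' (for example with all.equal) shows which step survives a naive split along chips and which needs a join first.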


