[R] RE: R performance questions

Michael Benjamin msb1129 at bellsouth.net
Thu Dec 4 04:21:02 CET 2003


Hi--

While I agree that we cannot agree on the ideal algorithms, we should
be taking practical steps to bring microarrays into the clinic.  I
think we can all agree that our algorithms have some degree of
efficacy over and above conventional diagnostic techniques.  If
patients are dying for lack of diagnostic accuracy, we have to work
hard to use this technology to help them, if we can.  I think we can,
even now.

What if I offer, in my clinic, a service for cancer patients to compare
their affy data to an existing set of data, to predict their prognosis
or response to chemotherapy?  I think people will line up out the door
for such a service.  Knowing what we as a group of array analyzers know,
wouldn't we all want this kind of service available if we or a loved one
got cancer?

Can our programs deal with 1,000 .cel files?  10,000 files?  

I think our programs are pretty good, but what we need is DATA.  We must
be careful what we wish for--we might get it!  So how do we measure
whether analyzing 10,000 .cel files with library(affy) is feasible?  I'm
assuming that advanced hardware would be required for such a task.  What
are the critical components of such a platform?  How much money would a
feasible system for array analysis cost?
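
One rough way to get a handle on that: time a small batch and
extrapolate.  A minimal sketch, assuming the affy package and a
directory of .cel files (the path and the counts here are made up):

    library(affy)

    cel.files <- list.celfiles("/data/cel", full.names = TRUE)

    ## Time reading and RMA-processing a 10-chip batch.
    t10 <- system.time({
        eset <- rma(ReadAffy(filenames = cel.files[1:10]))
    })["elapsed"]

    ## Crude linear extrapolation to 10,000 chips.  In practice memory,
    ## not CPU time, is probably the first thing to break, since
    ## ReadAffy holds all the probe-level data in RAM at once.
    t10 / 10 * 10000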

I was just looking ahead two or three years--where is all this genomic
array research headed?  I guess I'm concerned about scalability.  

Is anyone really working on implementing affy on a cluster/Beowulf?
That sounds like a real challenge.
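
If not, the snow package looks like one possible route: farm the
per-chip preprocessing out to the nodes and combine afterward.  A
speculative sketch (the node count, the path, and the
one-chip-at-a-time shortcut are all assumptions on my part):

    library(snow)
    library(affy)

    cl <- makeCluster(8, type = "SOCK")   # e.g. 8 nodes of a cluster
    clusterEvalQ(cl, library(affy))       # load affy on every node

    cel.files <- list.celfiles("/data/cel", full.names = TRUE)

    ## Process each chip on whatever node is free.  A real pipeline
    ## would still need a cross-chip normalization step, which is
    ## exactly the part that does not parallelize trivially.
    per.chip <- parLapply(cl, cel.files, function(f)
        exprs(rma(ReadAffy(filenames = f))))

    stopCluster(cl)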

Regards,
Michael Benjamin, MD
-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com] 
Sent: Wednesday, December 03, 2003 9:47 PM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions

Another point about benchmarking: as has been discussed on R-help
before, benchmarks like the one you mentioned can be misleading.  It
measures linear algebra tasks, etc., but those typically account for
a very small portion of "average" tasks.  Doug Bates also pointed out
that the eigen() example used in that benchmark computes mostly
meaningless results.

In our experience, learning to use R more efficiently gives us the most
mileage, but large and fast hardware wouldn't hurt...
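
A toy illustration of the kind of difference meant by "using R more
efficiently" (timings will vary by machine; the example itself is
made up):

    x <- rnorm(1e6)

    ## Explicit loop: interpreted, element by element.
    system.time({
        s <- 0
        for (i in seq_along(x)) s <- s + x[i]^2
    })

    ## Vectorized equivalent: one call into compiled code, typically
    ## orders of magnitude faster.
    system.time(sum(x^2))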

Cheers,
Andy

> -----Original Message-----
> From: Michael Benjamin [mailto:msb1129 at bellsouth.net] 
> Sent: Wednesday, December 03, 2003 7:32 PM
> To: 'Liaw, Andy'
> Subject: RE: [BioC] R performance questions
> 
> 
> Thanks.
> Mike
> 
> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com] 
> Sent: Wednesday, December 03, 2003 8:17 AM
> To: 'Michael Benjamin'
> Subject: RE: [BioC] R performance questions
> 
> Hi Michael,
> 
> Just one comment about SVM: if you use the svm() function in the
> e1071 package to train a linear SVM, it will be rather slow.  That's
> a known limitation of libsvm, which the svm() function uses.  If you
> are willing to go outside of R, the "bsvm" package by C.-J. Lin (the
> same person who wrote libsvm) will train a linear SVM much more
> efficiently.
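> 
> For concreteness, the slow path is a call like this (a minimal
> sketch; the data dimensions are made up):
> 
>     library(e1071)
> 
>     x <- matrix(rnorm(150 * 500), nrow = 150)    # 150 samples, 500 genes
>     y <- factor(rep(c("A", "B"), length.out = 150))
>     fit <- svm(x, y, kernel = "linear")          # calls down into libsvm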
> 
> HTH,
> Andy
> 
> > -----Original Message-----
> > From: bioconductor-bounces at stat.math.ethz.ch 
> > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of 
> > Michael Benjamin
> > Sent: Tuesday, December 02, 2003 10:30 PM
> > To: bioconductor at stat.math.ethz.ch
> > Subject: [BioC] R performance questions
> > 
> > 
> > Hi, all--
> > 
> > I wanted to start a thread on R speed/benchmarking.  There is a nice
> > R benchmarking overview at http://www.sciviews.org/other/benchmark.htm,
> > along with a free script so you can see how your machine stacks up.
> > 
> > Looks like R is substantially faster than S-plus.
> > 
> > My problem is this: with 512MB and an overclocked AMD Athlon XP
> > 1800+, running at 588 SPEC-FP 2000, it still takes FOREVER to
> > analyze multiple .cel files using affy (expresso).  Running svm
> > takes a mighty long time with more than 500 genes, 150 samples.
> > 
> > Questions:
> > 1) Would adding RAM or processing speed improve performance the most?
> > 2) Is it possible to run R on a cluster without rewriting my
> > high-level code?  In other words,
> > 3) What are we going to do when we start collecting terabytes of
> > array data to analyze?  There will come a "breaking point" at which
> > desktop systems can't perform these analyses fast enough for large
> > quantities of data.  What then?
> > 
> > Michael Benjamin, MD
> > Winship Cancer Institute
> > Emory University,
> > Atlanta, GA
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
> > 