PCA with n << p (was [R] R-1.6.0 crashing on RedHat6.3)

Warnes, Gregory R gregory_r_warnes at groton.pfizer.com
Tue Oct 29 17:34:29 CET 2002


	      [Moderator's Note: This message needed manual interaction by me,
	       since the attachment originally was declared as
	        ``application/octet-stream'' even though it was only plain
	       text.  We do not allow octet-stream (aka binary!)
	       attachments on our mailing list -- for virus/spam filtering
	       reasons.  -- MM]

We have also encountered the problem Douglas discusses with prcomp.
Unfortunately, svd() function used by both R and S-Plus is very
inefficient when passed matrix that is very wide, but is very fast
when passed the transpose of  the same matrix.

Attached is a small file that resolved the problem for us.  It
defines two functions, fast.prcomp()  and fast.svd().

fast.svd() is simply a wrapper around the regular svd() that
checks if the number of columns is larger than the number of
rows.  If this is the case, it transposes everything, calls
svd, and then transposes back.  Otherwise it just calls svd
and returns the results.

fast.prcomp() is simply the R prcomp() function  from the mds
library modified to call fast.svd().   This allows it to be
used without worrying about whether the matrix is long (cluster chips)
or wide (cluster probesets).

I've tested this code on R 1.5.1, 1.6.0 and on S-Plus 2000.

-Greg


> -----Original Message-----
> From: Douglas Grove [mailto:dgrove at fhcrc.org]
> Sent: Tuesday, October 29, 2002 1:59 AM
> To: ripley at stats.ox.ac.uk
> Cc: Peter Dalgaard BSA; r-help at stat.math.ethz.ch
> Subject: Re: PCA with n << p (was [R] R-1.6.0 crashing on RedHat6.3)
>
>
> > princomp is the wrong tool here: prcomp is better (and a
> version using
> > La.svd would be better still).
>
> Okay. I used princomp as I saw something in your book regarding using
> this and NOT prcomp in S-plus, I thought the same would hold in R.
>
>
>
> > What do you want to do with a PCA of such a matrix?  We can almost
> > certainly give you a better way using La.svd directly.
>
> Looking at microarray data with approx. 5300 'genes' and 144 samples
> (72 control, 72 experimental).  Trying to see how well principal
> components can separate the different tissue types (Liver,
> Kidney, Testis)
> making up the samples.
>
> In any case, this code is just an example of the problem we're having
> with R (and S-plus) crashing with memory allocation related errors.
> I used this code since no one could poke holes in it :-)
>
> Thanks for the info on prcomp.  I'll try it out.
>
> Doug
>
>
>
> > BDR
> >
> > On 28 Oct 2002, Peter Dalgaard BSA wrote:
> >
> > > Douglas Grove <dgrove at fhcrc.org> writes:
> > >
> > > > > Are you sure that it is 6.3?? To my knowledge, there
> is nothing
> > > > > between 6.2 and 7.0. What's in /etc/redhat-release ?
> > > >
> > > > Sorry, I was told it was running 6.3. I just checked and
> > > > it's running 6.2.
> > >
> > > OK, so the Fortran problems are there, but the usual
> symptom of that
> > > is crash-on-load.
> > >
> > >
> > > > > The Fortran in RH6.x was rather badly broken for some
> packages, but
> > > > > one would expect that you had run into that before.
> 1.6.0 has a memory
> > > > > leak but it generally affects repeated applications
> of model fits,
> > > > > rather than big matrices.
> > > > >
> > > > > Do you really mean 144x5300 ? (more columns than
> rows) That's big: The
> > > > > covariance matrix at 5300x5300 will take more than
> 200 MB (OK, it
> > > > > might only be storing upper or lower triangle.) I
> tried a matrix like
> > > > > that on a 1.6.1beta system with about 0.75 GB and got
> an out of memory
> > > > > error. A 144x2500 problem is currently running in
> > > > >
> > > > >   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU
> %MEM   TIME COMMAND
> > > > > 16111 pd        17   0  207M 163M 18536 R    99.8
> 65.8   1:58 R.bin
> > > > >
> > > > > and seems to be staying there....
> > > >
> > > > Yep, it's 144x5300.  The machine has 2GB of RAM, and
> this uses about 1.5GB.
> > >
> > > Hmm. My half-size toy version conked out with
> > >
> > > > D <- matrix(rnorm(2500*144),ncol=2500)
> > > > library(mva)
> > > > pc.norm <- princomp(D,scores=FALSE)
> > > Error in princomp.default(D, scores = FALSE) :
> > >         covariance matrix is not non-negative definite
> > >
> > > which is a bit odd, but at least it didn't run out of
> memory. However,
> > > the tolerances seem to require some tweaking!
> > >
> > > --
> > >    O__  ---- Peter Dalgaard             Blegdamsvej 3
> > >   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
> > >  (*) \(*) -- University of Copenhagen   Denmark      Ph:
> (+45) 35327918
> > > ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX:
> (+45) 35327907
> > >
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> -.-.-.-.-.-.-.-.-
> > > r-help mailing list -- Read
> http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > Send "info", "help", or "[un]subscribe"
> > > (in the "body", not the subject !)  To:
> r-help-request at stat.math.ethz.ch
> > >
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> _._._._._._._._._
> > >
> >
> >
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> -.-.-.-.-.-.-.-.-
> r-help mailing list -- Read
> http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To:
> r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> _._._._._._._._._
>



LEGAL NOTICE
Unless expressly stated otherwise, this message is confidential and may be privileged. It is intended for the addressee(s) only. Access to this E-mail by anyone else is unauthorized. If you are not an addressee, any disclosure or copying of the contents of this E-mail or any action taken (or not taken) in reliance on it is unauthorized and may be unlawful. If you are not an addressee, please inform the sender immediately.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast.prcomp.R
Type: plain/text
Size: 2020 bytes
Desc: not available
Url : https://stat.ethz.ch/pipermail/r-help/attachments/20021029/e6190a76/fast.prcomp.bin


More information about the R-help mailing list