[R] PCA sensitive to outliers?

Martin Maechler maechler at stat.math.ethz.ch
Mon Apr 23 10:08:33 CEST 2012


>>>>> "SL" == Steve Lianoglou <mailinglist.honeypot at gmail.com>
>>>>>     on Mon, 23 Apr 2012 01:10:31 -0400 writes:

    SL> On Mon, Apr 23, 2012 at 12:01 AM, Michael
    SL> <comtech.usa at gmail.com> wrote:
    >> yes, but that is not a good Review or Survey... thx

    SL> But the packages listed there do have their own
    SL> documentation and vignettes. For instance the rrcov
    SL> package seems to have a nice vignette about its design
    SL> as well as methods it implements, and references to
    SL> these methods for further reading:

    SL> http://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf

    SL> You'll see at least a few mentions of PCA, which will
    SL> lead you to other package/papers/etc.

Yes, indeed, thanks Steve!

Unfortunately, the topic of robust PCA 
is not quite trivial, and has been approached (too) many times...

As maintainer of the robust task view, I'd indeed strongly
recommend working with 'rrcov' or 'robustbase' which already
contains an important subset of rrcov's robust covariance matrix
estimators. 

Note that the historically earliest robust covariance estimator
available in an R package is  cov.rob() from MASS (a 'Recommended'
package that ships with every R installation).
*And* you can use standard R's  
      princomp(x, ... , covmat = <robust.cov>(x))
to get robust PCA.
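
A minimal sketch of that recipe, using cov.rob() from MASS as the
<robust.cov> placeholder. The simulated data, the planted outliers and
the object names (x, pc_classic, pc_robust) are illustrative assumptions
and not part of the original advice:

```r
## Sketch only: classical vs. robust PCA via princomp(covmat = ...).
## Data and object names below are made up for illustration.
library(MASS)   # 'Recommended' package: provides mvrnorm() and cov.rob()

set.seed(1)
## 100 strongly correlated bivariate normal points ...
x <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2))
colnames(x) <- c("V1", "V2")
## ... of which the first 5 are replaced by gross outliers near (10, -10)
x[1:5, ] <- cbind(rnorm(5, mean = 10), rnorm(5, mean = -10))

pc_classic <- princomp(x)                       # classical covariance
pc_robust  <- princomp(x, covmat = cov.rob(x))  # robust covariance list

## Compare the first principal axis of the two fits
print(loadings(pc_classic)[, 1])
print(loadings(pc_robust)[, 1])
```

With a few gross outliers like these, the classical first axis would be
expected to tilt towards the outliers, while the robust fit should stay
close to the direction of the bulk of the data. Other robust covariance
estimators that return a list with 'cov', 'center' and 'n.obs'
components (for instance those in 'robustbase') should be usable in the
same way.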

I'll add a note with that to the 'Robust' CRAN task view.

Martin Maechler, 
ETH Zurich

    SL> Enjoy,
    SL> -steve

    >> 
    >> On Sun, Apr 22, 2012 at 9:47 PM, Bert Gunter
    >> <gunter.berton at gene.com> wrote:
    >> 
    >>> As I believe I already told you, look at the CRAN Robust
    >>> task view.
    >>> 
    >>> -- Bert
    >>> 
    >>> On Sun, Apr 22, 2012 at 6:29 PM, Michael
    >>> <comtech.usa at gmail.com> wrote:
    >>> > Even in R, there are so many of "robust PCA"... any survey
    >>> > or review of all these different methods?
    >>> >
    >>> > On Sun, Apr 22, 2012 at 6:58 PM, Joshua Wiley
    >>> > <jwiley.psych at gmail.com> wrote:
    >>> >
    >>> >> On Sun, Apr 22, 2012 at 4:43 PM, Michael
    >>> >> <comtech.usa at gmail.com> wrote:
    >>> >> > I actually tried "robustPca" in "pcaMethods" on
    >>> >> > Bioconductor.
    >>> >> >
    >>> >> > It keeps giving me the warning "Input data is not
    >>> >> > complete"...
    >>> >> >
    >>> >> > Reading into the function:
    >>> >> >
    >>> >> > When there is no "NA"s, it will give this
    >>> >> > warning...
    >>> >> >
    >>> >> > It seems that there is a bug in this code...
    >>> >> >
    >>> >> > Is it reliable at all?
    >>> >> >
    >>> >> > ---------------------
    >>> >> >
    >>> >> >
    >>> >> > > robustPca
    >>> >> > function (Matrix, nPcs = 2, verbose = interactive(), ...)
    >>> >> > {
    >>> >> >    nas <- is.na(Matrix)
    >>> >> >    if (!any(nas) & verbose) {
    >>> >> >        cat("Input data is not complete.\n")
    >>> >> >        cat("Scores, R2 and R2cum may be inaccurate, handle with care\n")
    >>> >> >    }
    >>> >>
    >>> >> that seems to issue the notes when there are *not any
    >>> >> missing* and verbose is TRUE.  I would submit a bug
    >>> >> report to the author.
    >>> >>
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> > On Fri, Apr 20, 2012 at 9:58 AM, Kevin Wright
    >>> >> > <kw.stat at gmail.com> wrote:
    >>> >> >
    >>> >> >> You can also have a look at the pcaMethods package
    >>> >> >> on Bioconductor.
    >>> >> >>
    >>> >> >> Kevin
    >>> >> >>
    >>> >> >>
    >>> >> >> On Thu, Apr 19, 2012 at 11:20 PM, Michael
    >>> >> >> <comtech.usa at gmail.com> wrote:
    >>> >> >>
    >>> >> >>> Hi all,
    >>> >> >>>
    >>> >> >>> I found that the PCA gave chaotic results when
    >>> >> >>> there are big changes in a few data points.
    >>> >> >>>
    >>> >> >>> Are there "improved" versions of PCA in R that
    >>> >> >>> can help with this problem?
    >>> >> >>>
    >>> >> >>> Please give me some pointers...
    >>> >> >>>
    >>> >> >>> Thank you!
    >>> >> >>>
    >>> >> >>>        [[alternative HTML version deleted]]
    >>> >> >>>
    >>> >> >>> ______________________________________________
    >>> >> >>> R-help at r-project.org mailing list
    >>> >> >>> https://stat.ethz.ch/mailman/listinfo/r-help
    >>> >> >>> PLEASE do read the posting guide
    >>> >> >>> http://www.R-project.org/posting-guide.html
    >>> >> >>> and provide commented, minimal, self-contained,
    >>> >> >>> reproducible code.
    >>> >> >>>
    >>> >> >>
    >>> >> >>
    >>> >> >>
    >>> >> >> --
    >>> >> >> Kevin Wright
    >>> >> >>
    >>> >> >>
    >>> >> >
    >>> >>
    >>> >>
    >>> >>
    >>> >> --
    >>> >> Joshua Wiley
    >>> >> Ph.D. Student, Health Psychology
    >>> >> Programmer Analyst II, Statistical Consulting Group
    >>> >> University of California, Los Angeles
    >>> >> https://joshuawiley.com/
    >>> >>
    >>> >
    >>> 
    >>> 
    >>> 
    >>> --
    >>> 
    >>> Bert Gunter
    >>> Genentech Nonclinical Biostatistics
    >>> 
    >>> Internal Contact Info:
    >>> Phone: 467-7374
    >>> Website:
    >>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
    >>> 
    >> 



    SL> --
    SL> Steve Lianoglou
    SL> Graduate Student: Computational Systems Biology
    SL>  | Memorial Sloan-Kettering Cancer Center
    SL>  | Weill Medical College of Cornell University
    SL> Contact Info: http://cbio.mskcc.org/~lianos/contact



