[R] Principal Component Analysis - Selecting components? + right choice?

Corrado ct529 at york.ac.uk
Wed Dec 17 12:34:24 CET 2008


I have been testing some of the suggested alternative approaches. The best PC
set may not be the best predictor subset, but is it really true that this is
generally not the case? If you have to explore data patterns and (potential)
relationships between a response variable and a large set of candidate
predictors, PCs still seem to be the best candidate for a relatively quick
test. I think sometimes you have to trade accuracy off against time (for
example, computing time), and if any pattern emerges from response vs. the
first k PCs then you investigate further ... am I completely wrong there? What
alternative do you have that so drastically reduces the computational cost of
exploratory analysis?
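
A minimal sketch of the kind of quick screen described above, assuming
simulated data: the matrix X, the response y, the three hidden drivers and
k = 6 are all made up for illustration and are not the climate data from this
thread. It regresses the response on the first k PC scores and, because the
scores are mutually uncorrelated, splits the R^2 of the fit into one piece per
PC.

```r
# Simulated stand-in for a set of strongly correlated covariates (assumption).
set.seed(1)
n <- 500
p <- 36
latent   <- matrix(rnorm(n * 3), n, 3)        # three hidden drivers
loadings <- matrix(rnorm(3 * p), 3, p)
X <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.5), n, p)
y <- latent[, 1] - 0.5 * latent[, 2] + rnorm(n, sd = 0.5)  # hypothetical response

# PCA on the standardised covariates.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Screen a few more PCs than are expected to matter.
k   <- 6
dat <- data.frame(y = y, pca$x[, 1:k])
fit <- lm(y ~ ., data = dat)

# PC scores are centred and mutually orthogonal, so the R^2 of the fit
# is the sum of the squared correlations between y and each PC.
r2_each <- drop(cor(y, pca$x[, 1:k])^2)
print(round(r2_each, 3))
print(summary(fit)$r.squared)
```

Looking at r2_each rather than only at the variance explained in X shows
whether the response is actually related to the leading PCs or to later ones.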

Furthermore, is it really generally not the case that the best PC set (say,
the top k PCs) contains the best predictor subset in linear regression, or does
that happen only in specific situations (that is, the best PC set is generally
a good set of predictors, but in some specific cases it is not)?
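
To make the extreme case from the quoted reply below concrete, here is a
hand-built sketch on simulated data (not the Draper and Smith data set; X, n
and p are assumptions for illustration). The response is constructed, without
noise, to lie along the last eigenvector of the correlation matrix, so only
the last PC is related to it.

```r
set.seed(2)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.3)   # make two columns strongly correlated

Z   <- scale(X)                          # standardise: PCA of the correlation matrix
pca <- prcomp(Z)

# Regression coefficients in the direction of the last eigenvector (noise-free).
v_last <- pca$rotation[, p]
y <- drop(Z %*% v_last)

# Squared correlation of y with each PC: the last PC carries everything,
# the first p - 1 PCs carry essentially nothing.
r2_each <- drop(cor(y, pca$x)^2)
print(round(r2_each, 3))
```

This only shows that the situation can be constructed; it says nothing about
how often it arises with real data, which is the question above.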


On Thursday 11 December 2008 17:30:51 you wrote:
> Hi,
> It is generally not the case that the best PC set, say, the top k PCs
> (where k < p, p being the number of predictors) contains the best predictor
> subset in linear regression.  Hadi and Ling (Amer Stat, 1998) show that it
> is even possible to have an extreme situation where the first (p-1) PCs
> contribute nothing towards explaining the variation in the response, yet
> the last PC alone contributes everything.   Their theorem is that if the
> true vector of regression coefficients is in the direction of the j-th
> eigenvector (of the correlation matrix), then the j-th PC alone will
> contribute everything to the model fit, while the remaining PCs will
> contribute zilch.  They illustrate this phenomenon with a "real" data set
> from a classic text on regression, Draper and Smith.
> Ravi.
> ---------------------------------------------------------------------------
> Ravi Varadhan, Ph.D.
> Assistant Professor, The Center on Aging and Health
> Division of Geriatric Medicine and Gerontology
> Johns Hopkins University
> Ph: (410) 502-2619
> Fax: (410) 614-9625
> Email: rvaradhan at jhmi.edu
> Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
> ---------------------------------------------------------------------------
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of S Ellison
> Sent: Thursday, December 11, 2008 9:37 AM
> To: r-help at r-project.org; Corrado
> Subject: Re: [R] Principal Component Analysis - Selecting components? +
> right choice?
> If you're intending to create a model using PCs as predictors, select the
> PCs based on whether they contribute significantly to the model fit.
> In chemometrics (multivariate stats in chemistry, among other things), if
> we're expecting 3 or 4 PC's to be useful in a principal component
> regression, we'd generally start with at least the first half-dozen or so
> and let the model fit sort them out.
> The reason for not preselecting too rigorously early on is that there's no
> guarantee at all that the first couple of PC's are good predictors for what
> you're interested in. They're properties of the predictor set, not of the
> response set.
> Mind you, there used to be something of a gap between chemometrics and
> proper statistics; I'm sure chemometricians used to do things with data
> that would turn a statistician pale.
> You could also look for a PLS model, which (if I recall correctly) actually
> uses the response data to select the latent variables used for prediction.
> S
> >>> Corrado <ct529 at york.ac.uk> 11/12/2008 11:46:37 >>>
> Dear R gurus,
> I have some climatic data for a region of the world. They are monthly
> averages 1950-2000 of precipitation (12 months), minimum temperature (12
> months), maximum temperature (12 months). I have scaled them to 2 km x 2 km
> cells, and I have around 75,000 cells.
> I need to feed them into a statistical model as co-variates, to use them to
> predict a response variable.
> The climatic data are obviously correlated: precipitation for January is
> correlated to precipitation for February and so on .... even precipitation
> and temperature are heavily correlated. I did some correlation analysis and
> they are all strongly correlated.
> I thought of running PCA on them, in order to reduce the number of
> co-variates I feed into the model.
> I ran the PCA using prcomp, quite successfully. Now I need to use a
> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
> What criterion would you suggest?
> At the moment, I am using a criterion based on a threshold, but that is
> highly subjective, even if there are some rules of thumb (Jolliffe,
> Principal Component Analysis, 2nd Edition, Springer Verlag, 2002).
> Could you suggest something more rigorous?
> By the way, do you think I would have been better off by using something
> different from PCA?
> Best,
> --
> Corrado Topi
> Global Climate Change & Biodiversity Indicators Area 18,Department of
> Biology University of York, York, YO10 5YW, UK
> Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
