[R] Principal Component Analysis - Selecting components? + right choice?

Ravi Varadhan RVaradhan at jhmi.edu
Thu Dec 11 18:30:51 CET 2008


It is generally not the case that the best PC set, say, the top k PCs (where
k < p, p being the number of predictors), contains the best predictor subset
in linear regression.  Hadi and Ling (Amer Stat, 1998) show that it is even
possible to have an extreme situation where the first (p-1) PCs contribute
nothing towards explaining the variation in the response, yet the last PC
alone contributes everything.   Their theorem is that if the true vector of
regression coefficients is in the direction of the j-th eigenvector (of the
correlation matrix), then the j-th PC alone will contribute everything to
the model fit, while the remaining PCs will contribute zilch.  They
illustrate this phenomenon with a "real" data set from a classic text on
regression, Draper and Smith.
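
A minimal simulation sketch of this result, assuming made-up correlated
predictors rather than the Draper and Smith data. The response is built so
that its true coefficient vector lies along the last eigenvector; the sample
size, noise level, and variable names are illustrative assumptions.

```r
# Simulated illustration (NOT the Draper and Smith data): the response is
# built along the LAST eigenvector of the predictor correlation matrix.
set.seed(1)
n <- 200
p <- 4
common <- rnorm(n)
X <- matrix(rnorm(n * p), n, p) + 2 * common   # strongly correlated columns

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# True coefficients (on the standardized predictors) point along eigenvector p,
# so the noise-free response equals the p-th PC score.
beta <- pc$rotation[, p]
y <- drop(scale(X) %*% beta) + rnorm(n, sd = 0.1 * pc$sdev[p])

# R-squared from regressing y on each PC separately
r2 <- sapply(seq_len(p), function(j) summary(lm(y ~ pc$x[, j]))$r.squared)
print(round(r2, 3))
```

In this setup the last PC should account for nearly all of the fit and the
first (p-1) PCs for almost none, even though the last PC has the smallest
variance.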


Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of S Ellison
Sent: Thursday, December 11, 2008 9:37 AM
To: r-help at r-project.org; Corrado
Subject: Re: [R] Principal Component Analysis - Selecting components? +
right choice?

If you're intending to create a model using PCs as predictors, select the
PCs based on whether they contribute significantly to the model fit.

In chemometrics (multivariate stats in chemistry, among other things), if
we're expecting 3 or 4 PC's to be useful in a principal component
regression, we'd generally start with at least the first half-dozen or so
and let the model fit sort them out.

The reason for not preselecting too rigorously early on is that there's no
guarantee at all that the first couple of PC's are good predictors for what
you're interested in. They're properties of the predictor set, not of the
response set.
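
A rough sketch of that workflow in base R, on simulated data. The data, the
choice of six PCs, and the 5% cutoff are all illustrative assumptions, not a
recommendation.

```r
set.seed(2)
n <- 300
p <- 12
common <- rnorm(n)
X <- matrix(rnorm(n * p), n, p) + 2 * common   # correlated predictors
y <- X[, 1] - X[, 2] + rnorm(n)                # hypothetical response

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Start with the first six PCs rather than preselecting two or three
dat <- data.frame(y = y, pc$x[, 1:6])
fit <- lm(y ~ ., data = dat)
coefs <- summary(fit)$coefficients

# PCs whose coefficients differ from zero at the (assumed) 5% level
pvals <- coefs[-1, "Pr(>|t|)"]
keep <- names(pvals)[pvals < 0.05]
print(round(pvals, 4))
print(keep)
```

Because the PC scores are mutually uncorrelated, dropping one PC from the
model does not change the estimated coefficients of the others.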

Mind you, there used to be something of a gap between chemometrics and
proper statistics; I'm sure chemometricians used to do things with data that
would turn a statistician pale. 

You could also look for a PLS model, which (if I recall correctly) actually
uses the response data to select the latent variables used for prediction.
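
As a sketch of how the response enters, the first PLS weight vector for a
single response is proportional to X'y (with X and y centered), whereas the
first PC loading depends on X alone. The example below computes only that
first component by hand on simulated data; a full implementation (for example
plsr() in the pls package) extracts further components and would normally be
used instead.

```r
set.seed(3)
n <- 500
p <- 6
common <- rnorm(n)
X <- scale(matrix(rnorm(n * p), n, p) + common)   # centered, scaled predictors
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.5)         # hypothetical response
yc <- y - mean(y)

# First PLS weight vector: proportional to X'y, so it depends on the response
w <- drop(crossprod(X, yc))
w <- w / sqrt(sum(w^2))
t_pls <- drop(X %*% w)            # first PLS score

# First PC score: computed from X only
t_pca <- prcomp(X)$x[, 1]

# Squared correlation of each single score with the response
r2 <- c(pls1 = cor(t_pls, y)^2, pc1 = cor(t_pca, y)^2)
print(round(r2, 3))
```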


>>> Corrado <ct529 at york.ac.uk> 11/12/2008 11:46:37 >>>
Dear R gurus,

I have some climatic data for a region of the world. They are monthly
averages for 1950-2000 of precipitation (12 months), minimum temperature (12
months), maximum temperature (12 months). I have scaled them to 2 km x 2 km
cells, and I have around 75,000 cells.

I need to feed them into a statistical model as co-variates, to use them to
predict a response variable.

The climatic data are obviously correlated: precipitation for January is
correlated to precipitation for February and so on .... even precipitation
and temperature are heavily correlated. I did some correlation analysis and
they are all strongly correlated.

I thought of running PCA on them, in order to reduce the number of
co-variates I feed into the model.

I ran the PCA using prcomp, quite successfully. Now I need a criterion
to select the right number of PCs (that is: is it 1, 2, 3, 4?).

What criteria would you suggest?

At the moment, I am using a criterion based on a threshold, but that is highly
subjective, even if there are some rules of thumb (Jolliffe, Principal
Component Analysis, II Edition, Springer Verlag, 2002).
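
For concreteness, a sketch of two widely used threshold-style rules applied
to prcomp output: a cumulative-variance cutoff and the eigenvalue-greater-
than-one rule. The data are a simulated stand-in for the 36 monthly
covariates, and the 90% cutoff is an arbitrary assumption.

```r
set.seed(4)
n <- 500
# Hypothetical stand-in for 36 monthly covariates: three blocks of 12
# (precipitation, tmin, tmax), each driven by its own latent factor.
make_block <- function(n, m) matrix(rnorm(n * m), n, m) + 2 * rnorm(n)
X <- cbind(make_block(n, 12), make_block(n, 12), make_block(n, 12))

pc <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pc$sdev^2                          # eigenvalues of the correlation matrix
cumvar <- cumsum(eig) / sum(eig)          # cumulative proportion of variance

k_threshold <- which(cumvar >= 0.90)[1]   # assumed 90% cutoff
k_kaiser <- sum(eig > 1)                  # eigenvalue-greater-than-one rule

print(round(cumvar[1:6], 3))
print(c(threshold = k_threshold, kaiser = k_kaiser))
```

On data like these the two rules give different answers, which is one way to
see how subjective a threshold is. Neither rule looks at the response.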

Could you suggest something more rigorous?

By the way, do you think I would have been better off by using something
different from PCA?

Corrado Topi

Global Climate Change & Biodiversity Indicators Area 18,Department of
Biology University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk 
