[R] Principal Component Analysis - Selecting components? + right choice?

S Ellison S.Ellison at lgc.co.uk
Thu Dec 11 15:36:55 CET 2008


If you're intending to create a model using PCs as predictors, select
the PCs based on whether they contribute significantly to the model
fit.

In chemometrics (multivariate stats in chemistry, among other things),
if we're expecting 3 or 4 PCs to be useful in a principal component
regression, we'd generally start with at least the first half-dozen or
so and let the model fit sort them out.
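
A minimal sketch of that workflow in R, assuming simulated stand-in
data (the matrix X, the response y and the choice of six PCs are all
illustrative, not taken from the original question). It fits an
ordinary lm() on the first six PC scores and then inspects which of
them the fit actually supports; step() is only one possible way of
letting the model fit do the sorting.

```r
# Simulated stand-in data: 36 correlated predictors driven by 4 latent
# factors, with a response that depends on only one of those factors.
set.seed(1)
n <- 500
latent <- matrix(rnorm(n * 4), n, 4)
loadings <- matrix(rnorm(4 * 36), 4, 36)
X <- latent %*% loadings + matrix(rnorm(n * 36, sd = 0.5), n, 36)
y <- 2 * latent[, 3] + rnorm(n)

# PCA on the predictors only (centred and scaled).
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Start generously: regress the response on the first six PC scores.
scores <- as.data.frame(pca$x[, 1:6])
scores$y <- y
fit <- lm(y ~ ., data = scores)

# Coefficient table: estimates, standard errors, t values, p values.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# One option for pruning: AIC-based stepwise selection.
reduced <- step(fit, trace = 0)
print(formula(reduced))
```

Because PC scores are mutually uncorrelated, dropping one PC from the
model leaves the estimated coefficients of the others unchanged, which
makes this kind of pruning easier to interpret than with the raw
predictors.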

The reason for not preselecting too rigorously early on is that there's
no guarantee at all that the first couple of PCs are good predictors
for what you're interested in. They're properties of the predictor set,
not of the response set.
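
A small illustration of that point, as a sketch on made-up data (the
variables x1, x2 and y are hypothetical). One predictor has a large
variance but no link to the response; the other has a small variance
and drives the response. PCA ranks directions by predictor variance
alone, so the PC with the most variance is the one that predicts
least.

```r
set.seed(2)
n <- 1000
x1 <- rnorm(n, sd = 10)   # high variance, unrelated to the response
x2 <- rnorm(n, sd = 1)    # low variance, drives the response
X <- cbind(x1, x2)
y <- x2 + rnorm(n, sd = 0.1)

# Unscaled PCA, so PC1 follows the high-variance direction (x1).
pca <- prcomp(X)
print(summary(pca))       # PC1 carries nearly all predictor variance

# Absolute correlation of each PC score with the response.
r <- abs(cor(pca$x, y))
print(r)
```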

Mind you, there used to be something of a gap between chemometrics and
proper statistics; I'm sure chemometricians used to do things with data
that would turn a statistician pale. 

You could also look for a PLS model, which (if I recall correctly)
actually uses the response data to select the latent variables used for
prediction.
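
A hedged sketch of what that could look like, assuming the CRAN
package 'pls' is installed (it is not part of base R) and again using
simulated stand-in data. plsr() fits the model and RMSEP() reports
the cross-validated prediction error for each number of latent
variables; the choice of six components and of default
cross-validation is illustrative only.

```r
library(pls)  # assumption: the CRAN 'pls' package is installed

# Simulated stand-in data, same structure as a correlated predictor set.
set.seed(3)
n <- 300
latent <- matrix(rnorm(n * 4), n, 4)
X <- latent %*% matrix(rnorm(4 * 36), 4, 36) +
  matrix(rnorm(n * 36, sd = 0.5), n, 36)
y <- 2 * latent[, 3] + rnorm(n)

# Keep the predictor matrix as a single column of the data frame.
d <- data.frame(y = y, X = I(X))

# PLS regression with up to six latent variables, cross-validated.
fit <- plsr(y ~ X, ncomp = 6, data = d, scale = TRUE, validation = "CV")

# Cross-validated RMSEP by number of components.
print(RMSEP(fit))
```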

S

>>> Corrado <ct529 at york.ac.uk> 11/12/2008 11:46:37 >>>
Dear R gurus,

I have some climatic data for a region of the world. They are monthly
averages for 1950-2000 of precipitation (12 months), minimum
temperature (12 months) and maximum temperature (12 months). I have
scaled them to 2 km x 2 km cells, and I have around 75,000 cells.

I need to feed them into a statistical model as co-variates, to use
them to predict a response variable.

The climatic data are obviously correlated: precipitation for January
is correlated to precipitation for February, and so on. Even
precipitation and temperature are heavily correlated. I did some
correlation analysis and they are all strongly correlated.

I thought of running PCA on them, in order to reduce the number of
co-variates I feed into the model.

I ran the PCA using prcomp, quite successfully. Now I need to use a
criterion to select the right number of PCs (that is: is it 1, 2, 3,
4?).

What criteria would you suggest?

At the moment, I am using a criterion based on a threshold, but that is
highly subjective, even if there are some rules of thumb (Jolliffe,
Principal Component Analysis, 2nd Edition, Springer Verlag, 2002).

Could you suggest something more rigorous?

By the way, do you think I would have been better off using something
different from PCA?

Best,
-- 
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18,Department of Biology
University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk 
