[R] Questions about results from PCAproj for robust principal component analysis
P.Filzmoser at tuwien.ac.at
Wed Feb 14 20:43:13 CET 2007
PCAproj is mainly designed for robust PCA and not for classical PCA.
Therefore, when applying classical estimators to the results of a
robust PCA, like the mean to the robust PCA scores, this will usually
not give zeros. The robust PCs have been centred robustly, and
not classically by the mean.
In your case you ran PCAproj with the default method="mad", thus
robust PCA was performed, maximizing the mad instead of the usual
variance (standard deviation). Moreover, you used the default for
centring the PCs, which is center="l1median", so a robust centring
rather than centring by the classical mean (which would have been
the zero vector because your data were already classically centred).
Run the procedure with the options
center=mean and method="sd"
which then gives classical PCA. The mean of the resulting PCA
scores will be 0. However, the scores will not be orthogonal
(or uncorrelated), but close to. The reason is that PCAproj
is an axxproximative algorithm for finding the (robust) PCs.
The eigen-analysis of princomp gives the exact solution (but
it is not robust).
The wrong length of result$scale and result$center is definitely
an error in the procedure that I will have to change quickly.
For data sets with more columns than rows we automatically
apply a singular value decomposition (SVD) to reduce the
dimensionality (without information loss). Then we perform
centring and scaling in the space of reduced dimension, but
we did not back-transform these values - sorry. Will be repaired
Talbot Katz wrote:
> I have been looking at the PCAproj function in package pcaPP (R 2.4.1)
> for robust principal components, and I'm trying to interpret the
> results. I started with a data matrix of dimensions RxC (R is the
> number of rows / observations, C the number of columns / variables).
> PCAproj returns a list of class princomp, similar to the output of the
> function princomp. In a case where I can run princomp, I would get the
> following, from executing dmpca = princomp(datamatrix) :
> - the vector, sdev, of length C, contains the standard deviations of
> the components in
> order by descending value; the squares are the eigenvalues of the
> covariance matrix
> - the matrix, loadings, has dimension CxC, and the columns are the
> eigenvectors of the
> covariance matrix, in the same order as the sdev vector; the
> columns are
> sum(dmpca$loadings[,i]*dmpca$loadings[,j]) = 1 if i == j, ~ 0 if
> i != j
> - the vector, center, of length C, contains the means of the variable
> columns in the original
> data matrix, in the same order as the original columns
> - the vector, scale, of length C, contains the scalings applied to
> each variable, in the same
> order as the original columns
> - n.obs contains the number of observations used in the computation;
> this number equals
> R when there is no missing data
> - the matrix, scores, has dimension RxC, and it can be thought of as
> the projection of the
> eigenvector matrix, loadings, back onto the original data; these
> columns of
> scores are the principal components. princomp typically removes
> the mean,
> so the formula is:
> dmpca$scores = t(t(datamatrix) - dmpca$center)%*%dmpca$loadings
> and apply(dmpca$scores,2,mean) returns a length C vector of
> zeroes; also the principal components (columns of scores) are
> (but not orthonormal):
> sum(dmpca$scores[,i]*dmpca$scores[,j]) ~ 0 if i != j, > 0 if i == j
> - call contains the function call, in this case princomp(x = datamatrix)
> That is all as it should be.
> In my case R < C, which produces singular results for standard PCA, but
> robust methods, like PCAproj, are designed to handle this. Also, I had
> "de-meaned" the data beforehand, so apply(datamatrix,2,mean) produces a
> length C vector of (effectively) zeroes. I ran the following:
> to get the first four robust components. When I look at the princomp
> object returned as dmpcaprj, some of the results are just what I
> expect. For example,
> - dmpcaprj$loadings has dimensions Cx4, as expected, and the first
> four eigenvectors of
> the (robust) covariance matrix are orthonormal:
> sum(dmpcaprj$loadings[,i]*dmpcaprj$loadings[,j]) = 1 if i == j,
> ~ 0 if i != j
> - dmpcaprj$sdev contains the square roots of the four corresponding
> - dmpcaprj$n.obs equals R.
> - dmpcaprj$scores has dimensions Rx4, as it should.
> HOWEVER, the columns of dmpcaprj$scores are neither de-meaned nor
> orthogonal. So,
> apply(dmpcaprj$scores,2,mean) is a non-zero vector, and
> sum(dmpcaprj$scores[,i]*dmpcaprj$scores[,j]) != 0 if i != j, > 0
> if i == j
> - dmpcaprj$scale is in this case a vector of all 1's, as expected.
> But the length is C, not R.
> - dmpcaprj$center is a vector of length C, not R, and the entries are
> not equal to either
> apply(datamatrix,1,mean) or apply(datamatrix,2,mean); I can't
> figure out
> where they came from.
> One interesting thing is that the columns of the Rx4 matrix,
> dmpcaprj$scores - datamatrix%*%dmpcaprj$loadings
> are all identically constant vectors, such that each row equals
> apply(dmpcaprj$scores,2,mean), since
> apply(datamatrix%*%dmpcaprj$loadings,2,mean) is a length four vector of
> (effectively) zeroes,
> but I can't interpret the values of these means of dmpcaprj$scores.
> Can anyone please explain to me what is happening with the scores,
> scale, and center parts of the PCAproj results?
> -- TMK --
> 212-460-5430 home
> 917-656-5351 cell
From: Prof. Dr. Peter Filzmoser
Dept. of Statistics & Probability Theory
Vienna University of Technology
Wiedner Hauptstrasse 8-10
A-1040 Vienna, Austria
Tel. +43 1 58801/10733
Fax. +43 1 58801/10799
E-mail: P.Filzmoser at tuwien.ac.at
More information about the R-help