[R] Several PCA questions...

Tue Jun 29 17:49:13 CEST 2004

Perhaps this question is less dumb... (in context below...)

On Tue, 29 Jun 2004, Prof Brian Ripley wrote:

>On Tue, 29 Jun 2004, Dan Bolser wrote:
>
>> Hi, I am doing PCA on several columns of data in a data.frame.
>> 
>> I am interested in particular rows of data which may have a particular
>> combination of 'types' of column values (without any pre-conception of
>> what they may be).
>> 
>> I do the following...
>> 
>> # My data table.
>> allDat <- read.table("big_select_thresh_5", header=TRUE)
>> 
>> # Where some rows look like this...
>> # PDB     SUNID1  SUNID2  AA      CH      IPCA    PCA     IBB     BB
>> # 3sdh    14984   14985   6       10      24      24      93      116
>> # 3hbi    14986   14987   6       10      20      22      94      117
>> # 4sdh    14988   14989   6       10      20      20      104     122
>> 
>> # NB First three columns = row ID, last 6 = variables
>> 
>> attach(allDat)
>> 
>> # My columns of interest (variables).
>> part <- data.frame(AA,CH,IPCA,PCA,IBB,BB)
>> 
>> pc <- princomp(part)
>
>Do you really want an unscaled PCA on that data set?  Looks unlikely (but 
>then two of the columns are constant in the sample, which is also 
>worrying).

The constant columns are just sample bias in the few rows I pasted. By
'unscaled', I assume you mean something like 'not normalized'?
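
For reference, a minimal sketch of how I understand 'scaled' versus
'unscaled': a scaled PCA works on the correlation matrix (every variable
rescaled to unit variance), which princomp() does with cor = TRUE. The toy
data frame below is invented for illustration and is not my real table;
only the column names are the same.

```r
# Hypothetical toy data: values are made up, only the column names match.
set.seed(1)
toy <- data.frame(AA   = rpois(50, 6),
                  CH   = rpois(50, 10),
                  IPCA = rpois(50, 20),
                  PCA  = rpois(50, 22),
                  IBB  = rpois(50, 95),
                  BB   = rpois(50, 120))

pc_unscaled <- princomp(toy)              # covariance matrix: high-variance columns dominate
pc_scaled   <- princomp(toy, cor = TRUE)  # correlation matrix: each column has unit variance

# Proportion of total variance carried by each component in the two analyses.
prop_unscaled <- pc_unscaled$sdev^2 / sum(pc_unscaled$sdev^2)
prop_scaled   <- pc_scaled$sdev^2   / sum(pc_scaled$sdev^2)
print(round(rbind(unscaled = prop_unscaled, scaled = prop_scaled), 3))
```

Comparing the two rows of that table should show whether one large-variance
column is driving the first component.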

>> plot(pc)
>> 
>> The above plot shows that 95% of the variance is due to the first
>> 'Component' (which I assume is AA).
>
>No, it is the first (principal) component.  You did ask for P>C<A!
>
>> i.e. all the variables behave in much the same way.
>
>Or you failed to scale the data so one dominates.

Yes.

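I added the following to the above...

# Divide each column by its own mean. Note that part/x would not do this:
# R recycles x down the rows of each column, not across the columns.
x <- colMeans(part)
partNorm <- sweep(part, 2, x, "/")
pc1 <- princomp(partNorm)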

plot(pc1)

biplot(pc1)

Which shows two major components, and possibly a third.

What I want to know is: given that my data are not uniformly distributed,
is my normalization valid?
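
A note on the division step itself: to divide each column by its own mean,
sweep() (or scale() with center = FALSE) seems the reliable route, because
a bare vector in 'data frame / vector' is recycled down the rows of each
column rather than across the columns. A small sketch on invented numbers:

```r
# Hypothetical 3 x 3 example; the values are chosen only to make the effect visible.
toy <- data.frame(a = c(1, 2, 3),
                  b = c(10, 20, 30),
                  c = c(100, 200, 300))
m <- colMeans(toy)               # one mean per column: 2, 20, 200

byCol <- sweep(toy, 2, m, "/")   # column j is divided by m[j]
naive <- toy / m                 # m is recycled down the rows instead

print(colMeans(byCol))           # every column mean is 1 after column-wise division
print(isTRUE(all.equal(as.matrix(naive), as.matrix(byCol))))  # the two results differ
```

Whether dividing by the mean is a sensible scaling for skewed variables is
a separate question; the sketch only checks that the arithmetic does what
was intended.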

I know I should find this out via further investigation of PCA, but in
general if my variables have a very skewed distribution (possibly without
a theoretically definable mean) should I attempt to use any standard
clustering technique?

I guess I should log transform my data.
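
Something like the following is what I have in mind: a minimal sketch on
invented, strictly positive, right-skewed data (with zeros in the real
table I would presumably need log1p() or similar instead of log()):

```r
# Hypothetical skewed data drawn from log-normal distributions.
set.seed(2)
toy <- data.frame(AA  = rlnorm(100, meanlog = 2, sdlog = 1),
                  IBB = rlnorm(100, meanlog = 4, sdlog = 1),
                  BB  = rlnorm(100, meanlog = 5, sdlog = 1))

logToy <- log(toy)                      # element-wise natural log of every column
pcLog  <- princomp(logToy, cor = TRUE)  # scaled PCA on the log-transformed variables

print(summary(pcLog))                   # variance explained by each component
```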

Cheers,
Dan.

>> I then did ...
>> 
>> 
>> biplot(pc)
>> 
>> Which showed some outliers with a numeric ID - how do I get back the
>> three-part ID used in allDat?
>
>Set row names on your data frame.  Like almost all of R, it is the row 
>names of a data frame that are used for labelling, and you did not give 
>any so you got numbers.
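
A sketch of how that might look, assuming the three ID columns together
identify each row uniquely. The small data frame is typed in from the
sample rows quoted earlier, and the "_" separator is an arbitrary choice:

```r
# The three sample rows shown earlier in the thread.
allDat <- data.frame(PDB    = c("3sdh", "3hbi", "4sdh"),
                     SUNID1 = c(14984, 14986, 14988),
                     SUNID2 = c(14985, 14987, 14989),
                     AA     = c(6, 6, 6),
                     CH     = c(10, 10, 10),
                     IPCA   = c(24, 20, 20),
                     PCA    = c(24, 22, 20),
                     IBB    = c(93, 94, 104),
                     BB     = c(116, 117, 122))

# Build one label per row from the three ID columns and use it as the row name.
rownames(allDat) <- with(allDat, paste(PDB, SUNID1, SUNID2, sep = "_"))

# Selecting columns by name keeps the row names, so the labels travel with
# the variables into whatever analysis is run on 'part'.
part <- allDat[, c("AA", "CH", "IPCA", "PCA", "IBB", "BB")]
print(rownames(part))
```

Three rows are too few to run princomp() on six variables, so the sketch
stops at the labelled data frame.
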
>
>> In the above plot I saw all the variables (correctly named) pointing in
>> more or less the same direction (as shown by the variance). I then did the
>> following...
>> 
>> postscript(file="test.ps",paper="a4")
>> 
>> biplot(pc)
>> 
>> dev.off()
>> 
>> However, looking at test.ps shows that the arrows are missing (using
>> ggv)... Hmmm, they come back when I pstoimg then xv... never mind.
>
>So ggv is unreliable, perhaps cannot cope with colours?
>
>> Finally, I would like to make a contour plot of the above biplot. Is this
>> possible (or even a good way to present the data)?
>
>What do you propose to represent by the contours?  Biplots have a 
>well-defined interpretation in terms of distances and angles.
>
>