[R] scale or not to scale that is the question - prcomp

Wed Aug 19 14:49:52 CEST 2009

On 19/08/2009 8:31 AM, Petr PIKAL wrote:
> Dear all
> 
> here is my data called "rglp"
> 
> structure(list(vzorek = structure(1:17, .Label = c("179/1/1", 
> "179/2/1", "180/1", "181/1", "182/1", "183/1", "184/1", "185/1", 
> "186/1", "187/1", "188/1", "189/1", "190/1", "191/1", "192/1", 
> "R310", "R610L"), class = "factor"), iep = c(7.51, 7.79, 5.14, 
> 6.35, 5.82, 7.13, 5.95, 7.27, 6.29, 7.5, 7.3, 7.27, 6.46, 6.95, 
> 6.32, 6.32, 6.34), skupina = c(7.34, 7.34, 5.14, 6.23, 6.23, 
> 7.34, 6.23, 7.34, 6.23, 7.34, 7.34, 7.34, 6.23, 7.34, 6.23, 6.23, 
> 6.23), sio2 = c(0.023, 0.011, 0.88, 0.028, 0.031, 0.029, 0.863, 
> 0.898, 0.95, 0.913, 0.933, 0.888, 0.922, 0.882, 0.923, 1, 1), 
>     p2o5 = c(0.78, 0.784, 1.834, 1.906, 1.915, 0.806, 1.863, 
>     0.775, 0.817, 0.742, 0.783, 0.759, 0.787, 0.758, 0.783, 3, 
>     2), al2o3 = c(5.812, 5.819, 3.938, 5.621, 3.928, 3.901, 5.621, 
>     5.828, 4.038, 5.657, 3.993, 5.735, 4.002, 5.728, 4.042, 6, 
>     5), dus = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
>     1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("ano", "ne"), class = 
> "factor")), .Names = c("vzorek", 
> "iep", "skupina", "sio2", "p2o5", "al2o3", "dus"), class = "data.frame", 
> row.names = c(NA, 
> -17L))
> 
> and I am trying to do a principal component analysis. Here is one without scaling:
> 
> fit<-prcomp(~iep+sio2+al2o3+p2o5+as.numeric(dus), data=rglp)
> biplot(fit, choices=2:3,xlabs=rglp$vzorek, cex=.8)
> 
> You can see that the data form 3 groups according to the variables sio2 and 
> dus, which seems reasonable, as the lowest group has a different value of 
> dus ("ano") while the highest group has low values of sio2.
> 
> But when I do the same with scale=T
> 
> fit<-prcomp(~iep+sio2+al2o3+p2o5+as.numeric(dus), data=rglp, 
> scale.=TRUE)
> biplot(fit, choices=2:3,xlabs=rglp$vzorek, cex=.8)
> 
> I get a completely different picture, which cannot be interpreted as 
> easily.
> 
> Can anybody advise me whether I should follow the recommendation from the 
> help page, which says
> "The default is FALSE for consistency with S, but in general scaling is 
> advisable.",
> or whether I should stay with scale = FALSE and the easily interpretable 
> result?

I would say the answer depends on the meaning of the variables.  In the 
unusual case that they are measured in dimensionless units, it might 
make sense not to scale.  But if you are using arbitrary units of 
measurement, do you want your answer to depend on them?  For example, if 
you change from kg to mg, the numbers will become much larger, the 
variable will contribute much more variance, and it will become a more 
important part of the largest principal component.  Is that sensible?
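
The unit-dependence described above can be checked with a minimal sketch. 
It uses synthetic data, not the rglp data from the question; the variable 
names (mass, length) and their units are assumptions made purely for 
illustration. The same mass values are expressed once in kg and once in mg, 
and the first principal component is compared with and without scaling.

```r
# Synthetic, hypothetical data: these are NOT the rglp measurements.
set.seed(1)
n <- 50
mass_kg   <- rnorm(n, mean = 2,  sd = 0.5)
length_cm <- rnorm(n, mean = 30, sd = 5)

# Same observations, with mass expressed in two different units.
in_kg <- data.frame(mass = mass_kg,       length = length_cm)
in_mg <- data.frame(mass = mass_kg * 1e6, length = length_cm)

# Unscaled PCA: the loadings of PC1 follow whichever variable has the
# largest variance, and that depends on the unit chosen for mass.
pc1_kg <- abs(prcomp(in_kg)$rotation[, 1])
pc1_mg <- abs(prcomp(in_mg)$rotation[, 1])

# Scaled PCA: each variable is divided by its standard deviation first,
# so the change of unit cancels out.
pc1_kg_scaled <- abs(prcomp(in_kg, scale. = TRUE)$rotation[, 1])
pc1_mg_scaled <- abs(prcomp(in_mg, scale. = TRUE)$rotation[, 1])

print(round(rbind(kg = pc1_kg, mg = pc1_mg), 3))
print(all.equal(pc1_kg_scaled, pc1_mg_scaled))
```

Absolute values are taken because the sign of a principal component is 
arbitrary. In the unscaled fits the dominant loading should move from 
length to mass when the unit changes, while the scaled fits should agree.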

Duncan Murdoch