[R] k-means: should columns in dataset be in same scale?

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Apr 23 07:46:23 CEST 2008


k-means uses Euclidean distance, so scaling of the variables does matter:
left unscaled, the high-variance columns dominate the distances.
Whether you want to standardize depends on the application (as it does in
most multivariate analysis problems; PCA, for example, has the same issue).
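
A minimal sketch of the effect, on simulated data rather than anything from
the original question. The variances roughly mirror those mentioned in the
question (about 100 and about 2); the seed, the group means, the
`agreement()` helper and the choice of `nstart = 20` are all assumptions
made for illustration. Here the group structure sits entirely in the
low-variance column, which is the case where standardizing helps; if the
large-variance column carried the signal instead, standardizing could hurt.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 100
group <- rep(1:2, each = n)  # true labels, known only because we simulate

# Hypothetical data: col1 is pure noise with variance about 100,
# col2 separates the two groups but has variance of only about 2.
x <- cbind(
  col1 = rnorm(2 * n, mean = 0, sd = 10),
  col2 = rnorm(2 * n, mean = c(-1.3, 1.3)[group], sd = 0.3)
)
print(apply(x, 2, var))

# Illustrative helper (not part of any package): fraction of points whose
# cluster matches the true group, allowing for the two labels being swapped.
agreement <- function(cluster, truth) {
  p <- mean(cluster == truth)
  max(p, 1 - p)
}

# k-means on the raw columns: distances are dominated by col1.
km_raw <- kmeans(x, centers = 2, nstart = 20)

# k-means after standardizing each column to mean 0 and unit variance.
km_scaled <- kmeans(scale(x), centers = 2, nstart = 20)

agree_raw    <- agreement(km_raw$cluster, group)
agree_scaled <- agreement(km_scaled$cluster, group)
print(c(raw = agree_raw, scaled = agree_scaled))
```

On data like this the raw fit should split along col1 and so recover the
groups no better than chance, while the standardized fit should recover
them almost perfectly. scale() is only one choice of weighting; any other
rescaling of the columns changes the clustering in the same way.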

On Tue, 22 Apr 2008, Johan Jackson wrote:

> Hi all,
>
> Simple question re k-means. If I have a data set with columns that are on
> different scales (say col 1 has var=100 and col2 var=2), will this make a
> difference to the k-means algorithm? It seems as though it does. If so,
> should we first standardize the columns of the dataset so that each column
> is given equal weight?
>
> JJ

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


