[R] k-means with euclidian distance but no coordinates

Huntsinger, Reid reid_huntsinger at merck.com
Fri Dec 14 16:15:21 CET 2001


K-means does use coordinates: after classifying points by distance to the
previous iteration's means (centroids), it has to calculate the k new
within-cluster means, and that step needs coordinates. The mean is used
because it minimizes the sum of squared distances to the points in the
cluster. You could try to find this minimizer another way. You would
probably have to restrict the candidate minimizers to points in your data
set, since you can't calculate distances to anything that isn't one of your
words.
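
To make that concrete, here is a rough sketch of such an iteration in R,
working from a distance matrix only. It is a k-medoids-style variant, not
stock k-means: the function name kmedoids_sketch and the toy data are made
up for illustration, and the update step simply picks, within each cluster,
the member with the smallest sum of squared distances to the other members.

```r
# Toy data: in practice D would be your word-by-word distance matrix.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
D <- as.matrix(dist(x))   # only D is used below, never x

# Hypothetical helper: k-means-like iteration with the "centers"
# restricted to points of the data set (medoids).
kmedoids_sketch <- function(D, k, max_iter = 50) {
  n <- nrow(D)
  medoids <- sample(n, k)              # random initial medoids
  for (iter in seq_len(max_iter)) {
    # Assignment step: each point joins its nearest medoid.
    cluster <- apply(D[, medoids, drop = FALSE], 1, which.min)
    # Update step: within each cluster, choose the member that minimizes
    # the sum of squared distances to the cluster's members.
    new_medoids <- medoids
    for (j in seq_len(k)) {
      members <- which(cluster == j)
      if (length(members) > 0) {
        cost <- colSums(D[members, members, drop = FALSE]^2)
        new_medoids[j] <- members[which.min(cost)]
      }
    }
    if (identical(new_medoids, medoids)) break   # no change: stop
    medoids <- new_medoids
  }
  cluster <- apply(D[, medoids, drop = FALSE], 1, which.min)
  list(medoids = medoids, cluster = cluster)
}

res <- kmedoids_sketch(D, k = 2)
table(res$cluster)
```

Like k-means, this can stop in a local optimum, so several random starts
would be sensible. The pam() function in the cluster package works along
similar lines and accepts dissimilarities directly, although it minimizes
the sum of dissimilarities rather than squared ones.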

You could also try to get a low-dimensional representation with
multidimensional scaling (MDS). It takes a distance matrix as input and
provides for each input point a point in a low-dimensional Euclidean space.
One option is to do this for a sample, then approximate the mapping, e.g.
with a flexible regression approach. I've seen this work well in some perhaps
similar cases.
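
As a minimal sketch of the MDS route, assuming the distances are available
as a "dist" object: cmdscale() does classical (metric) MDS and kmeans() can
then be run on the resulting coordinates. The toy data and the choice of 2
dimensions and 2 clusters are arbitrary and only for illustration.

```r
# Toy data: d stands in for the word distances; x is not used afterwards.
set.seed(2)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 6), ncol = 3))
d <- dist(x)

# Classical MDS: one row of coordinates per input point, 2 columns.
coords <- cmdscale(d, k = 2)

# Stock k-means on the recovered coordinates.
fit <- kmeans(coords, centers = 2, nstart = 10)
table(fit$cluster)
```

How many dimensions to keep is a judgment call; calling cmdscale() with
eig = TRUE returns the eigenvalues, which can help with that choice.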

There are a lot of approaches to mapping into a low-dimensional Euclidean
space based essentially on principal components of the co-occurrence matrix.
Are you looking for alternatives to these? These or the MDS approach above
would let you use stock k-means, and both can be done in R.
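
A rough sketch of the principal-components route, under the assumption that
the co-occurrence data is a matrix of counts with one row per word and one
column per context. The simulated counts, the two context profiles, and the
choice of 2 components are all made up for illustration.

```r
# Simulated word-by-context counts: 20 words, 8 contexts, two groups of
# words with different context profiles (hypothetical toy data).
set.seed(3)
profile_a <- c(8, 8, 8, 8, 1, 1, 1, 1)
profile_b <- c(1, 1, 1, 1, 8, 8, 8, 8)
lambda <- rbind(matrix(profile_a, 10, 8, byrow = TRUE),
                matrix(profile_b, 10, 8, byrow = TRUE))
counts <- matrix(rpois(length(lambda), lambda), nrow = 20)

# Principal components of the count matrix; keep the first two scores
# as low-dimensional coordinates for each word.
pc <- prcomp(counts, center = TRUE, scale. = FALSE)
scores <- pc$x[, 1:2]

# Stock k-means on the component scores.
fit_pc <- kmeans(scores, centers = 2, nstart = 10)
table(fit_pc$cluster)
```

With real counts some transformation or weighting of the matrix before
prcomp() is often worth trying, but that is a separate choice from the
clustering itself.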

Reid Huntsinger





-----Original Message-----
From: Corrin Lakeland [mailto:lakeland at atlas.otago.ac.nz]
Sent: Thursday, December 13, 2001 3:43 PM
To: r-help at stat.math.ethz.ch
Subject: [R] k-means with euclidian distance but no coordinates


Hi,

I'm trying to build a thesaurus that will give sensible values for rare
words. I suspect the best algorithm to use is k-means, although I'm not sure
about that -- I would have preferred a k-dimensional space with a binary
cluster in each dimension so a word can belong to 0..k clusters, but I
digress...

I can measure the strength of correlation between words fairly easily by
counting co-occurrence divided by the frequency of each word, giving a
Euclidean distance, although this doesn't work especially well for rare
words. However, I don't have coordinates as such, and deriving them given
the distances is non-trivial.

Now, as I understand k-means, it uses Euclidean distance rather than
coordinates; the first step given in texts is to derive the distances from
the coordinates. But I can't find a way to call the built-in function
without coordinates. I had a look at R-1.3.1/src/library/mva/src/kmns.f
but my Fortran isn't good and I had enough trouble following the code, so
I'm not up to making major changes.

Any help or ideas would be appreciated.

Corrin
--
Corrin Lakeland <lakeland at cs.otago.ac.nz> 
Department of Computer Science
University of Otago, New Zealand


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



