[R] Fwd: problem with kmeans

Ranjan Maitra maitra.mbox.ignored at inbox.com
Tue Apr 29 06:35:10 CEST 2014


Cassie,

I am sorry, but do you even know what k-means does? It is a locally
optimal algorithm, so the answer depends on the starting values, and
different software packages implement the same algorithm differently.
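
To make the point about local optima concrete, here is a minimal sketch
that reruns kmeans on the posted words from 25 different random starts and
tabulates the total within-cluster sum of squares of each run. It is an
illustration under assumptions, not part of the original exchange: base R's
adist() stands in for stringdistmatrix() so that no extra package is needed,
and the choice of 25 seeds is arbitrary. Note that kmeans treats each row of
the distance matrix as a point in 12-dimensional space.

```r
# Words from the original post
test <- c('hematolgy','hemtology','oncology','onclogy',
          'oncolgy','dermatolgy','dermatoloy','dematology',
          'neurolog','nerology','neurolgy','nerology')

# Levenshtein distances via base R (assumed stand-in for stringdistmatrix)
dis <- utils::adist(test)

# One k-means run per seed, each from a single random start
wss <- sapply(1:25, function(s) {
  set.seed(s)
  kmeans(dis, centers = 4)$tot.withinss
})

# If several distinct values show up, the runs stopped at different local optima
table(round(wss, 2))
```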

FYI, R uses the Hartigan-Wong (1979) algorithm by default, which is
probably the most efficient out there. 
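
As a rough illustration of how the algorithm can be switched, the sketch
below runs kmeans from one fixed set of starting centers under three of the
algorithms listed in ?kmeans. The toy data x and the starting rows are
made up for this example and do not come from the thread.

```r
set.seed(1)

# Hypothetical toy data: two Gaussian blobs in 2-D, 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Same starting centers for every algorithm: one row from each blob
start <- x[c(1, 21), ]

algs <- c("Hartigan-Wong", "Lloyd", "MacQueen")

# Total within-cluster sum of squares reached by each algorithm
wss <- sapply(algs, function(a)
  kmeans(x, centers = start, algorithm = a)$tot.withinss)
wss
```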

I suggest you first take a multivariate statistics class before
making such sweeping statements. (Btw, did these same "some people"
tell you that most other software does not provide the kinds of broad
abilities which R provides, and is therefore not even comparable?)

And then, please read the help page for kmeans (?kmeans) for how to
"improve" your run of k-means using R.
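
One option documented in ?kmeans is the nstart argument, which runs several
random starts and keeps the best one. The sketch below is only a guess at
what "improving" the run might look like: nstart = 50 is an arbitrary
choice, and base R's adist() is assumed here in place of stringdistmatrix()
so that the example runs without extra packages. Whether it recovers the
grouping you expect is not guaranteed.

```r
# Words from the original post
test <- c('hematolgy','hemtology','oncology','onclogy',
          'oncolgy','dermatolgy','dermatoloy','dematology',
          'neurolog','nerology','neurolgy','nerology')

# Levenshtein distances via base R (assumed stand-in for stringdistmatrix)
dis <- utils::adist(test)

set.seed(123)
# 50 random starts; kmeans keeps the run with the smallest within-cluster SS
cl <- kmeans(dis, centers = 4, nstart = 50)

# Group the words by assigned cluster (replaces the explicit for loop)
grp_cl <- split(test, cl$cluster)
grp_cl
```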

HTH,
Ranjan


On Tue, 29 Apr 2014 09:45:18 +0530 cassie jones
<cassiejones26 at gmail.com> wrote:

> Dear R-users,
> 
> I am trying to run kmeans on a set comprising 100 observations. But R
> somehow cannot figure out the true underlying groups, although other
> software such as JMP and Minitab produce the desired result.
> 
> Following is a brief example of what I am doing.
> 
> library(stringdist)
> test=c('hematolgy','hemtology','oncology','onclogy',
> 'oncolgy','dermatolgy','dermatoloy','dematology',
> 'neurolog','nerology','neurolgy','nerology')
> 
> dis=stringdistmatrix(test,test, method = "lv")
> 
> set.seed(123)
> cl=kmeans(dis,4)
> 
> 
> grp_cl=vector('list',4)
> 
> for(i in 1:4)
> {
>     grp_cl[[i]]=test[which(cl$cluster==i)]
> }
> grp_cl
> 
> [[1]]
> [1] "oncology" "onclogy"
> 
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
> 
> [[3]]
> [1] "oncolgy"
> 
> [[4]]
> [1] "hematolgy"  "hemtology"  "dermatolgy" "dermatoloy" "dematology"
> 
> In the above example, the 'test' variable consists of a set of
> terms with various typos, and I am trying to group similar
> words based on their string distance. Unfortunately, kmeans is not able
> to replicate the following result that the other software is able to
> produce.
> [[1]]
> [1] "oncology" "onclogy"  "oncolgy"
> 
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
> 
> [[3]]
> [1] "dermatolgy" "dermatoloy" "dematology"
> 
> [[4]]
> [1] "hematolgy"  "hemtology"
> 
> 
> Does anyone know if there is a way out? I have heard from a lot of people
> that multivariate analysis in R does not produce the desired result most of
> the time. Any help is really appreciated.
> 
> 
> Thanks in advance.
> 
> 
> Cassie


-- 
Important Notice: This mailbox is ignored: e-mails are set to be
deleted on receipt. Please respond to the mailing list if appropriate.
For those needing to send personal or professional e-mail, please use
appropriate addresses.
