[R] Cluster analysis using term frequencies

Christian Hennig ucakche at ucl.ac.uk
Tue Mar 24 14:39:27 CET 2015


Dear Sun Shine,

>> dtes <- dist(tes.df, method = 'euclidean')
>> dtesFreq <- hclust(dtes, method = 'ward.D')
>> plot(dtesFreq, labels = names(tes.df))
>
> However, I get an error message when trying to plot this: "Error in 
> graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : invalid 
> dendrogram input".

I don't see anything wrong with the code, so what I'd do is run
str(dtes) and str(dtesFreq) to see whether these are what they should be 
(or if not, what they are instead).
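
As a rough sketch of that kind of check, here it is on a made-up toy table (the data, the document names and the helper variable n_objects are my own assumptions, not your data). Remember that dist() computes distances between the rows of its input, so the number of labels passed to plot() has to equal the number of rows that were clustered; whether that is the issue in your case is something only str() on your own objects can show.

```r
# Hypothetical toy term-frequency table: rows are documents, columns are terms.
tes.df <- data.frame(
  football = c(5, 4, 0, 0, 1, 0),
  reading  = c(0, 1, 6, 5, 0, 1),
  choir    = c(0, 0, 1, 0, 4, 5),
  row.names = paste0("doc", 1:6)
)

dtes     <- dist(tes.df, method = "euclidean")  # distances between ROWS
dtesFreq <- hclust(dtes, method = "ward.D")

str(dtes)      # a 'dist' object; its "Size" attribute is the number of objects
str(dtesFreq)  # an 'hclust' list with merge, height, order, labels, ...

# Number of objects that were actually clustered
n_objects <- attr(dtes, "Size")

# Compare with the number of labels one intends to pass to plot()
n_objects == nrow(tes.df)
length(names(tes.df)) == n_objects
```

If it turns out that the columns (terms) are what you want to cluster, dist(t(tes.df)) would compute distances between terms instead, and names(tes.df) would then have the matching length.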

> I'm clearly screwing something up, either in my source data.frame or in my 
> setting hclust up, but don't know which, nor how.

Can't comment on your source data, but generally, whatever you do, use 
str() or even print() to see whether the R objects are all right or what 
went wrong.

> More than just identifying the error however, I am interested in finding a 
> smart (efficient/ elegant) way of checking the occurrence and frequency value 
> of the terms that may be associated with 'sports', 'learning', and 
> 'extra-mural' and extracting these into a matrix or data frame so that I can 
> analyse and plot their clustering to see if how I associated these terms is 
> actually supported statistically.

The first thing that comes to my mind (not necessarily the best/most 
elegant) is to run...
dtes3 <- cutree(dtesFreq,3)
...and to table dtes3 against your manual classification.
Note that 3 is the most "natural" number of clusters at which to cut the 
tree here, but it may not be the best to match your classification (for 
example, you may have a one-point cluster in the 3-cluster solution, so it 
may effectively be a two-cluster solution with an outlier). Your 
dendrogram, if you succeed in plotting it, may give you a hint about that.
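
A minimal sketch of the cross-tabulation, again on invented data: the matrix freq, the term names and the vector manual are hypothetical stand-ins for your term frequencies and your own grouping into 'sports', 'learning' and 'extra-mural'. Here the terms are the rows, so they are the objects being clustered.

```r
# Hypothetical term-by-document frequencies: rows are terms, columns are documents.
freq <- rbind(
  football = c(5, 4, 0, 0, 0, 0),
  running  = c(4, 5, 1, 0, 0, 0),
  reading  = c(0, 0, 6, 5, 0, 1),
  maths    = c(0, 1, 5, 6, 0, 0),
  choir    = c(0, 0, 0, 0, 4, 5),
  drama    = c(0, 0, 0, 1, 5, 4)
)

# Hypothetical manual classification of the same terms
manual <- c(football = "sports",      running = "sports",
            reading  = "learning",    maths   = "learning",
            choir    = "extra-mural", drama   = "extra-mural")

dtesFreq <- hclust(dist(freq, method = "euclidean"), method = "ward.D")

# Cut the tree into 3 clusters; the result is a vector named by term
dtes3 <- cutree(dtesFreq, 3)

# Cross-tabulate cluster membership against the manual classes,
# aligning the manual labels by term name
tab <- table(cluster = dtes3, manual = manual[names(dtes3)])
print(tab)
```

If the manual grouping is supported by the data, each row of the table should be concentrated in a single column. Counts spread across several columns, or a cluster holding a single term, suggest trying a different number of clusters in cutree().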

Hope this helps,
Christian


>
> I'm sure that there must be a way of doing this in R, but I'm obviously not 
> going about it correctly. Can anyone shine a light please?
>
> Thanks for any help/ guidance.
>
> Regards,
> Sun

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hennig at ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


