[R] Ward's Clustering Doubts

Mark Difford mark_difford at yahoo.co.uk
Mon Sep 15 13:09:26 CEST 2008


Hi Rodrigo,

[apropos of Ward's method]

>> ... we saw something like "You must use it with Euclidean Distance..."

Strictly speaking this is probably correct. Ward's method does an
analysis-of-variance type of decomposition, so it doesn't really make much
sense (I think) unless Euclidean distance (i.e. least squares) is used.

However, there may be ways around this. First, the fact that a distance
coefficient is non-Euclidean in general does not mean that every matrix
computed with it is non-Euclidean. You can test your own matrix with
is.euclid in package ade4 (see ?is.euclid). You can also test it by doing a
principal co-ordinate analysis and then looking for negative eigenvalues.
If none are found, the matrix is Euclidean and it should be OK to use Ward's
method on that data set.
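
A minimal base-R sketch of the eigenvalue check described above, for
readers who want to see what the test amounts to. It is not the ade4
implementation: the function names, the toy matrices and the relative
tolerance are all assumptions made for illustration. In practice
ade4's is.euclid is the function to reach for.

```r
# Principal co-ordinate analysis boils down to an eigen-decomposition of the
# double-centred matrix of squared distances (Gower centring).
pcoa_eigenvalues <- function(d) {
  D2 <- as.matrix(d)^2
  n  <- nrow(D2)
  J  <- diag(n) - matrix(1 / n, n, n)      # centring matrix
  B  <- -0.5 * J %*% D2 %*% J
  eigen((B + t(B)) / 2, symmetric = TRUE, only.values = TRUE)$values
}

# TRUE when no eigenvalue is negative beyond rounding error.
# The relative tolerance is an assumption, not a value taken from ade4.
is_euclidean <- function(d, tol = 1e-07) {
  ev <- pcoa_eigenvalues(d)
  min(ev) > -tol * max(abs(ev))
}

# Hypothetical example 1: ordinary Euclidean distances between four points.
points <- matrix(c(0, 0,  3, 0,  0, 4,  1, 1), ncol = 2, byrow = TRUE)
d_ok   <- dist(points)

# Hypothetical example 2: a dissimilarity that violates the triangle
# inequality (d13 = 3 > d12 + d23 = 2), so it cannot be Euclidean.
d_bad <- as.dist(matrix(c(0, 1, 3,
                          1, 0, 1,
                          3, 1, 0), nrow = 3, byrow = TRUE))

is_euclidean(d_ok)    # no negative eigenvalues expected
is_euclidean(d_bad)   # a negative eigenvalue is expected
```

With real data, replace d_ok and d_bad with your own dist object, for
example the Bray-Curtis matrix you feed to the clustering function.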

Probably a better approach is to make your distance matrix Euclidean. There
are several functions in ade4 that will do this (e.g. cailliez, lingoes and
quasieuclid). The idea then is to present/compare the two solutions: the
first using the uncorrected, non-Euclidean distance matrix, the second using
the corrected version. You could use procrustes/co-inertia analysis to
compare the two in an intermediate step.
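
A rough base-R sketch of that workflow, under stated assumptions. The
dissimilarity matrix is made up, lingoes_sketch is a hypothetical
hand-written stand-in for a Lingoes-type correction (ade4's lingoes is
the function to use in practice), and the comparison uses the
correlation between cophenetic distances as a simpler substitute for
the procrustes/co-inertia step. The "ward.D2" method name assumes
R >= 3.1.0.

```r
# Smallest PCoA eigenvalue of a dist object (Gower double-centring).
min_pcoa_eig <- function(d) {
  D2 <- as.matrix(d)^2
  n  <- nrow(D2)
  J  <- diag(n) - matrix(1 / n, n, n)
  B  <- -0.5 * J %*% D2 %*% J
  min(eigen((B + t(B)) / 2, symmetric = TRUE, only.values = TRUE)$values)
}

# Lingoes-type correction: d' = sqrt(d^2 + 2c), where c is the absolute
# value of the most negative eigenvalue (c = 0 if there is none).
lingoes_sketch <- function(d) {
  c1 <- max(0, -min_pcoa_eig(d))
  sqrt(d^2 + 2 * c1)
}

# Hypothetical dissimilarities among five stations; d13 > d12 + d23,
# so the matrix is not Euclidean.
m <- matrix(c(0.0, 0.2, 0.9, 0.8, 0.9,
              0.2, 0.0, 0.3, 0.9, 0.8,
              0.9, 0.3, 0.0, 0.7, 0.9,
              0.8, 0.9, 0.7, 0.0, 0.1,
              0.9, 0.8, 0.9, 0.1, 0.0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("st", 1:5), paste0("st", 1:5)))
d_raw <- as.dist(m)
d_fix <- lingoes_sketch(d_raw)

min_pcoa_eig(d_raw)   # negative before correction
min_pcoa_eig(d_fix)   # zero, up to rounding, after correction

# Ward clustering on both versions, then a simple agreement measure.
hc_raw <- hclust(d_raw, method = "ward.D2")
hc_fix <- hclust(d_fix, method = "ward.D2")
agreement <- cor(cophenetic(hc_raw), cophenetic(hc_fix))
agreement
```

If the two dendrograms agree closely, the non-Euclidean nature of the
original matrix is probably not driving the clusters. If they differ,
the corrected solution is the safer one to report.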

Regards, Mark.


Rodrigo Aluizio wrote:
> 
> Hi Everybody,
> Now I have a question that is more statistical than technical. I work
> on the ecology of recent Foraminifera.
> 
> At the lab we used to perform cluster analysis using 1-Pearson’s R and
> Ward's method (we had already seen this in the literature of the area),
> which gives good results with our biological data. Recently, using R (the
> vegan and cluster packages), which allows any kind of distance matrix to
> be combined with any clustering method, we tried Bray-Curtis + Ward's
> (which seems more appropriate for a matrix with a lot of zeros) and it
> gave a better result. Furthermore, the results agree with our hypothesis
> and with the results we got from distance-based redundancy analysis
> (dbRDA, or CAP). That is, the analysis (Q-mode) clusters the stations
> according to the main physical, sedimentary and biological
> characteristics of the study area.
> 
> We received some critical comments pointing out that Ward's method accepts
> Euclidean distance only. So we ran the analysis again using Euclidean
> distance, but we don’t get the better results we had using 1-Pearson’s R +
> Ward's or Bray-Curtis + Ward's (actually, any other distance + method
> combination gave better results). Trying to find answers in the
> specialized literature, we just got a little more confused: at one point
> we saw something like "You must use it with Euclidean Distance", and, as I
> said above, we had already seen other kinds of distance used with Ward's
> clustering method in some articles from respected journals.
> 
> Is it wrong, or is it nonsense, to do the analysis the way we were
> doing it?
> 
> The results with Ward's combined with 1-Pearson’s R or Bray-Curtis fit
> better with our hypothesis and have excellent agglomerative coefficients,
> but we don’t want to use inappropriate statistical procedures. I'm
> starting to realize how powerful R is, but that doesn't justify doing
> nonsense statistics... I hope one of you can help us!
> 
> Thank you in advance.
> 
> Rodrigo.