[R] Similarity matrix

Kaspar Pflugshaupt pflugshaupt at geobot.umnw.ethz.ch
Wed Apr 11 14:25:22 CEST 2001


On Wednesday 11 April 2001 10:23, Prof Brian Ripley wrote:


> And what does S-PLUS use? (Which is the point here?)


I've never done cluster analysis with S-Plus. But let's see:

The statistical manual for S-Plus 5.1/Unix fails to even mention similarity 
matrices.

help(hclust) (in S-Plus 5.1/Unix and 3.4/Unix) says 

  USAGE:                                                            
      
  hclust(dist, method = "compact", sim =)
                                       
  [...]         
                                                         
   sim=                                                  
          structure giving similarities rather than distances. This can
          either be a symmetric matrix or a vector with a "Size"       
          attribute. Missing values are not allowed.

The help text does not explain how the conversion to distances is done, 
though. And the source is not available...

> I guess we have to experiment?


Well, I've taken the time to do it for you (S-Plus 3.4/Unix):

  mat <- matrix(runif(100), nrow=10)
  # 1 - ...: S-Plus seems to mirror the tree's y scale when given a
  # similarity matrix
  print(1 - plclust(hclust( sim=mat ))$yn)

gives the same values as

  print(plclust(hclust( 1-mat ))$yn)

but different values from

  print(plclust(hclust( sqrt(1-mat) ))$yn)

The grouping structure is the same in all three cases, anyway.

So, S-Plus seems to use D=1-S rather than D=sqrt(1-S) internally.
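
For anyone wanting to repeat the comparison in R, here is a minimal sketch. It
is not the S-Plus session above: it assumes that S-Plus's "compact" method
corresponds to R's "complete" linkage, it symmetrises the random matrix first,
and it compares the two conversions directly in R instead of going through
sim=. All variable names are made up for the illustration.

```r
# Sketch in R (not S-Plus): compare D = 1 - S with D = sqrt(1 - S)
# on one random, symmetric similarity matrix.
set.seed(1)
n <- 10
S <- matrix(runif(n * n), nrow = n)
S <- (S + t(S)) / 2          # make the similarities symmetric
diag(S) <- 1                 # each object is fully similar to itself

# Complete linkage, assumed here to match S-Plus's "compact"
h_lin  <- hclust(as.dist(1 - S),       method = "complete")
h_sqrt <- hclust(as.dist(sqrt(1 - S)), method = "complete")

# Same sequence of merges, i.e. the same grouping structure?
same_merges  <- identical(h_lin$merge, h_sqrt$merge)
# Same merge heights?
same_heights <- isTRUE(all.equal(h_lin$height, h_sqrt$height))
# Are the sqrt heights just the square roots of the 1 - S heights?
sqrt_heights <- isTRUE(all.equal(sqrt(h_lin$height), h_sqrt$height))

print(c(same_merges = same_merges,
        same_heights = same_heights,
        sqrt_heights = sqrt_heights))
```

With complete linkage the merge order depends only on the rank order of the
distances, and sqrt() preserves that order, so the tree shape should not
change while the heights do.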

For R, it might be a good idea to let the user choose the conversion method 
via an additional parameter, making D=1-S the default.
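
To make the suggestion concrete, here is a rough sketch of what such a
parameter could look like. The function name sim2dist and its conversion
argument are hypothetical; nothing like this exists in R's hclust as far as
this post is concerned, and the sketch assumes similarities scaled to [0, 1].

```r
# Hypothetical helper: convert a similarity matrix to a "dist" object,
# letting the user pick the conversion. D = 1 - S is the default.
sim2dist <- function(sim, conversion = c("1-S", "sqrt(1-S)")) {
  conversion <- match.arg(conversion)
  sim <- as.matrix(sim)
  stopifnot(nrow(sim) == ncol(sim),          # square
            !any(is.na(sim)),                # no missing values
            isSymmetric(unname(sim)),        # symmetric
            all(sim >= 0 & sim <= 1))        # assumed range of S
  d <- switch(conversion,
              "1-S"       = 1 - sim,
              "sqrt(1-S)" = sqrt(1 - sim))
  as.dist(d)
}

# Small made-up similarity matrix for three objects
S <- matrix(c(1,    0.75, 0.1,
              0.75, 1,    0.75,
              0.1,  0.75, 1), nrow = 3, byrow = TRUE)

d_default <- sim2dist(S)                 # D = 1 - S
d_sqrt    <- sim2dist(S, "sqrt(1-S)")    # D = sqrt(1 - S)

h <- hclust(d_default, method = "complete")
```

Doing the conversion in a separate step keeps hclust itself unchanged; an
extra argument to hclust would be the other obvious design.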

According to Legendre & Legendre, the choice of similarity coefficient 
_does_ make a difference as to which conversion should be preferred. For some 
"species" of similarity coefficients, the resulting distance is metric 
and Euclidean with one method but not with the other; for others it is the 
other way round. I don't know if this matters for cluster analysis, but I 
think that it might, especially when clustering with a Euclidean metric.
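
As an illustration of how the two conversions can differ in this respect, here
is a small sketch. It is not taken from Legendre & Legendre: it uses the
standard classical-scaling criterion (a distance matrix is Euclidean if the
double-centred matrix of -D^2/2 has no negative eigenvalues), the helper name
is_euclidean is made up, and the three-object similarity matrix is invented
for the purpose.

```r
# Hypothetical helper: TRUE if the distances in D can be embedded in a
# Euclidean space, judged by the eigenvalues of the double-centred
# matrix B = -1/2 * J D^2 J (classical scaling criterion).
is_euclidean <- function(D, tol = 1e-8) {
  D <- as.matrix(D)
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)    # centring matrix
  B <- -0.5 * J %*% (D^2) %*% J
  B <- (B + t(B)) / 2                   # remove rounding asymmetry
  ev <- eigen(B, symmetric = TRUE, only.values = TRUE)$values
  min(ev) >= -tol * max(abs(ev))
}

# Invented similarities: objects 1-2 and 2-3 are close, 1-3 are not
S <- matrix(c(1,    0.75, 0.1,
              0.75, 1,    0.75,
              0.1,  0.75, 1), nrow = 3, byrow = TRUE)

# D = 1 - S gives 0.25, 0.25, 0.9: the triangle inequality fails
euclid_lin  <- is_euclidean(1 - S)
# D = sqrt(1 - S) gives 0.5, 0.5, about 0.95: a valid triangle
euclid_sqrt <- is_euclidean(sqrt(1 - S))

print(c(euclid_lin = euclid_lin, euclid_sqrt = euclid_sqrt))
```

For this particular matrix only the sqrt conversion gives Euclidean distances;
for other coefficients it may well be the other way round.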


Cheers (hoping this was to the point :-)

Kaspar Pflugshaupt

-- 

Kaspar Pflugshaupt
Geobotanical Institute
ETH Zurich, Switzerland