[R] daisy() for gower distance calculation

Martin Maechler maechler at stat.math.ethz.ch
Mon Nov 20 10:16:07 CET 2006


>>>>> "Tyler" == Tyler Smith <tyler.smith at mail.mcgill.ca>
>>>>>     on Sun, 19 Nov 2006 23:47:15 -0400 writes:

    Tyler> Gavin Simpson wrote:
    >> vegdist in package vegan has Gower's distance, but all
    >> variables have to be numeric.

    >> If you want to use mixed data (numerics, factors,
    >> binary), see ?daisy in package cluster.

    Tyler> This is a little unclear. vegdist will handle regular
    Tyler> quantitative variables as well as binary
    Tyler> variables. This is not so much a feature of vegdist
    Tyler> as of the Gower similarity, which treats binary and
    Tyler> quantitative variables identically, since a simple
    Tyler> matching coefficient produces the same similarity
    Tyler> value as is produced by Gower's quantitative
    Tyler> similarity function for a variable that can take only
    Tyler> two values.

    Tyler> Perhaps that's what you meant, and I just
    Tyler> misunderstood you. Perhaps I'm wrong, and someone
    Tyler> will correct me!
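
The equivalence Tyler describes can be checked by hand for a single
two-valued variable. The following is only a minimal sketch in base R,
with a made-up 0/1 vector `x`; it works with dissimilarities (one minus
the similarity) and does not use vegdist or daisy() at all.

```r
# Hypothetical 0/1 variable observed on five units (made-up data)
x <- c(0, 1, 1, 0, 1)

# Gower's quantitative dissimilarity for one variable: |x_i - x_j| / range(x)
gower_quant <- abs(outer(x, x, "-")) / diff(range(x))

# Simple matching dissimilarity: 1 where the two values differ, 0 otherwise
simple_match <- 1 * outer(x, x, "!=")

# Compare the two 5 x 5 matrices element by element
same_result <- isTRUE(all.equal(gower_quant, simple_match))
print(same_result)
```

Because the range of a 0/1 variable is 1, the scaled absolute
difference can only be 0 or 1, which is exactly the mismatch indicator.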

Two things, not really a correction:

- daisy() is in the Recommended package cluster, which is part of every
  R installation, so why not try it first?

- daisy() was developed for and is documented in the book by
  Kaufman and Rousseeuw (1990). They strove to be more flexible
  than Gower's original proposal, and I (as maintainer of the
  'cluster' package) have further tweaked the daisy() implementation.

  It allows missing values (NAs) and distinguishes between, and
  hence lets you specify, the following 6--7 types of variables:

  continuous: "interval-scaled", "ordratio", "logratio"
              (where the last one just means to work on log()ed variables)
  discrete:
              asymmetric binary "A"
               symmetric binary "S"
              nominal           "N" - (unordered) factor
              ordered           "O" - ordered factor

  where all but the "*ratio" and binary types are determined by
  default from the variables in the data frame.
  For binary variables, using "symmetric" is effectively the
  same as using "interval-scaled", and this is what is used by default.
  However, the default now gives a warning to the user,
  since the reference (and I) recommend that you *think* about
  whether *a*symmetric binary would not be more appropriate {which it
  is in many cases in today's applications}.
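
As a rough illustration of the points above, here is a minimal sketch
of calling daisy() on mixed data. The data frames `d` and `b` and all
their column names are made up for this example, and the `type = list()`
component names ("symm", "asymm") are those of the current cluster
package documentation, so older versions may differ.

```r
library(cluster)

# Hypothetical mixed-type data (made up for illustration), with one NA
d <- data.frame(
  height  = c(1.2, 3.4, 2.2, 5.0, NA),                 # interval-scaled
  present = c(0, 1, 1, 0, 1),                          # binary, declared symmetric
  rare    = c(0, 0, 1, 0, 1),                          # binary, declared asymmetric
  colour  = factor(c("red", "blue", "red", "green", "blue")),  # nominal
  grade   = factor(c("lo", "hi", "mid", "lo", "hi"),
                   levels = c("lo", "mid", "hi"), ordered = TRUE)  # ordered
)

# Factors and ordered factors are detected automatically; only the
# binary columns need to be declared via 'type'
dd <- daisy(d, metric = "gower",
            type = list(symm = "present", asymm = "rare"))
print(dd)
print(attr(dd, "Types"))   # one-letter type codes, one per column

# Symmetric binary versus the default (interval-scaled) treatment
b <- data.frame(a = c(0, 1, 1, 0), b = c(1, 1, 0, 0))   # hypothetical
d_sym <- daisy(b, metric = "gower", type = list(symm = 1:2))
d_int <- suppressWarnings(daisy(b, metric = "gower"))   # default warns about binary columns
print(all.equal(as.vector(d_sym), as.vector(d_int)))
```

Declaring a column as "asymm" instead changes the result, because
pairs that are both 0 on that variable are then left out of the
comparison for it.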

Regards,
Martin Maechler


