[R] pam() with more general dissimilarity / distance

Fri Apr 8 13:55:15 CEST 2022

I was asked in private, but reply in public,
so others can also find this answer in the future:

On Fri, Apr 8, 2022 at 1:11 PM  ..... wrote :
>  Hello
> dear Dr. Maechler
> I have a question about "pam" function in the cluster package. In this
> function, we choose one of the  euclidean or manhattan distances to
> calculate dissimilarity but in the mixed typed data sets the true index may
> be jaccard or other indicators.
> How can we allocate the "true" metric for each variable?
> Best regards
>

Yes, you can use pam() in two ways; see this part of the help page:

  Arguments:

       x: data matrix or data frame, or dissimilarity matrix or object,
          depending on the value of the ‘diss’ argument.

          In case of a matrix or data frame, each row corresponds to an
          observation, and each column corresponds to a variable.  All
          variables must be numeric.  Missing values (NAs) _are_
          allowed-as long as every pair of observations has at least
          one case not missing.

          In case of a dissimilarity matrix, ‘x’ is typically the
          output of daisy or dist.  Also a vector of length
          n*(n-1)/2 is allowed (where n is the number of observations),
          and will be interpreted in the same way as the output of the
          above-mentioned functions. Missing values (NAs) are _not_
          allowed.

So, you can first use   dx <- daisy(x, ...)   to compute the appropriate
dissimilarity between your observational units.
After that, you can use the computed distance / dissimilarity matrix
(the `dx`) in your call to pam():

px <- pam(dx, k=., ....)
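As a minimal sketch of these two steps: the data frame, its variable names
and the choice k = 2 below are made up purely for illustration, and the
Gower metric is just one of the options daisy() offers for mixed-type data.

```r
library(cluster)

## Hypothetical mixed-type data: one numeric, one factor, one 0/1 variable
set.seed(1)
x <- data.frame(
  height = c(rnorm(10, 160, 5), rnorm(10, 185, 5)),
  colour = factor(rep(c("red", "blue"), each = 10)),
  smoker = rep(c(0, 1), times = 10)
)

## Step 1: Gower dissimilarity; 'smoker' is declared asymmetric binary,
## so it is compared in a Jaccard-like way (joint 0/0 matches are ignored)
dx <- daisy(x, metric = "gower", type = list(asymm = "smoker"))

## Step 2: pam() recognises the dissimilarity object, so only k is needed
px <- pam(dx, k = 2)

px$id.med                        # row indices of the two medoids
table(px$clustering, x$colour)   # cluster membership vs. one input variable
```

If all variables are binary, dist(x, method = "binary") should also work
for step 1, as it computes a Jaccard-type distance that pam() accepts in
the same way.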

I hope this helps you.
With best regards,
Martin

--
Martin Maechler
ETH Zurich
