[R] silhoutte.default bugs

Murad Nayal mn216 at columbia.edu
Wed Jan 21 21:19:28 CET 2004



Hello all,


This might have been fixed in later versions (I am using R1.7.0), r-help
archive contains messages reporting similar problems but no reports of
codes fixes. I have encountered a couple of problems using the
silhouette function. one occurs when the clustering contains clusters
composed of 1 element (Martin Maechler posted code few months ago that
fixes a similar problem that occurs when clusters have only 2 elements
but not the case with 1 element). the other problem is due to
silhouette's assumption that the clusters are numbered sequentially
starting at 1. one of the clustering programs I use (snob) assigns more
or less arbitrary integer ids to clusters starting from 3! (clusters 1
and 2 have special meaning in snob). the modified code fixing both
problems is included below, changes are commented.

best
Murad

silhouette.default <-
function (x, dist, dmatrix, ...) 
{
    cll <- match.call()
    if (!is.null(cl <- x$clustering)) 
        x <- cl
    n <- length(x)
    if (!all(x == round(x))) 
        stop("`x' must only have integer codes")
    k <- length(clid <- sort(unique(x)))
    if (k <= 1 || k >= n) 
        return(NA)
    if (missing(dist)) {
        if (missing(dmatrix)) 
            stop("Need either a dissimilarity `dist' or diss.matrix
`dmatrix'")
        if (is.null(dm <- dim(dmatrix)) || length(dm) != 2 || 
            !all(n == dm)) 
            stop("`dmatrix' is not a dissimilarity matrix compatible to
`x'")
    }
    else {
        dist <- as.dist(dist)
        if (n != attr(dist, "Size")) 
            stop("clustering `x' and dissimilarity `dist' are
incompatible")
        dmatrix <- as.matrix(dist)
    }
    wds <- matrix(NA, n, 3, dimnames = list(names(x), c("cluster", 
        "neighbor", "sil_width")))
    for (j in 1:k) {
        Nj <- sum(iC <- x == clid[j])
#
# the following line changed from  wds[iC, "cluster"] <- j
#
        wds[iC, "cluster"] <- clid[j]
        a.i <- if (Nj > 1) 
            colSums(dmatrix[iC, iC])/(Nj - 1)
        else 0
#
# the following line changed from 
# diC <- rbind(apply(dmatrix[!iC, iC], 2, function(r) tapply(r,
# x[!iC], mean)))
#
        diC <- rbind(apply(cbind(dmatrix[!iC, iC]), 2, function(r)
tapply(r, 
            x[!iC], mean)))
        minC <- max.col(-t(diC))
        wds[iC, "neighbor"] <- clid[-j][minC]
#
# the following line changed from 
# b.i <- diC[cbind(minC, seq(minC))]
#
        b.i <- diC[cbind(minC, seq(along=minC))]
        s.i <- (b.i - a.i)/pmax(b.i, a.i)
        wds[iC, "sil_width"] <- s.i
    }
    attr(wds, "Ordered") <- FALSE
    attr(wds, "call") <- cll
    class(wds) <- "silhouette"
    wds
}
-- 
Murad Nayal M.D. Ph.D.
Department of Biochemistry and Molecular Biophysics
College of Physicians and Surgeons of Columbia University
630 West 168th Street. New York, NY 10032
Tel: 212-305-6884	Fax: 212-305-6926




More information about the R-help mailing list