[R] How a clustering algorithm in R can end up with negative silhouette values?
sarah.goslee at gmail.com
Fri Feb 19 21:22:22 CET 2016
Ah, my guess about the confusion was wrong, then. You're
misunderstanding silhouette() instead.
Observations with a large s(i) (almost 1) are very well clustered,
a small s(i) (around 0) means that the observation lies between
two clusters, and observations with a negative s(i) are probably
placed in the wrong cluster.
In more detail, clara() and silhouette() are looking at different things.
clara() assigns each point to a cluster based on the distance to the
nearest medoid.
silhouette() does something different: instead of comparing the
distances to the closest medoid and the next closest medoid, which is
what you seem to be assuming, silhouette() compares the mean distance
to ALL other points assigned to that cluster with the mean distance to
all points in the nearest other cluster. The distance to the medoid is
irrelevant, except as it is one of the points in that cluster.
So a negative silhouette value is entirely possible, and means that
the clustering produced doesn't represent the dataset very well.
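
Here is a minimal sketch of how that can happen. It is not clara()
itself: the data and the two medoids are made up and picked by hand
purely for illustration. Each point is assigned to its nearest medoid,
and then silhouette() is run on the result, with
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance
to the other points in the same cluster and b(i) is the mean distance
to the points in the nearest other cluster.

```r
library(cluster)  # provides silhouette()

# Hypothetical 1-D data: a wide group on the left, a tight group on the right
x <- c(-20, -10, 0, 4, 10, 10.5, 11)

# Hand-picked medoids for illustration (NOT the output of clara())
medoids <- c(0, 10)

# Assign every point to its nearest medoid
d_to_medoids <- abs(outer(x, medoids, "-"))
clustering <- apply(d_to_medoids, 1, which.min)

# Silhouette widths are computed from all pairwise distances,
# not from the distances to the medoids
sil <- silhouette(clustering, dist(x))
print(cbind(x = x, sil[, c("cluster", "neighbor", "sil_width")]))
```

The point at x = 4 is closer to the medoid at 0 (distance 4) than to
the medoid at 10 (distance 6), so it lands in cluster 1. But its mean
distance to the other members of cluster 1 is a = (24 + 14 + 4) / 3 = 14,
while its mean distance to the members of cluster 2 is
b = (6 + 6.5 + 7) / 3 = 6.5, so its silhouette width is
(6.5 - 14) / 14, about -0.54.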
On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
<Behnam.ABABAEI at limagrain.com> wrote:
> sorry for taking up your time.
> I totally agree with you about how it works. But please let's take a look at this part of the description:
> "Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid. The mean (equivalent to the sum) of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which the mean (or sum) is minimal, is retained. A further analysis is carried out on the final partition."
> It says each observation is finally assigned to the closest medoid. The whole clustering process may be imperfect in terms of how well the clusters are isolated, but each observation is already assigned to the closest one, so according to the silhouette formula the silhouette value cannot be negative, because a must always be less than b.
> From: Sarah Goslee <sarah.goslee at gmail.com>
> Sent: 19 February 2016 20:58
> To: ABABAEI, Behnam
> Cc: r-help at r-project.org
> Subject: Re: [R] How a clustering algorithm in R can end up with negative silhouette values?
> You need to think more carefully about the details of the clara() method.
> The algorithm draws repeated samples of sampsize from the larger
> dataset, as specified by the arguments to the function.
> It clusters each sample in turn, and saves the best one.
> It uses the medoids from the best one to assign all of the points to a cluster.
> But because the clustering is based on a subsample, it may not be
> representative of the dataset as a whole, and may not provide a good
> clustering overall. Just because it clusters the subsample well,
> doesn't mean it clusters the entirety. The details section of the help
> describes this, and the book reference goes into more detail.
> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
> <Behnam.ABABAEI at limagrain.com> wrote:
>> Hi Sarah,
>> Thank you for the response. But it is said in its description that after
>> each run (sample), each observation in the whole dataset is assigned to the
>> closest cluster. So how is it possible for one observation to be wrongly
>> allocated, even with clara?
>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
>> <sarah.goslee at gmail.com> wrote:
>> That means that points have been assigned to the wrong groups. This
>> may readily happen with a clustering method like cluster::clara() that
>> uses a subset of the data to cluster a dataset too large to analyze as
>> a unit. Negative silhouette numbers strongly suggest that your
>> clustering parameters should be changed.
>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
>> <Behnam.ABABAEI at limagrain.com> wrote:
>>> We know that clustering methods in R assign observations to the closest
>>> medoids. Hence, each observation should already be in the closest cluster
>>> it can have. So how is it possible to get negative silhouette values,
>>> when we supposedly assign each observation to the closest cluster and
>>> the formula in the silhouette method then cannot go negative?