[R] Hidden Problems with Clustering Algorithms

Tue Nov 22 05:26:14 CET 2022

Most clustering algorithms are heuristic and exploratory, and are well
known to be subject to the sorts of issues that you are concerned with.
There are no universal fixes ... just tradeoffs. You need to choose
whichever is most appropriate for your situation and data.

Bert

On Mon, Nov 21, 2022, 19:28 Leonard Mada via R-help <r-help using r-project.org>
wrote:

> Dear R-Users,
>
> Hidden Problems with Clustering Algorithms
>
> I stumbled recently upon a presentation about hierarchical clustering.
> Unfortunately, it contains a hidden problem of clustering algorithms.
> The problem is deeper and I think that it warrants a closer inspection
> by the statistical community.
>
> The presentation is available online. Both the scaled & non-scaled
> versions show the problem.
>
> de.NBI course - Advanced analysis of quantitative proteomics data using
> R: 03b Clustering Part2
> [Note: it's more like introductory notes to basic statistics]
> https://www.youtube.com/watch?v=7e1uW_BhljA
> times:
> - at 6:15 - 6:28 & 6:29 - 7:10 [2 versions, both non-scaled]
> - at 5:51 - 6:10 [the scaled version]
> - same problem at 7:56;
>
> PROBLEM
>
> Non-Scaled Version: (e.g. the one at 6:15)
> - the upper 2 rows are split into various sub-clusters;
> - the top tree: a cluster is formed by the right-right sub-tree (some 17
> "genes" or similar "activities" / "expressions");
> - the left-most 2 "genes" are actually over-expressed "genes" and
> functionally really belong to the previous/right sub-cluster;
>
> Scaled-Version: (at 5:52)
> - the left-most 2 "genes" are over-expressed at the same time as the
> right cluster, and not otherwise;
>
> Unfortunately, the 2 over-expressed "genes" (outliers or extreme values)
> are split off from the relevant cluster and inserted as a separate
> main branch in the top dendrogram. Switching only the main left & right
> branches in the top tree would only mask this problem. The 2
> pseudo-outliers are probably just the upper values of the larger
> cluster of over-expressed "genes" (all the dark genes should belong to
> the same cluster).
>
> The middle sub-cluster shows really NO activity (some 16 "genes"). The
> main branches in the top tree should really split between this
> *NO*-activity cluster and the cluster showing activity (including the 2
> massively over-expressed genes). The problem is present in the scaled
> version as well.
>
> The hierarchical clustering algorithm fails. I have not analysed the
> data, but some problems may contribute to this:
> - "gene expression" or "activity" may not be linear, but exponential or
> follow some power rule: a logarithmic transformation (or some other
> transformation) may have been useful;
> - simple distances between clusters may be too inaccurate;
> - the variance in the low-activity (middle) cluster may be very low
> (almost 0!), while the variance in the high-activity cluster may be much
> higher: the Mahalanobis distance or joining the sub-clusters based on
> some z/t-test taking into account the different variances may be more
> robust;
>
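
The first and third points above can be tried out on made-up data. The
sketch below is only an illustration under stated assumptions: the matrix
is simulated (it is NOT the data from the presentation), the group sizes
16/17/2 simply mirror the counts mentioned above, and make_genes() and
top_split() are hypothetical helper names. It compares the top-level split
of a complete-linkage tree on the raw scale and on the log2 scale.

```r
# Simulated, multiplicative-scale "expression" matrix: rows = genes,
# columns = samples. All values and helper names are assumptions made
# for illustration only.
set.seed(1)
n_samples <- 6

make_genes <- function(n, level, prefix) {
  # log-normal noise around a common level, so the spread grows with the level
  m <- matrix(level * exp(rnorm(n * n_samples, sd = 0.2)), nrow = n)
  rownames(m) <- paste0(prefix, seq_len(n))
  m
}

expr <- rbind(make_genes(16,   1, "inactive"),
              make_genes(17,  50, "active"),
              make_genes( 2, 400, "extreme"))
group <- sub("[0-9]+$", "", rownames(expr))

# Top-level split (k = 2) of a complete-linkage tree on Euclidean distances
top_split <- function(x) cutree(hclust(dist(x), method = "complete"), k = 2)

raw_split <- top_split(expr)        # raw scale
log_split <- top_split(log2(expr))  # log2 scale

print(table(group, raw_split))
print(table(group, log_split))
```

On data built this way, the two tables show which groups share a main
branch on each scale. A different linkage method or distance may well
behave differently, so this is a way to probe the question, not an answer.
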
> These questions should be addressed by more senior statisticians.
>
> I hope that the presentation remains on-line as is, as the clustering
> problem is really easy to see and to analyse. It is impossible to detect
> and visualise such anomalies in a heatmap with 1,000 gene-expressions or
> with 10,000 genes, or with 500-1000 samples. It is very obvious on this
> small heatmap.
>
> I do not know if there are any robust tools to validate the generated
> trees. Inspecting by "eye" a dendrogram with > 1,000 genes and hundreds
> of samples is really futile.
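
One numerical check that base R does offer is the cophenetic correlation:
the correlation between the original distances and the heights at which
pairs of items are joined in the tree. The sketch below uses simulated
data (an assumption, not the data from the presentation) and only
summarises how faithfully each tree reproduces the distances; it does not
by itself show that any particular branch is "right".

```r
# Simulated data with two well-separated groups of 10 items each
# (an assumption made for illustration only).
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 5), ncol = 4))
d <- dist(x)

# Cophenetic correlation for several linkage methods: values closer to 1
# mean the dendrogram heights track the original distances more closely.
methods <- c("single", "complete", "average", "ward.D2")
coph <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})
print(round(coph, 3))
```

Comparing this value across linkage methods or transformations gives a
rough, automated screen that scales to trees far too large to inspect by
eye.
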
>
> Sincerely,
>
> Leonard