[BioC] Can we use Intraclass correlation (ICC) to optimize clustering parameters?

Peter Langfelder peter.langfelder at gmail.com
Wed Feb 12 00:42:52 CET 2014


On Tue, Feb 11, 2014 at 2:08 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
> you might find the 'WGCNA' package to be useful for a starting point.  it
> is also extensively published and IIRC, at least one of the authors is on
> this list

One of the WGCNA authors reporting for duty. Tim, thanks for the advertising! :)


>> Hi all,
>>
>> I am trying to find co-expressed genes in my affy data. I use hierarchical
>> clustering with dynamic tree cut. I want to choose optimal clustering/cut
>> parameters and I am new to cluster validation. I understand that there are
>> many cluster indices that can be used for cluster validation.

Validation usually means having an independent data set. If you do
have an independent data set and want to know whether the clusters you
found in your original ("reference") data set can be found in your
validation ("test") data set, you can use the WGCNA module
preservation statistics
(http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/ModulePreservation/).

If you want to know whether you chose "optimal" clusters using only the
data set you derived the clusters from, there are many measures of
cluster quality that you can choose from; I am not an expert in this
area. Be aware that the Dynamic Tree Cut approach is very heuristic
and, depending on which measure of cluster quality you choose, may not
lead to an optimal clustering (but it seems to reproduce clusters in
simulated data quite well, and on real data yields functionally
coherent modules).

>>
>> Since I am interested in co-expression only, can I simply use intraclass
>> correlation (ICC) as a metric to choose optimal parameters? ie, choose the
>> clustering parameters that gives the highest ICC in each cluster.
>>
>> Is ICC commonly used for choosing clustering parameters? Is it Ok? or Is
>> there any other more commonly used metric?

ICC (I assume you mean the average correlation among all profiles
within a cluster) should be a viable measure but will inevitably get
better as you increase the number of clusters, so simply maximizing
ICC will not work - you will need to include some penalty for the
number of clusters.
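
To illustrate the point above, here is a minimal sketch in plain Python
(standard library only). It takes "ICC" to mean the average pairwise
Pearson correlation among profiles in the same cluster, as assumed
above. The toy profiles, the function names, the linear penalty and its
weight of 0.05 are all hypothetical choices made for illustration; they
are not part of WGCNA or Dynamic Tree Cut, and other penalties may suit
real data better.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_within_cluster_correlation(profiles, labels):
    """Average correlation over all pairs of profiles sharing a label.

    Singleton clusters contribute no pairs, which is one reason the
    raw average tends to improve as clusters are split further.
    """
    cors = [
        pearson(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
        if labels[i] == labels[j]
    ]
    return sum(cors) / len(cors) if cors else float("nan")


def penalized_score(profiles, labels, penalty=0.05):
    """Raw average minus a linear penalty per cluster (assumed form)."""
    n_clusters = len(set(labels))
    return mean_within_cluster_correlation(profiles, labels) - penalty * n_clusters


# Toy data: three roughly increasing and three roughly decreasing profiles.
profiles = [
    [1, 2, 3, 4], [1, 2, 4, 3], [2, 1, 3, 4],
    [4, 3, 2, 1], [4, 3, 1, 2], [3, 4, 2, 1],
]

# Candidate clusterings, e.g. from different tree-cut parameters.
candidates = {
    "1 cluster": [0, 0, 0, 0, 0, 0],
    "2 clusters": [0, 0, 0, 1, 1, 1],
    "4 clusters": [0, 0, 1, 2, 2, 3],
}

for name, labels in candidates.items():
    raw = mean_within_cluster_correlation(profiles, labels)
    pen = penalized_score(profiles, labels)
    print(f"{name}: raw = {raw:.3f}, penalized = {pen:.3f}")

best = max(candidates, key=lambda name: penalized_score(profiles, candidates[name]))
print("best by penalized score:", best)
```

On this toy data the raw average is highest for the 4-cluster split
even though the data were built from two groups, while the penalized
score prefers the 2-cluster solution. The outcome depends on the
penalty weight, so it would have to be tuned or replaced by an
established index for real data.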


HTH,

Peter


