[BioC] Question about clustering and cluster validation

Robert Chapman ChapmanR at dnr.sc.gov
Fri Nov 19 14:08:13 CET 2010


Have you tried the freeware program called WEKA?
Bob
________________________________________
From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch] On Behalf Of January Weiner [january.weiner at mpiib-berlin.mpg.de]
Sent: Friday, November 19, 2010 7:58 AM
To: BioC
Subject: [BioC] Question about clustering and cluster validation

Dear all,

in short, I would like to decide whether a certain data set contains
sub-groups (clusters), or is uniform.

There are roughly 500 features and 50 samples. I am looking for
clusters of samples.

There is a clear division in a small number of features (3-4)
indicating the existence of subgroups, and a much less clear situation
in many other features. Pvclust, which I use preferentially (mostly
because it gives me a p-value surrogate), indicates two main clusters
with AU p-values of 99 and 98, and BP p-values of 0 and 1,
respectively.

Clustering with other methods gives contradictory results. I have
tried mclust and several "regular" methods. In short, I am not really
sure whether the sub-groups are real.

On a PCA plot using all features, two clusters can be seen, but they
are not clearly separated. If I assign the samples to the clusters
identified by pvclust and apply randomForest, I can distinguish
between the classes fairly well, but that seems like something one
should rather not do, since the class labels were derived from the
same data.

Furthermore, there is certainly an additional complication: for some
particular features, there is a pre-defined clustering (male vs
female). However, the clusters I am considering are not related to
the difference between sexes.

Is there a statistical test available that would compare the null
hypothesis "there are no sub-clusters" with the alternative hypothesis
"there are two clusters", or "there are no sub-clusters" with "there
are these two particular clusters"?

I was thinking along the following lines: perform X random divisions
of the samples into two groups. For each division, perform a t-test
for each feature and record how many features are significant. Then
see whether the proposed division is significantly better than the
random divisions, the statistic being "number of significantly
different features" or something similar.
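
The random-division idea above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not a tested
analysis pipeline: the function names are made up for this example,
Welch's t statistic stands in for "t-test", and a fixed |t| cutoff
replaces a proper p-value threshold so that the sketch needs nothing
beyond the Python standard library.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two lists of numbers (each of length >= 2)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        return 0.0  # no variation at all: treat as "no difference"
    return (mean(a) - mean(b)) / se2 ** 0.5


def count_separating_features(data, labels, t_cutoff=2.0):
    """Count features whose |t| between groups 0 and 1 exceeds t_cutoff.

    data   : list of samples, each a list of feature values
    labels : one 0/1 group label per sample
    The cutoff is an assumption standing in for a significance level.
    """
    count = 0
    for j in range(len(data[0])):
        g0 = [row[j] for row, lab in zip(data, labels) if lab == 0]
        g1 = [row[j] for row, lab in zip(data, labels) if lab == 1]
        if abs(welch_t(g0, g1)) > t_cutoff:
            count += 1
    return count


def permutation_p_value(data, labels, n_perm=1000, t_cutoff=2.0, seed=0):
    """Compare the proposed division with random divisions of equal sizes.

    Returns (observed count, permutation p-value). The +1 terms keep the
    p-value strictly positive for a finite number of permutations.
    """
    rng = random.Random(seed)
    observed = count_separating_features(data, labels, t_cutoff)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random division with the same group sizes
        if count_separating_features(data, shuffled, t_cutoff) >= observed:
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)
```

Here data would be the 50 samples by 500 features matrix and labels
the proposed division. One caveat: if the proposed division was itself
obtained by clustering the same data, it will tend to beat random
divisions by construction, so a small p-value from this procedure
alone would not settle the question.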

Best regards,

January

--
-------- Dr. January Weiner 3 --------------------------------------

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
