[BioC] How to do clustering
Thomas Girke
thomas.girke at ucr.edu
Sun Jun 10 19:34:47 CEST 2007
Here is an example that shows one way of doing this:
# Generate a sample matrix
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# Transpose the matrix if necessary like this: y <- t(y)
# Use the following step if you want to use Pearson correlations as distance method
# instead of the default Euclidean distances.
mydist <- as.dist(1-cor(t(y), method="pearson"))
# PAM (partitioning around medoids) clustering, a more robust relative of k-means that uses actual data points (medoids) as cluster centers. The basic k-means function is kmeans()
library(cluster)
pamy <- pam(mydist, k=3)
pamy$clustering # provides the cluster assignments
plot(pamy) # plots the results
# MDS (multidimensional scaling) to obtain 'meaningful' coordinates for a scatter plot
loc <- cmdscale(mydist)
# Generate a scatter plot for the MDS results where the PAM clusters are labeled by color
mycol <- as.vector(pamy$clustering)
mycol <- rainbow(length(unique(mycol)), start=0.1, end=0.9)[mycol] # color selection steps
plot(loc[,1], loc[,2], pch=20, col=mycol, xlab="", ylab="", main="Scatter Plot")
# Scatter plot with sample labels
plot(loc[,1], loc[,2], type="n", xlab="", ylab="", main="Scatter Plot")
text(loc[,1], loc[,2], col=mycol, rownames(loc), cex=0.8)
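
The example above fixes k=3. One common way to compare candidate values of k is the average silhouette width that pam() reports. The following is only a sketch under assumptions: the range 2:6 and the seed are arbitrary choices for this toy matrix, and the names ks, avgsil and bestk are made up for illustration.

```r
library(cluster)

set.seed(1)  # assumption: fixed seed so the toy data are reproducible
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
mydist <- as.dist(1-cor(t(y), method="pearson"))

# Run PAM for several candidate k values and record the average silhouette width
ks <- 2:6  # assumption: arbitrary candidate range
avgsil <- sapply(ks, function(k) pam(mydist, k=k)$silinfo$avg.width)
names(avgsil) <- ks
avgsil

# The k with the largest average silhouette width is one reasonable candidate
bestk <- ks[which.max(avgsil)]
bestk
```

On random data such as this the silhouette widths will be low for every k, which is itself a hint that there is no real cluster structure.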
More detailed instructions on basic clustering methods in R can be found on this page:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#R_clustering
Thomas
On Sun 06/10/07 02:55, ssls sddd wrote:
> Dear Bill,
> On 6/9/07, William Shannon <william.shannon at sbcglobal.net> wrote:
> > It depends on your goal for the analysis.
> >
> > If you want to find SNPs whose log2(ratios) are similar across
> > the samples then you are done with the analysis after k-means (though you
> > should read the literature on k-means for various ways to select the optimal
> > k). In this case you can extract the names of the SNPs in each of the K
> > clusters directly from the kmeans object.
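
A minimal sketch of that extraction step, assuming the data sit in a matrix x with SNP names as row names. The matrix here is a small random stand-in (200 rows instead of 238804), and centers=4 and the name snps.by.cluster are arbitrary choices for illustration.

```r
set.seed(1)
# Hypothetical stand-in for the SNP matrix: 200 SNPs (rows) x 49 samples (columns)
x <- matrix(rnorm(200 * 49), 200, 49,
            dimnames=list(paste("snp", 1:200, sep=""), paste("s", 1:49, sep="")))

# k-means on the rows; nstart > 1 reruns the algorithm from several random starts
km <- kmeans(x, centers=4, nstart=5)

# km$cluster holds one cluster number per row, so split() groups the row names
snps.by.cluster <- split(rownames(x), km$cluster)
sapply(snps.by.cluster, length)  # cluster sizes
head(snps.by.cluster[[1]])       # first few SNP names in cluster 1

# Each row of km$centers is the mean vector of one cluster across the 49 samples
dim(km$centers)
```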
> >
> > If however you want to go one step further and see how these clusters
> > separate the samples then you could try what we did a long time ago in the
> > paper cited below (I can email you a copy on Monday if you can't access it).
> >
> > In this paper we took the k-means cluster centers and sorted them by
> > their log2(ratio) and looked to see how well they separated 2 (or maybe it
> > was 3) classes of skin samples.
> >
> > A. M. Bowcock, W. Shannon, F. Du, J. Duncan, K. Cao, K. Aftergut, J.
> > Catier, M. A. Fernandez-Vina, and A. Menter
> > *Insights into psoriasis and other inflammatory diseases from large-scale
> > gene expression studies*
> > Hum. Mol. Genet., August 1, 2001; 10(17): 1793 - 1805.
> >
> > Bill
> > *ssls sddd <ssls.sddd at gmail.com>* wrote:
> >
> > Dear Bill,
> >
> > Thanks a lot for the suggestions. Yes, they are Affy SNP data.
> > I used the MantelCorr Package. It worked well. Specifically, the commands
> > I ran are:
> >
> > library(MantelCorr)
> > kmeans.result <- GetClusters(x, 500, 100)
> > DistMatrices.result <- DistMatrices(x, kmeans.result$clusters)
> > MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull,
> > DistMatrices.result$Dsubsets)
> > permuted.pval <- PermutationTest(DistMatrices.result$Dfull,
> > DistMatrices.result$Dsubsets, 100, 49, 0.05)
> > ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes,
> > MantelCorrs.result)
> > ClusterGenes <- ClusterGeneList(kmeans.result$clusters,
> > ClusterLists$SignificantClusters, data)
> >
> > Can you suggest how I can view the results? Is there a way to visualize the
> > clusters?
> >
> > Thanks a lot!
> >
> > Sincerely,
> >
> > Alex
> >
> > > You may want to consider k-means clustering. pvclust appears to be a
> > > hierarchical clustering algorithm (with subsequent p-value estimation),
> > > which is causing the problem.
> > >
> > > Hierarchical clustering uses a pairwise distance matrix to form the tree
> > > dendrogram. With N = 238804 this will require a matrix with N(N-1)/2 or
> > > about (238804^2)/2 elements. That's what causes the memory problem.
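
The arithmetic behind this can be checked in a few lines. This sketch assumes the distances are stored as 8-byte doubles, which is what R's dist() uses; the variable names are made up for illustration.

```r
N <- 238804  # number of rows to cluster
p <- 49      # number of columns

# Number of pairwise distances in the lower triangle of the distance matrix
n.pairs <- N * (N - 1) / 2
n.pairs  # roughly 2.85e10

# Memory for the distance matrix at 8 bytes per value, in GiB
dist.gib <- n.pairs * 8 / 2^30
dist.gib  # well over 200 GiB

# For comparison: the raw N x p data matrix that k-means works on, in MiB
data.mib <- N * p * 8 / 2^20
data.mib  # under 100 MiB
```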
> > >
> > > K-means is not so intensive and will result in clustering the 238804 rows
> > > (I assume they are SNPs), and each cluster will be represented by a mean
> > > vector for the 49 variables.
> > >
> > > If on the other hand you want to cluster the 49 columns you may need to
> > > transpose the data matrix and then run a hierarchical clustering, but I
> > > would look into kmeans first.
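
Clustering the 49 columns is cheap because only a 49 x 49 distance matrix is needed. A minimal sketch, using a small random stand-in for the data; the correlation-based distance, average linkage and k=3 are assumptions chosen for illustration, not something specified in this thread.

```r
set.seed(1)
# Hypothetical stand-in for the data: 200 rows x 49 columns (samples)
x <- matrix(rnorm(200 * 49), 200, 49,
            dimnames=list(paste("snp", 1:200, sep=""), paste("s", 1:49, sep="")))

# cor() correlates columns, so no explicit transpose is needed here;
# for Euclidean distances between columns use dist(t(x)) instead
d <- as.dist(1 - cor(x))

hc <- hclust(d, method="average")  # assumption: average linkage
plot(hc)                           # dendrogram of the 49 samples
groups <- cutree(hc, k=3)          # assumption: cut the tree into 3 groups
table(groups)
```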
> > >
> > > Bill Shannon
> > > Washington Univ. School of Medicine
> > >
> > >
> > > Dear List,
> > >
> > > I have a question to bother you about how to do clustering.
> > > My data consists of 49 columns (49 variables) and 238804 rows.
> > > I would like to do hierarchical clustering (unsupervised clustering
> > > and PCA). So far I tried pvclust
> > > (http://www.is.titech.ac.jp/~shimo/prog/pvclust/)
> > > but R always failed with an error like "cannot allocate the memory".
> > >
> > > I am curious which other packages can perform the clustering analysis
> > > in a memory-efficient way.
> > >
> > > Meanwhile, is there any way that I can extract the features of each
> > > cluster?
> > >
> > > In other words, I would like to identify which features are responsible
> > > for classifying these variables (samples).
> > >
> > > Thanks a lot!
> > >
> > > Sincerely,
> > >
> > > Alex
> > >
