[BioC] clustering question

Mon Feb 20 15:58:52 CET 2006

Sean,

Thank you for the reply. Would you be able to provide a brief code chunk
for the 1-correlation function you describe?

Also, would anyone like to comment on the more bioinformatic slant of my
question, i.e. do you learn more about the system by clustering on
1-correlation or on the default Euclidean distance passed to hclust? As a
biologist, it seems to me that we are more often interested in finding
genes that behave similarly across samples than genes with similar mean
expression.

If this is true, I wonder if the developers of GOClust could comment on
the clustering algorithms included as options in their package, which
are Clara, Hclust, Kmeans, and Pam. Again, as a biologist, I believe
that GO clustering would be most appropriately done on genes that
behave similarly between samples, rather than on genes that have similar
mean expression. If, for example, two genes are exactly inversely
proportional, then they should cluster right next to each other as they
may be co-regulated.
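
One point worth illustrating here: plain 1-correlation places perfectly anti-correlated genes as far apart as possible, not next to each other. A minimal sketch, reading "inversely proportional" as "perfectly anti-correlated" (an assumption) and using made-up data and hypothetical gene names, shows how 1-|correlation| would give the behavior described above:

```r
# Toy data: three hypothetical genes measured in 10 samples.
set.seed(3)
a <- rnorm(10)
expr <- rbind(geneA = a,
              geneB = -a,          # exact mirror image of geneA
              geneC = rnorm(10))   # unrelated gene

# cor() works on columns, so transpose to correlate genes (rows).
d_signed <- 1 - cor(t(expr))        # signed: anti-correlated genes are far apart
d_abs    <- 1 - abs(cor(t(expr)))   # unsigned: anti-correlated genes are close

d_signed["geneA", "geneB"]   # close to 2, the largest possible distance
d_abs["geneA", "geneB"]      # close to 0, the pair is treated as near-identical
```

Whether the signed or the absolute version is appropriate depends on whether opposite behavior should count as "similar" for the biological question at hand.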

I feel fairly confident in my assertions as a biologist, but I am not a
mathematician, and, if I am misunderstanding how clustering works under
these various algorithms, please correct me.

Thanks,

Mark

Mark W. Kimpel, M.D.
-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov] 
Sent: Monday, February 20, 2006 8:04 AM
To: Kimpel, Mark William; Bioconductor
Subject: Re: [BioC] clustering question

On 2/19/06 23:23, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:

> I have a general question about clustering of genomic data. The heatmaps
> that are generated are usually scaled row-wise so that variations are
> apparent within rows but not between rows. In looking at the
> documentation of heatmap and hclust, however, it appears that this
> scaling is done after the actual clustering is performed. If heatmap is
> performed on the hclust object with scale="none", it is apparent that
> most of the row clustering is based on overall gene expression levels,
> not on similar column-wise behavior between rows.
> 
> Wouldn't it make sense to scale row-wise before clustering so that the
> row clusters are based more on the correlation of the behavior of rows
> between columns, i.e. two genes would be near each other if the genes
> behaved similarly across samples? I realize that some of this effect may
> be achieved with unscaled data, but it seems to me that the large
> overall expression differences may minimize that.
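
The idea of scaling rows before clustering can be sketched as follows. This is only an illustration on simulated data (the matrix and its per-gene offsets are made up). It also checks a standard identity: for rows standardized to mean 0 and standard deviation 1, the squared Euclidean distance equals 2(n-1)(1-r), where r is the Pearson correlation and n the number of samples, so the two approaches rank gene pairs identically.

```r
# Simulated expression: 20 genes x 6 samples, with very different mean levels.
set.seed(2)
expr <- matrix(rnorm(20 * 6), nrow = 20)
expr <- expr + seq(2, 12, length.out = 20)   # recycled down columns: one offset per gene

# Standardize each gene (row); scale() works on columns, hence the transposes.
z <- t(scale(t(expr)))

d_scaled <- dist(z)                     # Euclidean distance on scaled rows
d_cor    <- as.dist(1 - cor(t(expr)))   # 1 - Pearson correlation between genes

# Check the identity d^2 = 2 * (n - 1) * (1 - r)
n <- ncol(expr)
same <- all.equal(as.vector(d_scaled)^2, 2 * (n - 1) * as.vector(d_cor))

# Cluster genes on their scaled profiles rather than on raw expression level.
hc <- hclust(d_scaled)
```

The resulting tree can be handed to heatmap through its Rowv argument, e.g. `heatmap(expr, Rowv = as.dendrogram(hc))`.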

Mark,

If I understand you correctly, you might want to look at the "distfun"
argument to heatmap. The distfun argument lets you supply any
dissimilarity function you like, including 1-correlation.
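
A minimal sketch of such a function, using simulated data (the matrix and the name cor_dist are placeholders, not part of any package):

```r
# Simulated expression matrix: 50 genes (rows) x 8 samples (columns).
set.seed(1)
expr <- matrix(rnorm(50 * 8), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:8)))

# cor() correlates columns, so transpose to get row-by-row correlations.
# as.dist() converts the square matrix into the "dist" object hclust expects.
cor_dist <- function(x) as.dist(1 - cor(t(x)))

# heatmap() applies distfun to the rows, and to the transposed matrix for
# the columns, so the same function serves for both dendrograms.
heatmap(expr, distfun = cor_dist, scale = "row")
```

With this, the row dendrogram is built from the correlation of expression profiles, and scale = "row" only affects the colors that are drawn.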

Sean