[BioC] clustering question

Thu Feb 23 13:26:40 CET 2006

On 2/20/06 9:58 AM, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:

> Sean,
> 
> Thank you for the reply. Would you be able to provide a brief code chunk
> for the 1-correlation function you describe?
> 
> Also, would anyone like to comment on the more bioinformatic slant of my
> question, i.e. do you gain more knowledge about the system by clustering
> using 1-correlation or by hclust? As a biologist, it seems to me that we
> are more often interested in finding genes that behave similarly between
> samples rather than genes with similar mean expression.

Clustering algorithms often take as input a matrix of dissimilarities (how
different the things being clustered are).  hclust itself clusters whatever
dissimilarity object you hand it; the usual default, from dist() (which
heatmap also uses unless told otherwise), is Euclidean distance.  You can
use any dissimilarity measure you like, such as:

 plot(hclust(as.dist(1-cor(mymatrix))))

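If you want to get both correlated and anticorrelated genes, then use
1-abs(cor(...)).  Note that cor() computes correlations between columns, so
if your genes are in rows, use cor(t(mymatrix)) to cluster genes.
Hopefully, you get the idea.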
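
As a slightly fuller sketch of the same idea: the matrix below is simulated,
and the assumption that genes are in rows and samples in columns is mine, not
something stated in the thread.  The names mymatrix, d_cor and d_abs are
just placeholders.

```r
# Simulated stand-in for an expression matrix: 20 genes (rows) x 8 samples (columns).
set.seed(1)
mymatrix <- matrix(rnorm(20 * 8), nrow = 20,
                   dimnames = list(paste0("gene", 1:20), paste0("s", 1:8)))
# gene2 follows gene1 at a much higher level; gene3 is the mirror image of gene1.
mymatrix[2, ] <- mymatrix[1, ] + 10
mymatrix[3, ] <- -mymatrix[1, ]

# cor() works column-wise, so transpose to correlate genes (rows).
gene_cor <- cor(t(mymatrix))

d_cor <- as.dist(1 - gene_cor)       # only positively correlated genes are close
d_abs <- as.dist(1 - abs(gene_cor))  # correlated and anticorrelated genes are close

hc_cor <- hclust(d_cor)
hc_abs <- hclust(d_abs)
plot(hc_abs)
```

With 1-cor, gene1 and gene2 sit at distance 0 despite the difference in level,
while gene1 and gene3 are as far apart as possible; with 1-abs(cor) all three
end up together.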

> If this is true, I wonder if the developers of GOClust could comment on
> the clustering algorithms included as options in their package, which
> are Clara, Hclust, Kmeans, and Pam. Again, as a biologist, I believe
> that GO clustering would be most appropriately done on genes that
> behave similarly between samples, rather than have similar mean
> expression. If, for example, two genes are exactly inversely
> proportional, then they should cluster right next to each other as they
> may be co-regulated.

You probably need to read the help pages for each of these different
clustering methods carefully if you are concerned about the details.  If I
am not mistaken, GOClust simply uses clara, kmeans, etc. from other
packages to perform the clustering, so reading the corresponding help pages
will likely be enlightening.
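
For what it is worth, here is a rough sketch of how one might get
correlation-style clustering out of those methods when calling them directly.
It assumes the cluster package is installed and uses simulated data; how
GOClust itself calls these functions is not something I have checked.  pam()
accepts a dissimilarity object, whereas kmeans() only takes the data, so for
kmeans one can standardize each gene first, since squared Euclidean distance
on standardized rows is proportional to 1-correlation.

```r
library(cluster)  # provides pam() and clara()

set.seed(2)
expr <- matrix(rnorm(30 * 6), nrow = 30)  # hypothetical genes x samples matrix

# pam() can cluster a dissimilarity object directly.
d_cor <- as.dist(1 - cor(t(expr)))
pam_fit <- pam(d_cor, k = 3)

# kmeans() takes only the data matrix, so standardize each gene (row) first.
expr_z <- t(scale(t(expr)))
km_fit <- kmeans(expr_z, centers = 3, nstart = 10)

# Check the identity: squared Euclidean distance between standardized rows
# equals 2 * (n - 1) * (1 - correlation), where n is the number of samples.
n <- ncol(expr)
all.equal(as.matrix(dist(expr_z))^2, 2 * (n - 1) * as.matrix(d_cor),
          check.attributes = FALSE)
```

The choice of k = 3 is arbitrary here.  Note that this handles 1-cor only; an
exactly anticorrelated pair would still be far apart after row
standardization.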

> I feel fairly confident in my assertions as a biologist, but I am not a
> mathematician, and, if I am misunderstanding how clustering works under
> these various algorithms, please correct me.
> 
> Thanks,
> 
> Mark
> 
> 
> 
> Mark W. Kimpel, M.D.
> -----Original Message-----
> From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
> Sent: Monday, February 20, 2006 8:04 AM
> To: Kimpel, Mark William; Bioconductor
> Subject: Re: [BioC] clustering question
> 
> 
> 
> 
> On 2/19/06 23:23, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:
> 
>> I have a general question about clustering of genomic data. The heatmaps
>> that are generated are usually scaled row-wise so that variations are
>> apparent within rows but not between rows. In looking at the
>> documentation of heatmap and hclust, however, it appears that this
>> scaling is done after the actual clustering is performed. If heatmap is
>> performed on the hclust object with scale="none", it is apparent that
>> most of the row clustering is based on overall gene expression levels,
>> not on similar column-wise behavior between rows.
>> 
>> Wouldn't it make sense to scale row-wise before clustering so that the
>> row clusters are based more on the correlation of the behavior of rows
>> between columns, i.e. two genes would be near each other if the genes
>> behaved similarly across samples? I realize that some of this effect may
>> be achieved with unscaled data, but it seems to me that the large
>> overall expression differences may minimize that.
> 
> Mark,
> 
> If I understand you correctly, you might want to look at the "distfun"
> argument to heatmap. The distfun argument allows you to use any
> dissimilarity function that you like, including 1-correlation.
> 
> Sean
> 
>
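
The distfun suggestion quoted above might look something like the sketch
below.  The data are simulated, and cor_dist is a name made up for
illustration.  heatmap() applies distfun to the matrix for the row dendrogram
and to its transpose for the column dendrogram, so the function has to
correlate the rows of whatever it is given.

```r
set.seed(3)
expr <- matrix(rnorm(25 * 6), nrow = 25,
               dimnames = list(paste0("gene", 1:25), paste0("s", 1:6)))
# Give each gene a different overall level (the vector recycles down the rows).
expr <- expr + seq(2, 12, length.out = 25)

# 1 - correlation between the rows of x, returned as a "dist" object.
cor_dist <- function(x) as.dist(1 - cor(t(x)))

# Row scaling here only affects the colors; the dendrograms come from cor_dist.
hm <- heatmap(expr, distfun = cor_dist, scale = "row")
```

Because correlation ignores a gene's overall level, the row dendrogram no
longer groups genes by mean expression, which is the effect asked about in
the original question.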