[BioC] clustering question

Thu Feb 23 23:37:12 CET 2006

Dear Mark,
Thanks for your kind remarks.

I think all you want to do is use Euclidean distance after removing 
the gene mean.

ExprNoMean=exprs(myData)-apply(exprs(myData),1,mean)

hclust(dist(ExprNoMean))

or

heatmap(ExprNoMean)

will then do complete linkage clustering on your data using Euclidean distance.
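
For anyone without an eset to hand, here is a minimal sketch of the same idea on a plain matrix. The matrix expr is simulated and only stands in for exprs(myData); the gene names, sample names and noise level are invented for illustration.

```r
# Simulated stand-in for exprs(myData): rows are genes, columns are samples.
set.seed(1)
pattern_up   <- c(0, 1, 2, 3, 4, 5)
pattern_down <- rev(pattern_up)
expr <- rbind(
  geneA = 2  + pattern_up,    # low level, rising
  geneB = 10 + pattern_up,    # high level, rising
  geneC = 2  + pattern_down,  # low level, falling
  geneD = 10 + pattern_down   # high level, falling
)
expr <- expr + rnorm(length(expr), sd = 0.1)  # a little noise
colnames(expr) <- paste0("s", 1:6)

# Remove each gene's mean. rowMeans() gives one value per row, and R
# recycles that vector down the columns, so row i loses its own mean.
expr_no_mean <- expr - rowMeans(expr)

# hclust() uses complete linkage by default; dist() is Euclidean by default.
hc_raw      <- hclust(dist(expr))
hc_centered <- hclust(dist(expr_no_mean))

# Cut each tree into two groups and compare the memberships.
groups_raw      <- cutree(hc_raw, k = 2)
groups_centered <- cutree(hc_centered, k = 2)
print(rbind(raw = groups_raw, centered = groups_centered))
```

On this toy data the uncentered tree pairs genes by overall level (geneA with geneC), while the centered tree pairs them by direction of change (geneA with geneB).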

There are so many clustering options, and I am not very knowledgeable 
about the pros and cons of each, but complete linkage often seems reasonable.

Sorry that I do not have time to look at your data - too much of my 
own right now.

--Naomi

At 04:15 PM 2/23/2006, Kimpel, Mark William wrote:
>Naomi,
>
>
>
>Thanks for addressing my clustering question. I wanted to follow up
>because I'm still not clear on what I need to do and how I need to do
>it.
>
>
>
>Basically, I would like genes to cluster together that behave similarly
>between two samples. I don't care what their absolute level of
>expression is, but, over 12 samples, genes that go up and down a
>similar amount should cluster together.
>
>
>
>I have looked at heatmap, hclust, and dist functions and am still unsure
>how to proceed.
>
>
>
>If, for example, I have an eset that I'm working with, could you provide
>me a code example of how to get what I want?
>
>
>
>I really appreciate your help, not only with this, but all the BioC
>posts that I've learned from.
>
>
>
>Mark
>
>
>
>Mark W. Kimpel, M.D.
>
>   _____
>
>From: Naomi Altman [mailto:naomi at stat.psu.edu]
>Sent: Tuesday, February 21, 2006 11:46 AM
>To: Kimpel, Mark William; Sean Davis; Bioconductor
>Subject: Re: [BioC] clustering question
>
>
>
>hclust can use any distance metric.
>
>1-correlation is one of several metrics you could use.
>
>1-correlation focuses on the "up and down" behavior, but scales each
>gene to have the same standard deviation.
>Euclidean distance focuses more on the overall level of expression.
>Euclidean distance with the mean or median removed focuses on the "up
>and down" behavior, but also considers the magnitude of that behavior.
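
The three metrics described above can be compared on a toy example. This is only a sketch: the three "genes" are invented values, chosen so that one is a shifted copy and one is a stretched copy of the same profile.

```r
# Three toy genes across five samples (made-up values, for illustration).
profile <- c(1, 2, 3, 2, 1)
g <- rbind(
  base   = profile,        # reference profile
  shift  = profile + 10,   # same shape, higher overall level
  scaled = profile * 4     # same shape, larger swings
)

# cor() works on columns, so transpose to correlate genes (rows).
d_cor      <- as.dist(1 - cor(t(g)))   # 1 - Pearson correlation
d_euclid   <- dist(g)                  # Euclidean on the raw values
d_centered <- dist(g - rowMeans(g))    # Euclidean after removing gene means

print(round(as.matrix(d_cor), 3))
print(round(as.matrix(d_euclid), 3))
print(round(as.matrix(d_centered), 3))
```

Here 1-correlation treats all three genes as identical, raw Euclidean distance separates all of them, and mean-centered Euclidean distance puts base and shift together while keeping scaled apart because its swings are larger.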
>
>
>There are many other choices of metric and clustering method.  I don't
>think you can really state that one method or metric produces results
>that are more "biologically meaningful".  I think that depends on what
>you mean by "biologically meaningful".
>
>If you have replicate arrays for the same biological condition, these
>should be averaged before clustering, as you want to cluster based on
>the response to the condition, not on the noise.
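
One possible way to average replicates before clustering is sketched below. The layout (three conditions with two replicate arrays each) and the names expr and condition are assumptions made for the example, not taken from the data discussed in this thread.

```r
# Hypothetical layout: 4 genes, 3 conditions, 2 replicate arrays each.
set.seed(2)
expr <- matrix(rnorm(4 * 6, mean = 8), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("array", 1:6)))
condition <- factor(c("ctrl", "ctrl", "trtA", "trtA", "trtB", "trtB"))

# For each condition, average its replicate columns gene by gene.
# The result has one row per gene and one column per condition.
cond_means <- sapply(levels(condition), function(lv) {
  rowMeans(expr[, condition == lv, drop = FALSE])
})

# Cluster genes on the condition means (gene means removed first).
hc <- hclust(dist(cond_means - rowMeans(cond_means)))
print(cond_means)
```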
>
>The paper below is very readable and sheds a lot of light on these
>issues:
>
>Jenny Bryan, "Problems in gene clustering based on gene expression data",
>Journal of Multivariate Analysis 90 (2004) 44-66
><http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6WK9-4CB0HCH-1-1&_cdi=6901&_user=209810&_orig=search&_coverDate=07/31/2004&_sk=999099998&view=c&wchp=dGLbVtb-zSkWz&md5=6cda9b21f97456578db16db820339ae1&ie=/sdarticle.pdf>
>
>--Naomi
>
>At 09:58 AM 2/20/2006, Kimpel, Mark William wrote:
>
>
>
>Sean,
>
>Thank you for the reply. Would you be able to provide a brief code chunk
>for the 1-correlation function you describe?
>
>Also, would anyone like to comment on the more bioinformatic slant of my
>question, i.e. do you gain more knowledge about the system by clustering
>using 1-correlation or by hclust? As a biologist, it seems to me that we
>are more often interested in finding genes that behave similarly between
>samples rather than genes with similar mean expression.
>
>If this is true, I wonder if the developers of GOClust could comment on
>the clustering algorithms included as options in their package, which
>are Clara, Hclust, Kmeans, and Pam. Again, as a biologist, I believe
>that GO clustering would be most appropriately done on genes that
>behave similarly between samples, rather than have similar mean
>expression. If, for example, two genes are exactly inversely
>proportional, then they should cluster right next to each other as they
>may be co-regulated.
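
A note on mirror-image genes: with plain 1-correlation, two genes with perfect negative correlation are as far apart as possible, not next to each other. If they should cluster together, one common option is 1-|correlation|. The sketch below uses invented values; "inversely proportional" is read here as "perfectly negatively correlated", which is an assumption about the intended meaning.

```r
up   <- c(1, 2, 3, 4, 5, 6)
expr <- rbind(
  geneUp    = up,
  geneDown  = 7 - up,               # mirror image of geneUp
  geneOther = c(3, 1, 4, 1, 5, 2)   # unrelated pattern
)

r <- cor(t(expr))                   # gene-by-gene Pearson correlation

d_signed   <- as.dist(1 - r)        # mirror-image genes end up far apart
d_unsigned <- as.dist(1 - abs(r))   # mirror-image genes end up together

print(round(as.matrix(d_signed), 3))
print(round(as.matrix(d_unsigned), 3))
```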
>
>I feel fairly confident in my assertions as a biologist, but I am not a
>mathematician, and, if I am misunderstanding how clustering works under
>these various algorithms, please correct me.
>
>Thanks,
>
>Mark
>
>
>
>Mark W. Kimpel, M.D.
>-----Original Message-----
>From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
>Sent: Monday, February 20, 2006 8:04 AM
>To: Kimpel, Mark William; Bioconductor
>Subject: Re: [BioC] clustering question
>
>
>
>
>On 2/19/06 23:23, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:
>
> > I have a general question about clustering of genomic data. The
> > heatmaps that are generated are usually scaled row-wise so that
> > variations are apparent within rows but not between rows. In looking
> > at the documentation of heatmap and hclust, however, it appears that
> > this scaling is done after the actual clustering is performed. If
> > heatmap is performed on the hclust object with scale="none", it is
> > apparent that most of the row clustering is based on overall gene
> > expression levels, not on similar column-wise behavior between rows.
> >
> > Wouldn't it make sense to scale row-wise before clustering so that the
> > row clusters are based more on the correlation of the behavior of rows
> > between columns, i.e. two genes would be near each other if the genes
> > behaved similarly across samples? I realize that some of this effect
> > may be achieved with unscaled data, but it seems to me that the large
> > overall expression differences may minimize that.
>
>Mark,
>
>If I understand you correctly, you might want to look at the "distfun"
>argument to heatmap. The distfun argument allows you to use any
>dissimilarity function that you like, including 1-correlation.
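
A minimal sketch of a 1-correlation function passed to heatmap through distfun follows. The helper name cor_dist and the random matrix are made up for the example. Note that heatmap applies the same distfun to the columns as well (on the transposed matrix), so the samples are also clustered by 1-correlation here.

```r
# Random stand-in data: 20 genes (rows) by 6 samples (columns).
set.seed(3)
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("s", 1:6)))

# Hypothetical helper: 1 - Pearson correlation between the rows of x.
# cor() correlates columns, hence the transpose.
cor_dist <- function(x) as.dist(1 - cor(t(x)))

# heatmap() calls distfun on the rows and, separately, on the columns.
# scale = "row" only affects the colours, not the clustering.
hm <- heatmap(expr, distfun = cor_dist, scale = "row")
```

The same function can be used directly with hclust, for example hclust(cor_dist(expr)), if only the gene dendrogram is wanted.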
>
>Sean
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>
>Naomi S. Altman                                814-865-3791 (voice)
>Associate Professor
>Dept. of Statistics                              814-863-7114 (fax)
>Penn State University                         814-865-1348 (Statistics)
>University Park, PA 16802-2111

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111