[BioC] HOPACH clustering of genes

Shannon, William WSHANNON at dom.wustl.edu
Mon Jul 21 16:26:48 CEST 2008


You may want to look at k-means clustering instead of hierarchical clustering if you are interested in genes with correlated expression patterns across the samples. Imposing a hierarchical structure/model on 10,000 genes is probably incorrect: genes A and B may be correlated but independent in terms of function, evolutionary history, pathway, etc.
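
As a rough illustration of the k-means suggestion, here is a minimal sketch in plain Python (standard library only). It rests on one fact: if each gene's profile is centered and scaled to unit length, the squared Euclidean distance between two genes equals 2 * (1 - r), where r is their Pearson correlation, so k-means on the standardized profiles groups genes by correlated pattern. The helper names (standardize, sq_dist, kmeans), the toy data, and the deterministic farthest-point seeding are all assumptions made for this example. They are not part of hopach or of R's kmeans.

```python
import math


def standardize(profile):
    """Center one gene's profile and scale it to unit length."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [v / norm for v in centered]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-point seeding (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data: three rising and three falling genes over five samples.
rising = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [2, 4, 5, 8, 11]]
falling = [[5, 4, 3, 2, 1], [50, 40, 30, 20, 10], [9, 7, 6, 3, 1]]
genes = rising + falling

labels = kmeans([standardize(g) for g in genes], k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the second gene in each group is ten times larger in magnitude than the first but lands in the same cluster, because the standardization removes level and scale and leaves only the pattern.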

In terms of how long it takes, you would have to calculate a distance matrix with 10000*9999/2 = 49,995,000 elements. My best suggestion is to start the distance calculation and see if it finishes in a reasonable amount of time.
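
The arithmetic above, plus a crude way to "start it and see" on a small scale, can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: the function names are hypothetical, the storage figure assumes only the lower triangle is kept as 8-byte doubles, and the timing runs on random pilot data with Euclidean distance. Pure Python is far slower than compiled code, so the extrapolated time is a pessimistic guide to scaling, not a prediction for hopach.

```python
import random
import time


def n_pairs(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def lower_triangle_bytes(n, bytes_per_value=8):
    """Storage for one value per pair (assumes 8-byte doubles)."""
    return n_pairs(n) * bytes_per_value


def estimate_seconds(n_full, n_samples, n_pilot=200, seed=0):
    """Time all pairwise Euclidean distances on a random pilot set,
    then scale the elapsed time by the ratio of pair counts."""
    rng = random.Random(seed)
    pilot = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_pilot)
    ]
    start = time.perf_counter()
    for i in range(n_pilot):
        for j in range(i):
            sum((a - b) ** 2 for a, b in zip(pilot[i], pilot[j])) ** 0.5
    elapsed = time.perf_counter() - start
    return elapsed * n_pairs(n_full) / n_pairs(n_pilot)


print(n_pairs(10000))                     # 49995000 gene pairs
print(lower_triangle_bytes(10000) / 1e6)  # 399.96 (megabytes)
print(n_pairs(81))                        # 3240 array pairs
print(estimate_seconds(10000, 81))        # machine-dependent, in seconds
```

Under these assumptions the gene-by-gene distances need roughly 400 MB, which fits comfortably in 24 GB, and the cost grows with the number of gene pairs rather than with ( 81 * 10000 ) ^ 2.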

Bill Shannon, PhD
Associate Professor of Biostatistics in Medicine
Washington University in St Louis

President-elect, Classification Society

________________________________________
From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Nathan Harmston [iwanttobeabadger at googlemail.com]
Sent: Monday, July 21, 2008 8:55 AM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] HOPACH clustering of genes

Hi,

I'm currently trying to run some clustering on some expression arrays and I
was wondering about the best way of doing it. I have 81 samples on
hgu133plus2 (55000), which I have filtered down to approximately 10000
(removing X, Y, low variability, and control probes), and I wanted to try
hierarchical clustering on these, both by arrays and by genes. I was planning
on using hopach as this seems an easy and obvious choice. How long would such
a lot of comparisons take? I make it something like ( 81 * 10000 ) ^ 2
comparisons, and I have a machine with 24 GB of memory. Has anybody ever done
something like this before, and how long did it actually take? Given that it
might take a while, are there any suggestions for how I might decrease the
running time for such a program? I am already creating the distance matrix
prior to clustering.

Why is it better to use cosangle for gene clustering and euclidean distance
for arrays? Is there a good reason for this, and why would you use one
distance over another?
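
For reference, the practical difference between the two distances can be seen on a toy example. This is a minimal sketch, assuming that cosangle means one minus the cosine of the angle between two uncentered profiles (worth checking against the hopach documentation). The function names and the three toy profiles are made up for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance; sensitive to overall magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosangle(a, b):
    """One minus the cosine of the angle between a and b (assumed
    definition); unchanged when either profile is rescaled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


low = [1.0, 2.0, 3.0, 4.0]       # rising pattern, low expression
high = [10.0, 20.0, 30.0, 40.0]  # same pattern, ten times the magnitude
flat = [2.5, 2.5, 2.5, 2.5]      # similar level to `low`, no pattern

print(cosangle(low, high), cosangle(low, flat))
print(euclidean(low, high), euclidean(low, flat))
```

Under this definition, cosangle places `low` next to `high` (same shape, different magnitude), while Euclidean distance places `low` next to `flat` (similar magnitude, different shape).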

Many thanks in advance,

Nathan


_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


