[BioC] Hierarchical clustering and shrinking centroids...

Tom R. Fahland tfahland at genomatica.com
Wed May 26 01:52:59 CEST 2004


Tan

I have been doing a lot of classification using PAMR, as well as LDA and
SVMs.
The overused phrase "the data is what it is" is valid here. I look at
highly correlated samples that misclassify, and they are usually the
same with different classification algorithms. Sometimes I also don't get
really good stability with different gene lists. Hierarchical clustering
uses simple correlation metrics, so starting from this can be problematic. I
know I really didn't answer anything, but thought sharing my experience
might help.
Tom 
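
One way a correlation metric can be problematic as a starting point is that it compares only the shape of two profiles and ignores their magnitude. The sketch below is an illustration only, assuming the clustering distance is 1 minus the Pearson correlation; the thread does not say which metric was actually used, and the function names are made up for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_distance(x, y):
    """Assumed clustering distance: 0 for perfectly correlated profiles,
    2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Two profiles with the same shape but a tenfold difference in level
# are treated as (almost exactly) identical by this distance.
low = [1.0, 2.0, 3.0, 2.5]
high = [10.0, 20.0, 30.0, 25.0]
same_shape = correlation_distance(low, high)
opposite = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Samples that end up in one cluster under such a distance can therefore still differ in ways a classifier later picks up, and vice versa.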


-----Original Message-----
From: Tan, MinHan [mailto:MinHan.Tan at vai.org] 
Sent: Monday, May 24, 2004 18:57
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Hierarchical clustering and shrinking centroids...


Dear list members,
 
I have been unable to resolve this conceptual problem. 
 
I performed hierarchical clustering on a filtered sample (cv=0.04, at
least 2 samples > level of log 9) of 80 tumor samples, and obtained
several groups. Some of these clusters were definitely more stable than
others. Subsequently, based on visual inspection, and my knowledge of
the case outcomes, I arbitrarily classified one large cluster as 'good
prognosis' and other clusters as 'bad prognosis'. 
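
The gene filter described above could look roughly like the sketch below. It rests on assumptions the message does not spell out: that "cv=0.04" means a coefficient of variation of at least 0.04, that "log 9" is a log-scale expression value of 9, and that the filter is applied per gene across samples. The function name and parameters are hypothetical.

```python
from statistics import mean, stdev

def passes_filter(values, min_cv=0.04, min_level=9.0, min_samples=2):
    """Return True if one gene's log-scale expression values pass the filter.

    Assumed reading of the filter: coefficient of variation >= min_cv and
    at least min_samples samples with expression above min_level.
    """
    m = mean(values)
    if m == 0:
        return False  # coefficient of variation is undefined for a zero mean
    cv = stdev(values) / abs(m)
    n_high = sum(1 for v in values if v > min_level)
    return cv >= min_cv and n_high >= min_samples

variable_gene = [8.0, 9.5, 10.2, 7.9]   # varies, two samples above 9
flat_gene = [9.5, 9.5, 9.5, 9.5]        # above 9 but no variation
low_gene = [5.0, 6.0, 9.5, 5.5]         # varies, only one sample above 9

kept = [passes_filter(g) for g in (variable_gene, flat_gene, low_gene)]
```

Note that a filter like this uses no outcome information, so it does not by itself bias the later cross-validation.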
 
Using the classification obtained above, I did a supervised analysis
using PAMR to obtain a gene list. However, the misclassification rate
during cross-validation for my good prognosis class is fairly low and stable
(<0.05) throughout the shrinking gene list, but the misclassification
rate for my poor prognosis class is relatively higher, and also fairly
stable (approx. 0.2). I examined the classification of my cases, and some
'poor prognosis' cases seemed to be persistently recognized as 'good
prognosis' cases. Evidently, there is some problem with the
classification arising from the choice of algorithm. I have tried
k-nearest neighbour, and the same problem occurs. Looking again at the HC
tree, some of these good/bad prognosis genes are clustered together,
suggesting other genes 

I wonder how I may explain this. I suppose the clustering of these
cases is determined by genes other than those differentiating between
these two major groups. Naturally, validation by an independent set is
ideal, but I guess my question is more on this problem of
cross-validation. 
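
The per-class cross-validation behaviour described above can be reproduced on toy data. The sketch below is not PAMR: it swaps in a plain nearest-centroid classifier without shrinkage, uses leave-one-out cross-validation, and runs on made-up two-gene data in which one case labelled 'poor' sits among the 'good' cases. All names and numbers are hypothetical.

```python
from collections import Counter

def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_predictions(samples, labels):
    """Leave-one-out predictions from a nearest-centroid classifier."""
    preds = []
    for i, held_out in enumerate(samples):
        groups = {}
        for j, (s, lab) in enumerate(zip(samples, labels)):
            if j != i:  # the held-out sample never contributes to a centroid
                groups.setdefault(lab, []).append(s)
        centroids = {lab: centroid(rows) for lab, rows in groups.items()}
        preds.append(min(centroids, key=lambda lab: sq_dist(held_out, centroids[lab])))
    return preds

def per_class_error(true_labels, predicted_labels):
    """Misclassification rate computed separately for each true class."""
    totals = Counter(true_labels)
    errors = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    return {cls: errors[cls] / totals[cls] for cls in totals}

# Toy data: the last case is labelled 'poor' but lies with the 'good' cases.
samples = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
           [5.0, 5.0], [5.2, 4.8], [1.1, 1.0]]
labels = ["good", "good", "good", "poor", "poor", "poor"]

preds = loo_predictions(samples, labels)
rates = per_class_error(labels, preds)
```

In this toy setting the error is confined to the 'poor' class and comes from the label of a single case rather than from the classifier, which is one possible reading of why the same cases are misclassified by different algorithms.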
 
I would appreciate any advice, or pointers to any references for this!
 
Thanks.
 
Min-Han Tan

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor


