helprhelp at gmail.com
Tue Jul 26 21:38:07 CEST 2005
You are right, it IS too general. I should probably have asked "what
kind of clustering algorithms or functions are available in R", which
might be easier to answer. But that I can probably find out by googling
or by using help() in R. What I really want to know is how well clustering
performs on this kind of problem, and I hope someone who has faced a
similar situation can share their experience. I will share my own
experience later :)
As to the reason for using down-sampling here: it is one of the
straightforward ways to deal with classification problems on imbalanced
data. In my understanding of classification problems, two things, among
others, are important: feature construction/selection and sample
selection. I had an idea (which others may already have discovered) that
finding the best subset of features for clustering (the one giving the
highest inter-cluster dissimilarity and the highest intra-cluster
similarity) might help the subsequent classification step. I quickly read
through the abstract of your paper, and I think your approach there
applies feature selection (using p instead of n), while in my
proposal I would like to try both.
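
To make the sample-selection half of this idea concrete, here is a minimal sketch of medoid-based down-sampling of the majority class, following the suggestion quoted below that medoids are actual data points and therefore form a proper subsample. It is written in plain Python only because implementations outside R were welcomed; in R, pam/clara in package cluster would be the natural choice. The function names (`k_medoids`, `downsample_majority`), the Euclidean distance, and the toy data are all assumptions made for illustration, and the loop is a simple alternating assign/update k-medoids iteration, not the build/swap procedure that pam uses.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids(points, k, dist=euclidean, max_iter=100, seed=0):
    """Return sorted indices of k medoids; each medoid is a row of `points`.

    Simple alternating iteration (assign, then re-pick medoids), which may
    stop in a local optimum. Requires k <= len(points).
    """
    rng = random.Random(seed)
    n = len(points)
    # Pairwise distance matrix; fine for small n, quadratic in memory.
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest medoid.
        # A medoid always stays in its own cluster, so no cluster is empty.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = i if i in clusters else min(medoids, key=lambda m: d[i][m])
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(d[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


def downsample_majority(X, y, majority_label, k, seed=0):
    """Keep k medoids of the majority class plus every other row."""
    maj = [i for i, label in enumerate(y) if label == majority_label]
    medoid_pos = k_medoids([X[i] for i in maj], k, seed=seed)
    keep = sorted(
        [maj[j] for j in medoid_pos]
        + [i for i, label in enumerate(y) if label != majority_label]
    )
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): six majority rows in two groups, two minority rows.
X_demo = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (2.0, 2.0), (2.1, 2.0)]
y_demo = ["maj"] * 6 + ["min"] * 2
X_small, y_small = downsample_majority(X_demo, y_demo, "maj", k=2)
print(len(X_small), "rows kept out of", len(X_demo))
```

The `dist` argument is left pluggable because the choice of distance is exactly the open question in this thread; trying both ideas would amount to restricting each row to a candidate feature subset before calling `k_medoids`.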
Thanks for any further advice!
On 7/26/05, Christian Hennig <chrish at stats.ucl.ac.uk> wrote:
> Dear Weiwei,
> your question sounds a bit too general and complicated for the R-list.
> Perhaps you should look for personal statistical advice.
> The quality of methods (and especially distance choice) for down-sampling
> certainly depends on the structure of the data set. I do not see at the moment why
> you need any down-sampling at all, and you should find out first if and
> why it's a good thing to do (by whatever method).
> An obvious candidate for a clustering algorithm would be pam/clara in
> package cluster, because this approach chooses points already in the data
> set as cluster centroids (and produces therefore a proper subsample),
> which does not apply to most other clustering methods.
> However, in
> C. Hennig and L. J. Latecki: The choice of vantage objects for image
> retrieval. Pattern Recognition 36 (2003), 2187-2196.
> the clustering approach has been clearly outperformed by some stepwise
> selection approaches for down-sampling - admittedly in a different kind of
> problem, but I think that the reasons for this may apply also to your
> You can compare different clusterings (or choices of a subset) by
> cross-validation or
> bootstrap applied to the resulting decision tree in the classification
> On Mon, 25 Jul 2005, Weiwei Shi wrote:
> > Dear listers:
> > Here I have a question on clustering methods available in R. I am
> > trying to down-sample the majority class in a classification problem
> > on an imbalanced dataset. Since I don't want to lose information in
> > the original dataset, I don't want to use naive down-sampling: I think
> > using clustering on the majority class' side to select
> > "representative" samples might help. So, my question is, which
> > clustering method should be tested to get the best result. I think the
> > key thing might be the selection of "distance" considering the next
> > step in which I would like to use decision trees.
> > Please share your experience in using clustering (Any available
> > implementation outside R is also welcome)
> > weiwei
> > --
> > Weiwei Shi, Ph.D
> > "Did you always know?"
> > "No, I did not. But I believed..."
> > ---Matrix III
> *** NEW ADDRESS! ***
> Christian Hennig
> University College London, Department of Statistical Science
> Gower St., London WC1E 6BT, phone +44 207 679 1698
> chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche