[R] clustering

WeiWei Shi helprhelp at gmail.com
Fri Jan 28 06:19:35 CET 2005


Actually, the problem I am trying to solve is to discretize a
continuous variable (my response, i.e. dependent, variable in this
project) so that I can turn a regression problem into a
classification one. (There are many reasons for doing this.)
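
As a minimal sketch of what such a discretization could look like in R,
assuming simulated data shaped like group3 from the original question
further down the thread: the cut point 0.5 and the labels "low" and
"high" are placeholders chosen only for illustration, not values taken
from the real data.

```r
# Simulated stand-in for the continuous response (the real data are not shown).
set.seed(1)
y <- c(rnorm(n = 50, mean = 0, sd = 1), rnorm(n = 20, mean = 1, sd = 1.5))

# Assumed cut point; in practice it would come from clustering or a mixture fit.
breaks <- c(-Inf, 0.5, Inf)

# cut() assigns each value to the interval that contains it and returns a factor.
y_class <- cut(y, breaks = breaks, labels = c("low", "high"))

# Class counts after discretization.
table(y_class)
```

The resulting factor y_class could then serve as the class variable for a classifier.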

Since there is no class label for this variable (because this variable
is my class variable :), an unsupervised approach can be applied
here. However, checking the related papers shows there is little
research in this field (to my knowledge, and I haven't checked the MCC
yet). A normal Q-Q plot (qqnorm) and a histogram suggest that the
variable might be a mixture of two normal distributions.

My approach is to split the values of this variable into 2 or 3
intervals and check each interval's normality again. If some approach
like clustering or the one Andy suggests works well, then each interval
should look much closer to normal. I will try that tomorrow.
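
A rough sketch of that check, again on simulated data like group3
rather than the real variable. It assumes kmeans with two centers as
the splitting method and the Shapiro-Wilk test as the normality check;
both are illustrative choices, not the only options. Note that a hard
split truncates each component, so a test may still flag
non-normality even when the mixture assumption is right.

```r
# Simulated stand-in for the response variable.
set.seed(1)
y <- c(rnorm(n = 50, mean = 0, sd = 1), rnorm(n = 20, mean = 1, sd = 1.5))

# k-means on a single variable; centers = 2 is the assumed number of groups.
km <- kmeans(y, centers = 2, nstart = 10)

# In one dimension the two clusters are intervals; the boundary is taken
# here as the midpoint between the two cluster centers.
boundary <- mean(km$centers)

# Shapiro-Wilk p-value within each cluster (the test needs at least 3 values).
p_values <- tapply(y, km$cluster, function(v) {
  if (length(v) >= 3) shapiro.test(v)$p.value else NA_real_
})

boundary
p_values
```

Small p-values would suggest that an interval still departs from normality.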

I am not sure whether this idea will work here; any advice is welcome!

Thanks,

Ed


On Thu, 27 Jan 2005 18:58:28 -0500, Liaw, Andy <andy_liaw at merck.com> wrote:
> It depends a lot on what you know or don't know about the data, and what
> problem you're trying to solve.
> 
> If you know for sure it's a mixture of gaussians, likelihood based
> approaches might be better.  MASS (the book) has an example of fitting
> univariate mixture of gaussians using various optimizers.  The code is even
> in $R_HOME/library/MASS/scripts/ch16.R.
> 
> Andy
> 
> > From: WeiWei Shi
> >
> > Hi,
> > Thanks for the reply. In fact, I tried both of them, and I also tried
> > the other method, and I found that they all gave me different
> > boundaries (on my real datasets). I am thinking about k-median but
> > hoping to get more suggestions from all of you in this forum.
> >
> > Cheers,
> >
> > Ed
> >
> >
> > On Thu, 27 Jan 2005 15:37:16 -0600, msck9 at mizzou.edu
> > <msck9 at mizzou.edu> wrote:
> > > The cluster analysis should be able to handle that. I think if you
> > > know how many clusters you have, "kmeans" is ok, or the EM algorithm
> > > can also do that.
> > > On Thu, Jan 27, 2005 at 03:44:42PM -0500, WeiWei Shi wrote:
> > > > Hi,
> > > > I just have a question (sorry if it is a dumb one), and I phrase
> > > > my question in the following R code:
> > > >
> > > > group1<-rnorm(n=50, mean=0, sd=1)
> > > > group2<-rnorm(n=20, mean=1, sd=1.5)
> > > > group3<-c(group1,group2)
> > > >
> > > >
> > > > Now, if I am given a dataset from group3, what method (discriminant
> > > > analysis, clustering, maybe) is the best to cluster them by using R.
> > > > The known info includes: 2 clusters, normal distribution (but the
> > > > parameters are unknown).
> > > >
> > > > Thanks,
> > > >
> > > > Ed
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> > >
> >
> > 



