# [R] Using statistical test to distinguish two groups

Liaw, Andy andy_liaw at merck.com
Thu May 6 16:03:24 CEST 2010

I can't resist being a bit philosophical here.  I guess that's one sign of aging...

You can't form hypotheses and "prove" them with the same data, even if you use different statistical (or other) methods for the two steps.  That, to me, is a self-fulfilling prophecy.
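
A small simulation may make the point concrete. This is only a sketch under assumed settings (100 standard normal draws, kmeans with two centers, a Welch t-test), none of which come from the original question: the data contain no real groups, yet a test on the groups that clustering itself created comes out wildly "significant".

```r
set.seed(1)
x <- rnorm(100)                       # one homogeneous sample, no true groups

## Step 1: "form the hypothesis" by clustering the data into two groups
km <- kmeans(x, centers = 2, nstart = 10)
cl <- factor(km$cluster)

## Step 2: "prove" it by testing the same data, split by those clusters
p <- t.test(x ~ cl)$p.value
p                                     # tiny, although there is only one group
```

The small p-value only says that kmeans put the low values in one group and the high values in the other, which is what it was asked to do.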

I get the feeling that many users of statistics have the tendency to want to find a statistical test as a way of objectively verifying something.  This is probably a good indicator of the urgent need to promote statistical thinking.  P values are not holy water.

Since you have not described the context of your problem, it's hard to give you better advice, but there are methods for estimating the number of clusters in a set of data (or, somewhat equivalently, estimating the number of components in a mixture model).  You may want to look into those and see whether such an approach meets your needs.
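
As one possible illustration (an assumption on my part, not the only method and not necessarily the best one for your problem), the gap statistic in the recommended package cluster compares the clustering of the data against simulated reference data and, unlike many criteria, can return a single cluster. The choices of kmeans, K.max = 4 and B = 100 below are arbitrary settings for the sketch:

```r
library(cluster)   # recommended package, shipped with standard R installations

## the example that should arguably not be split at all
x <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771,
       773, 768, 770, 752, 762, 769, 770, 768, 763)

set.seed(42)       # the reference data sets are simulated, so fix the seed
gap <- clusGap(as.matrix(x), FUNcluster = kmeans, nstart = 20,
               K.max = 4, B = 100, verbose = FALSE)

## estimated number of clusters, using the default rule of maxSE()
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_hat
```

If k_hat comes out as 1, that would be an argument for skipping the clustering step for that vector. It is still an estimate from the same data, so treat it as a heuristic rather than a verdict.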

Andy

From: Ralf B
>
> Thank you both for your help; it saved me a lot of time
> searching for the right technique. I have another question
> regarding clustering:
>
> My data set occasionally has only one cluster, meaning that
> clustering is not required in these occasional cases.
>
> Example:
>
> list <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771,
> 773, 768, 770, 752, 762, 769, 770, 768, 763)
>
> Here the data will cluster into two groups (e.g. with kmeans);
> however, there is in fact only one. I might have the wrong
> clustering technique here; is there a method that considers
> more closely the effect size between the groups and can be
> used to decide whether clustering should be done or not?
> This relates to my former question about the statistical test.
>
> Is there a different metric for these clustering techniques
> or is there one clustering technique that uses some form of a
> test that allows me to detect such cases (e.g. only to
> cluster if differences between the groups have large effect
> sizes) and skips otherwise? I have a feeling that what I am
> asking here is probably more likely a pre-processing step...
> any ideas where I could find a technique that allows me to
> find such cases?
>
> Ralf
>
>
>
>
> On Wed, May 5, 2010 at 1:35 PM, Achim Zeileis
> <Achim.Zeileis at uibk.ac.at> wrote:
> > On Wed, 5 May 2010, Ralf B wrote:
> >
> >> Hi R friends,
> >>
> >> I am posting this question even though I know that the nature of it
> >> is closer to general stats than R. Please let me know if you are
> >> aware of a list for general statistical questions:
> >>
> >> I am looking for a simple method to distinguish two groups of data in
> >> a long vector of numbers:
> >>
> >> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
> >>
> >> I would like to 'learn' that 400,340 are different numbers by using a
> >> simple approach.
> >
> > It seems that you want to cluster the data. There are, of course,
> > loads of clustering algorithms around, see e.g.,
> >  http://CRAN.R-project.org/view=Cluster
> >
> > In this simple example a standard hierarchical clustering approach
> > shows you what you're after.
> >
> > ## data
> > list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
> >
> > ## cluster using Ward method for Euclidean distances
> > ## (hclust's "ward" method is called "ward.D" since R 3.1.0)
> > hc <- hclust(dist(list, method = "euclidean"), method = "ward.D")
> > plot(hc)
> > hc
> >
> > ## cut into two clusters
> > split(list, cutree(hc, k = 2))
> >
> > hth,
> > Z
> >
> >> The outcome of processing 'list' should therefore be:
> >>
> >> listA <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,3,2,4,5,6,4,3,6,4,5,3)
> >> listB <- c(400,340)
> >>
> >> I am thinking of a non-parametric test since I have no knowledge of the
> >> underlying distribution. The numbers are time differences between two
> >> actions recorded from the same person over time. Because the data
> >> was obtained from the same person I would naturally tend to use
> >> the Wilcoxon signed-rank test. Any thoughts on that?
> >>
> >> Are there any R packages that would process such a vector and use
> >> non-parametric methods to split or divide groups based on their
> >> values? Could clustering be the answer given that I
> >> always have two groups with a significant difference between the
> >> two?
> >>
> >> Thanks a lot,
> >> Ralf
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >