[R] Using statistical test to distinguish two groups

Wed May 5 22:24:01 CEST 2010

Thank you both for your help; it saved me a lot of time searching for
the right technique. I have another question regarding clustering:

My data set occasionally contains only one cluster, so clustering
is not required in those cases.

Example:

list <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771, 773, 768,
770, 752, 762, 769, 770, 768, 763)

Here the data will be split into two groups (e.g. with kmeans), although
in fact there is only one. I might have the wrong clustering technique
here; is there a method that looks more closely at the effect size
between the groups and can be used to decide whether clustering
should be done at all? This relates to my former question about the
statistical test.
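
As a rough sketch of the effect-size idea above (not an established test): run kmeans with two centers, then compare the distance between the two centers with the pooled within-cluster standard deviation, much like Cohen's d. The function name should_split and the cutoff of 5 are assumptions made for illustration only. Note that a 2-means split of a single normal sample already gives a ratio of roughly 2.6, so a conventional "large effect" cutoff such as 0.8 would always say "split".

```r
# Hypothetical helper: decide whether a 2-cluster split of a numeric
# vector is worth keeping, based on a Cohen's-d-like separation ratio.
should_split <- function(x, cutoff = 5, nstart = 25) {
  # cutoff = 5 is an arbitrary placeholder, not a published threshold
  km <- kmeans(x, centers = 2, nstart = nstart)
  # pooled within-cluster SD, taken from the within-cluster sum of squares
  pooled_sd <- sqrt(km$tot.withinss / (length(x) - 2))
  d <- abs(diff(km$centers[, 1])) / pooled_sd
  list(d = unname(d), split = unname(d > cutoff), cluster = km$cluster)
}

set.seed(1)  # kmeans uses random starting centers

one_group <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771, 773, 768,
               770, 752, 762, 769, 770, 768, 763)
two_groups <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)

res_one <- should_split(one_group)   # separation stays below the cutoff
res_two <- should_split(two_groups)  # separation is far above the cutoff
res_one$split
res_two$split
```

If split is TRUE, split(x, res$cluster) gives the two groups; otherwise the vector is left as it is. The cutoff would have to be tuned on data where the answer is known.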

Is there a different metric for these clustering techniques, or is
there a clustering technique that uses some form of test that
allows me to detect such cases (e.g. cluster only if the differences
between the groups have large effect sizes) and skip clustering
otherwise? I have a feeling that what I am asking for is more likely a
pre-processing step. Any ideas where I could find a technique that
allows me to find such cases?
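
One existing criterion that can answer "no clustering needed" is the gap statistic. It compares the within-cluster dispersion for k = 1, 2, ... with that of reference data generated without cluster structure, and it may select k = 1. The following is a minimal sketch using clusGap() and maxSE() from the cluster package; K.max = 3, B = 200 and the "Tibs2001SEmax" rule are choices made here for illustration, and the result depends on the random reference samples.

```r
library(cluster)  # recommended package, usually installed together with R

x <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771, 773, 768,
       770, 752, 762, 769, 770, 768, 763)

set.seed(1)  # reference data sets and kmeans starts are random
gap <- clusGap(matrix(x, ncol = 1), FUNcluster = kmeans, K.max = 3,
               B = 200, verbose = FALSE, nstart = 20)

# suggested number of clusters; a value of 1 means "do not cluster"
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k_hat
```

Clustering would then be run only when k_hat is larger than 1.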

Ralf

On Wed, May 5, 2010 at 1:35 PM, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
> On Wed, 5 May 2010, Ralf B wrote:
>
>> Hi R friends,
>>
>> I am posting this question even though I know that the nature of it is
>> closer to general stats than R. Please let me know if you are aware of
>> a list for general statistical questions:
>>
>> I am looking for a simple method to distinguish two groups of data in
>> a long vector of numbers:
>>
>> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
>>
>> I would like to 'learn' that 400 and 340 are different numbers by using a
>> simple approach.
>
> It seems that you want to cluster the data. There are, of course, loads of
> clustering algorithms around, see e.g.,
>  http://CRAN.R-project.org/view=Cluster
>
> In this simple example a standard hierarchical clustering approach shows you
> what you're after.
>
> ## data
> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
>
> ## cluster using Ward method for Euclidean distances
> ## (the "ward" method is called "ward.D" since R 3.1.0)
> hc <- hclust(dist(list, method = "euclidean"), method = "ward.D")
> plot(hc)
> hc
>
> ## cut into two clusters
> split(list, cutree(hc, k = 2))
>
> hth,
> Z
>
>> The outcome of processing 'list' should therefore be:
>>
>> listA <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,3,2,4,5,6,4,3,6,4,5,3)
>> listB <- c(400,340)
>>
>> I am thinking of a non-parametric test since I have no knowledge of the
>> underlying distribution. The numbers are time differences between two
>> actions recorded from the same person over time. Because the data
>> was obtained from the same person I would naturally tend to use
>> the Wilcoxon signed-rank test. Any thoughts on that?
>>
>> Are there any R packages that would process such a vector and use
>> non-parametric methods to split or divide groups based on their
>> values? Could clustering be the answer, given that I already know that
>> I always have two groups with a significant difference between the
>> two?
>>
>> Thanks a lot,
>> Ralf
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>