[R] clustering of binary data

David L Carlson dcarlson at tamu.edu
Thu Dec 6 21:54:04 CET 2012


Do not use html in r-help emails. Look below at what happens to your data.
The error message is telling you that t(data) is not numeric. 

> str(data)

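That will tell you what kind of data you have. If any column is character
or a factor, t(data) is a character matrix, and vegdist() cannot use it.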
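
A minimal sketch of the check and one possible fix, assuming the 0/1 columns
were read in as character. The small data frame and the names to_num and
data.num are made up for illustration and are not from the original post.
The vegdist() call only runs if vegan is installed.

```r
# Hypothetical stand-in for the poster's data: 0/1 values stored as text
data <- data.frame(
  variable1 = c("0", "0", "1", "0"),
  variable2 = c("0", "0", "0", "1"),
  variable3 = c("1", "NA", "0", "1"),
  row.names = paste0("case", 1:4),
  stringsAsFactors = FALSE
)

str(data)                   # columns are listed as chr, not num or int
sapply(data, is.numeric)    # FALSE for every column
is.character(t(data))       # t() on such a data frame gives a character matrix

# Convert each column to numeric; as.character() first also handles factors.
# Text that is not a number (such as "NA") becomes a real missing value.
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
data.num <- data
data.num[] <- lapply(data, to_num)

str(data.num)               # columns are now num
is.numeric(t(data.num))     # TRUE

# With numeric input the Jaccard call can proceed (assumes vegan is installed).
# na.rm = TRUE drops missing values pairwise; binary = TRUE treats the data
# as presence/absence.
if (requireNamespace("vegan", quietly = TRUE)) {
  mydistance <- vegan::vegdist(t(data.num), method = "jaccard",
                               binary = TRUE, na.rm = TRUE)
  print(mydistance)
}
```

If str(data) shows factors or character columns in the real data, the import
step is the place to fix it, so the conversion above is only a workaround.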

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of marco milella
> Sent: Thursday, December 06, 2012 12:08 PM
> To: r-help at r-project.org
> Subject: [R] clustering of binary data
> 
> Good morning,
> I am analyzing a dataset composed of 364 subjects and 13 binary variables
> (0,1 = absence,presence).
> I am testing for possible associations (co-presence) among my variables. To
> do this, I tried cluster analysis.
> 
> My main interest is to check the significance of the obtained clusters.
> 
> First, I tried the pvclust() function, using method.hclust="ward"
> and method.dist="binary". Overall it works (clusters and significance
> obtained). However, I'm not convinced by the distance matrix. The
> associations between variables are different from the results obtained in
> PAST by using Ward on a Jaccard matrix (which should be OK for binary data).
> Moreover, when I try to obtain a Jaccard matrix in R from my data, by using
> the vegan package
> 
> mydistance<-vegdist(t(data),method="jaccard")
> 
> I receive the following error message:
> 
> Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric
> 
> 
> Below is a subset of my dataset:
> 
>        variable1 variable2 variable3 variable4 variable5 variable6
> variable7
> variable8 variable9 variable10 variable11 variable12 variable13  case1
> 0 0 0
> 0 0 1 0 0 1 1 0 0 0  case2 0 0 0 0 0 1 0 NA NA 1 0 0 0  case3 0 0 0 0 0
> 1 0
> 0 1 1 0 0 0  case4 1 0 0 0 0 1 0 1 0 1 0 0 0  case5 0 0 0 0 0 1 0 0 1 1
> 0 0
> 0  case6 0 1 0 0 0 1 0 1 0 1 0 0 0  case7 0 1 0 0 0 1 0 0 1 1 0 0 0
> case8 0
> 0 0 0 0 1 0 1 0 1 0 0 0  case9 0 0 0 0 0 1 0 1 0 1 0 0 0  case10 0 0 0
> 0 0 1
> 0 0 1 1 0 0 0  case11 1 0 0 1 0 1 1 1 0 1 0 0 0  case12 0 0 0 1 1 0 1 1
> 0 1
> 0 0 0  .....
> 
> So, my questions are the following: Is the Jaccard index a good strategy
> for my kind of data? Is the binary distance used in pvclust theoretically
> more correct? Is there any alternative to pvclust for testing the
> significance of my clusters?
> 
> Thanks in advance
> Marco
> 
> 	[[alternative HTML version deleted]]



