[R] help understanding hierarchical clustering

David Carlson dcarlson at tamu.edu
Wed May 1 16:16:25 CEST 2013


You need to clarify what you are trying to achieve and fix some errors in
your code. First, thanks for giving us reproducible data. 

Once you have read the file, you seem to be attempting to remove cases with
missing values, but you check for missing values of "count" twice and you
never check "depth_m". The whole line can be replaced with

dd <- na.omit(mat)
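
As a small illustration, here is a sketch on made-up data (the "toy" data frame and its values are hypothetical, not taken from the posted file). It shows that the manual filter and na.omit() differ exactly when a value is missing in a column the filter forgets to test:

```r
# Toy data frame (hypothetical values) with one NA in temp and one in depth_m
toy <- data.frame(
  idcode  = c(16001, 16001, 16002, 16002),
  count   = c(136, 109, 82, 77),
  temp    = c(4.308, NA, 4.312, 4.325),
  sal     = c(32.828, 32.829, 32.832, 32.828),
  depth_m = c(63.46, 63.09, NA, 65.65),
  subs    = c(47, 49, 49, 46)
)

# Manual filter as in the original code: "count" is tested twice and
# "depth_m" is never tested, so the row with a missing depth survives
manual <- toy[!is.na(toy$idcode) & !is.na(toy$temp) & !is.na(toy$sal) &
              !is.na(toy$count) & !is.na(toy$count) & !is.na(toy$subs), ]

# na.omit() drops every row that has an NA in any column
clean <- na.omit(toy)

nrow(manual)   # 3 rows: the NA in depth_m is still there
nrow(clean)    # 2 rows: only complete cases remain
```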

Now you have data with complete cases. In your next step you create a
distance matrix that includes "idcode" as a variable! Although it is
numeric, it is really a categorical variable. That suggests you need to read
up on R and cluster analysis. It is very likely that you want to exclude
this variable from the distance matrix and possibly the "count" variable as
well. 
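
A minimal sketch of leaving the identifier out of the distance calculation, again on made-up values. It uses base R dist() on standardized columns so that it runs without extra packages; that is a swap I am making for illustration, not what the original code did (with vegan loaded, vegdist(env) would give Bray-Curtis distances instead). Which variables to keep is your decision; here both "idcode" and "count" are dropped.

```r
# Toy complete-case data (hypothetical values, same columns as the posted file)
dd <- data.frame(
  idcode  = c(16001, 16001, 16002, 16002),
  count   = c(136, 109, 82, 77),
  temp    = c(4.308, 4.310, 4.312, 4.325),
  sal     = c(32.828, 32.829, 32.832, 32.828),
  depth_m = c(63.46, 63.09, 63.28, 65.65),
  subs    = c(47, 49, 49, 46)
)

# Keep only the environmental variables; idcode is a label, not a measurement
env <- dd[, c("temp", "sal", "depth_m", "subs")]

# Euclidean distance on standardized columns (an assumption for this sketch)
distmat <- dist(scale(env))

# Average-linkage clustering, as in the original post
clusa <- hclust(distmat, method = "average")
```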

What does one row of data represent? You have 8036 complete cases
representing data on 100 species. There are great differences in the number
of rows for each species (idcode) ranging from 1 to 1066. 
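
If the goal is to group species rather than individual records, one possible approach (an assumption on my part about what you want) is to reduce the data to one row per species before clustering. The sketch below uses made-up values, summarizes each species by the mean of its environmental variables, and cuts the tree at an arbitrary, hypothetical height:

```r
# Toy data (hypothetical values): three species with unequal numbers of rows
dd <- data.frame(
  idcode  = c(16001, 16001, 16001, 16002, 16002, 16003),
  temp    = c(4.30, 4.31, 4.32, 4.31, 4.33, 9.80),
  sal     = c(32.82, 32.83, 32.83, 32.83, 32.82, 30.10),
  depth_m = c(63.5, 63.1, 62.5, 63.3, 65.7, 12.0)
)

# Number of rows per species, and the range of those counts
rows_per_species <- table(dd$idcode)
range(rows_per_species)

# One row per species: mean of each environmental variable
sp <- aggregate(cbind(temp, sal, depth_m) ~ idcode, data = dd, FUN = mean)

# Cluster the species, not the individual records
hc <- hclust(dist(scale(sp[, -1])), method = "average")

# Cut the tree at a chosen height (hypothetical) and list idcodes per group
groups <- cutree(hc, h = 1)
split(sp$idcode, groups)
```

The mean ignores how variable each species is and how many rows it has, so a species with 1 row and one with 1066 rows are treated alike; whether that is acceptable depends on what a row represents.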

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of epi
Sent: Tuesday, April 30, 2013 8:06 PM
To: r-help at r-project.org
Subject: [R] help understanding hierarchical clustering

Hi All,

I am having trouble understanding how to use R to generate a hierarchical
clustering. My data are in a CSV file and look like this:

idcode,count,temp,sal,depth_m,subs
16001,136,4.308,32.828,63.46,47
16001,109,4.31,32.829,63.09,49
16001,107,4.302,32.822,62.54,47
16001,87,4.318,32.834,62.54,48
16002,82,4.312,32.832,63.28,49
16002,77,4.325,32.828,65.65,46
16002,77,4.302,32.821,62.36,47
16002,71,4.299,32.832,65.84,37
16002,70,4.302,32.821,62.54,49

where idcode is a species identification number and the other fields are
environmental parameters.

library(vegan)
mat<-read.csv("http://epi.whoi.edu/ipython/results/mdistefano/pg_site1.csv",
header=T)
dd <- mat[!is.na(mat$idcode) &
              !is.na(mat$temp) &
              !is.na(mat$sal) &
              !is.na(mat$count) &
              !is.na(mat$count) &
              !is.na(mat$subs),]
distmat<-vegdist(dd)
clusa<-hclust(distmat,"average")
print(clusa)
	Call:
	hclust(d = distmat, method = "average")
	
	Cluster method   : average 
	Distance         : bray 
	Number of objects: 8036
print(dend1 <- as.dendrogram(clusa))
	'dendrogram' with 2 branches and 8036 members total, at height 0.3194225
dend2 <- cut(dend1, h=0.07)


A complete run with plots is available here:

http://nbviewer.ipython.org/5492912

I'm trying to group together the species (idcodes) that share
similar environmental parameters.

For example, looking at the plots, I should be able to retrieve the list of
idcodes for each branch at a cut level X.

In the example:


X = 0.07

branch1 : [idcodeA, .. .. ,idcodeJ]
..
..
branch6 : [idcodeB, .. .. , idcodeK]



Many thanks for your precious help!!!

Massimo.






