[R] help understanding hierarchical clustering

David Carlson dcarlson at tamu.edu
Thu May 2 17:34:45 CEST 2013

That clears up a great deal. Each row of your data represents the
observation of a particular species on a particular image. You are actually
clustering localities (images) and you want to know what species are
commonly found in localities with similar temp/sal/depth/subs. 

Your current approach clusters multiple rows for the same image, which is
cluttering up the cluster analysis. A more productive approach would be to
create two data tables (or one very wide one), each with one row per
image, as you indicated at the end of your message:

1. Image Name (or a numeric ID)
2. st_x
3. st_y
4. Temp
5. Sal
6. Depth_M
7. Subs
8. Count_id_1
9. Count_id_2
. . . .
N+7. Count_id_n

This would also allow you to compute species diversity and density for each
image that could be added to the table.
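
A minimal sketch of those per-image summaries, using a small made-up count
matrix in place of the real species table. The names counts, shannon and
image.stats are hypothetical, and total abundance stands in for density
because the image area is not part of the data shown here:

```r
# Toy counts: rows are images, columns are species (made-up values).
counts <- matrix(c(5, 0, 5,
                   1, 1, 1,
                   0, 8, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("img1", "img2", "img3"),
                                 c("sp_a", "sp_b", "sp_c")))

# Shannon diversity H = -sum(p * log(p)), computed over non-zero counts.
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

image.stats <- data.frame(
  richness  = rowSums(counts > 0),       # number of species present
  abundance = rowSums(counts),           # total individuals counted
  shannon   = apply(counts, 1, shannon)  # Shannon index per image
)
print(image.stats)
```

The resulting columns could be joined to the image table by image name.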

To get there from your data, you need to create a table of images:

> dd <- na.omit(spmat)
> dd.images <- unique(dd[,3:9])
> nrow(dd.images)
[1] 1763
> length(levels(dd$imagename))
[1] 1710

 So dd.images contains 53 more rows than the number of images! I've
spot-checked this and it seems to be cases where two different "subs" values
were assigned to the same image. To match the image file with the species
file (below), each image needs to be included only once.
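
One way to find and resolve those duplicates is sketched below on a toy
table. The column names and values are made up, and keeping the first row
per image is only one possible rule; you may prefer to inspect the
conflicting "subs" values and choose by hand:

```r
# Toy stand-in for dd[, 3:9]; column names are assumed, values are made up.
img <- data.frame(
  imagename = c("a.jpg", "a.jpg", "b.jpg", "c.jpg"),
  temp      = c(10, 10, 12, 11),
  subs      = c("sand", "mud", "sand", "rock"),
  stringsAsFactors = FALSE
)
img.u <- unique(img)

# Images that still appear more than once after unique()
dups <- names(which(table(img.u$imagename) > 1))
print(img.u[img.u$imagename %in% dups, ])

# One possible rule: keep the first row for each image
img.one <- img.u[!duplicated(img.u$imagename), ]
print(nrow(img.one))
```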

To get a table of species composition:
> dd.species <- xtabs(count~imagename+idcode, dd)
> str(dd.species)
 xtabs [1:1710, 1:20] 0 0 0 0 0 1 0 1 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ imagename: chr [1:1710] "UNQ.20080414.150557936.90579.jpg"
"UNQ.20080414.150600152.90589.jpg" "UNQ.20080414.150602167.90599.jpg"
"UNQ.20080414.150604182.90609.jpg" ...
  ..$ idcode   : chr [1:20] "10008" "10022" "10024" "11010" ...
 - attr(*, "class")= chr [1:2] "xtabs" "table"
 - attr(*, "call")= language xtabs(formula = count ~ imagename + idcode,
data = dd)

With this approach you could use more of the original 100 species in the
analysis or even all of them.

Now you can cluster the images into similar groups and look at the
distribution of species in each cluster. Then use cutree() to produce a
vector of cluster memberships to see which images fall into the same
cluster.
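
A rough sketch of that last step on made-up data. It assumes an image table
with numeric environmental columns only and a species count matrix with rows
in the same image order; the names env, species, groups and by.cluster are
hypothetical:

```r
# Toy image table: 6 images in two well-separated groups (made-up values).
env <- data.frame(temp  = c(5, 5.2, 4.9, 15, 15.3, 14.8),
                  depth = c(200, 210, 195, 20, 25, 18),
                  row.names = paste0("img", 1:6))

# Toy species counts, one row per image in the same order.
species <- matrix(c(3, 0,
                    2, 0,
                    4, 1,
                    0, 5,
                    0, 2,
                    1, 6),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(rownames(env), c("sp_a", "sp_b")))

# Standardize so that depth (large numbers) does not dominate the distances.
hc <- hclust(dist(scale(env)), method = "average")
groups <- cutree(hc, k = 2)   # named vector: image -> cluster number

# Total count of each species within each cluster
by.cluster <- rowsum(species, group = groups)
print(groups)
print(by.cluster)
```

cutree() also accepts h= instead of k= if you would rather cut at a height
on the dendrogram than ask for a fixed number of groups.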

David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

From: epi [mailto:massimodisasha at gmail.com] 
Sent: Wednesday, May 1, 2013 9:32 PM
To: dcarlson at tamu.edu
Cc: r-help at r-project.org
Subject: Re: [R] help understanding hierarchical clustering

Hi David,

thank you so much for helping me!

On 1 May 2013, at 10:16, David Carlson <dcarlson at tamu.edu> wrote:

You need to clarify what you are trying to achieve and fix some errors in
your code. First, thanks for giving us reproducible data. 

I tried to fix the errors, thanks for your advice!

Once you have read the file, you seem to be attempting to remove cases with
missing values, but you check for missing values of "count" twice and you
never check "depth." The whole line can be replaced with

dd <- na.omit(mat)
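
A small illustration on a made-up data frame (mat.toy is hypothetical) that
na.omit() keeps exactly the rows with no missing value in any column, so no
column can be forgotten:

```r
# Toy data with a missing value in a different column on three of four rows.
mat.toy <- data.frame(idcode = c(1, 2, NA, 4),
                      temp   = c(10, NA, 12, 13),
                      depth  = c(5, 6, 7, NA),
                      count  = c(1, 1, 2, 3))

dd.toy <- na.omit(mat.toy)

# Same rows as selecting complete cases explicitly
same <- mat.toy[complete.cases(mat.toy), ]
print(dd.toy)
```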

Now you have data with complete cases. In your next step you create a
distance matrix that includes "idcode" as a variable! Although it is
numeric, it is really a categorical variable. That suggests you need to read
up on R and cluster analysis. It is very likely that you want to exclude
this variable from the distance matrix and possibly the "count" variable as

 i excluded idcode and count from the distance matrix

What does one row of data represent? You have 8036 complete cases
representing data on 100 species. There are great differences in the number
of rows for each species (idcode) ranging from 1 to 1066. 

I'm trying to clean up the data: I removed all the records for any species
("idcode") that is found fewer than 100 times.
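
The filtering described here can be sketched as follows on made-up data. The
threshold is lowered to 2 so the toy example has something to drop, and the
names dd.toy, min.n and dd.common are hypothetical:

```r
# Toy records: species 1 occurs 3 times, species 2 once, species 3 twice.
dd.toy <- data.frame(idcode = c(1, 1, 1, 2, 3, 3),
                     temp   = c(10, 11, 10, 5, 7, 8))

# Number of rows per species, and the spread of those counts
n.per.species <- table(dd.toy$idcode)
print(range(n.per.species))

min.n <- 2   # the thread uses 100
keep <- names(n.per.species)[n.per.species >= min.n]
dd.common <- dd.toy[dd.toy$idcode %in% keep, ]
print(dd.common)
```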

I uploaded a new link to the new-data and code [1]

Is this correct?
Can I go further and try to understand which species are assigned to each
branch of the dendrogram at a specified "cut-level"?

thanks All for any further help!


[1] http://nbviewer.ipython.org/5499800

David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of epi
Sent: Tuesday, April 30, 2013 8:06 PM
To: r-help at r-project.org
Subject: [R] help understanding hierarchical clustering

Hi All,

I'm having trouble understanding how to use R to generate a hierarchical
clustering. My data are in a CSV file and look like this:


where idcode is a species identification number and the other fields are
environmental parameters.

dd <- mat[!is.na(mat$idcode) &
             !is.na(mat$temp) &
             !is.na(mat$sal) &
             !is.na(mat$count) &
             !is.na(mat$count) &
	hclust(d = distmat, method = "average")
	Cluster method   : average 
	Distance         : bray 
	Number of objects: 8036
print(dend1 <- as.dendrogram(clusa))
	'dendrogram' with 2 branches and 8036 members total, at height
dend2 <- cut(dend1, h=0.07)

a complete run with plots is available here :  


I'm trying to group together the species (idcodes) that share
similar environmental parameters.

Looking at the plots, I should be able to retrieve the list of idcodes
for each branch at "cut-level" X.

In the example:

X = 0.07

branch 1 : [idcodeA, .. .. , idcodeJ]
branch 6 : [idcodeB, .. .. , idcodeK]

Many thanks for your precious help!!!


