[R] Clustering large data matrix

Christian Hennig chrish at stats.ucl.ac.uk
Thu Mar 6 13:18:34 CET 2008


Hi there,

Whether clara is a proper way of clustering depends strongly on what your data 
are and particularly on what interpretation or use you want for your 
clustering. You may do better with a hierarchical method after having defined a 
proper distance (however, this would rather go into statistical consultation and 
not just R help).
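
Just to illustrate the mechanics of the hierarchical route, here is a minimal
sketch. The data are simulated stand-ins for your 68 spectra, and both the
Euclidean distance and the average linkage are placeholder assumptions, not
recommendations; which distance is "proper" depends on your application.

```r
# Simulated stand-in for the spectra matrix (68 patients, fewer columns
# than the real 13112 to keep the example fast). Not real data.
set.seed(1)
x <- matrix(rnorm(68 * 200), nrow = 68)
x[1:34, 1:20] <- x[1:34, 1:20] + 3   # plant two artificial groups

# Assumption: Euclidean distance between rows. Replace this with whatever
# distance is meaningful for the spectra.
d <- dist(x)

# Assumption: average linkage; other choices are "complete", "ward.D2", ...
hc <- hclust(d, method = "average")

# Cut the tree into a chosen number of clusters and inspect the sizes.
groups <- cutree(hc, k = 2)
table(groups)

# The dendrogram itself is often the most informative output.
plot(hc, labels = FALSE)
```

Looking at the dendrogram before fixing k is usually more helpful than
committing to a number of clusters in advance.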

Assuming that you use some reasonable dimension reduction and clustering
method, you may get a good visualization of your clustering using the methods 
available via the functions plotcluster/discrproj in package fpc.
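
A minimal sketch of how this could look, following the steps you describe
(prcomp, then clara on the first 13 components). The data are simulated, k = 3
is an arbitrary assumption, and the plot is only drawn if fpc is installed.
By default plotcluster projects onto discriminant coordinates, which are chosen
to separate the given clusters rather than to show maximal variance.

```r
library(cluster)   # provides clara()

# Simulated stand-in for the spectra matrix. Not real data.
set.seed(1)
x <- matrix(rnorm(68 * 200), nrow = 68)
x[1:23, 1:20]   <- x[1:23, 1:20] + 3    # artificial group 1
x[24:46, 21:40] <- x[24:46, 21:40] + 3  # artificial group 2

# Dimension reduction: keep the first 13 principal component scores.
pc <- prcomp(x)
scores <- pc$x[, 1:13]

# Assumption: k = 3 clusters, chosen only for this illustration.
cl <- clara(scores, k = 3)

# Project all 13 retained components onto the directions that best
# separate the clusters (default method "dc", discriminant coordinates).
if (requireNamespace("fpc", quietly = TRUE)) {
  fpc::plotcluster(scores, cl$clustering)
}
```

Keep in mind that such a projection depends on the clustering itself, so it
shows how well the clusters separate, not whether they are meaningful.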

Best,
Christian

On Thu, 6 Mar 2008, Dani Valverde wrote:

> Hello,
> I have a large data matrix (68x13112), each row corresponding to one
> observation (patients) and each column corresponding to the variables
> (points within an NMR spectrum). I would like to carry out some kind of
> clustering on these data to see how many clusters there are. I have
> tried the function clara() from the package cluster. If I use the matrix
> as is, I can perform the clara analysis but when I call clusplot() I get
> this error:
>
> Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
> 'princomp' can only be used with more units than variables
>
> Then, I reduce the dimensionality by using the function prcomp(). Then I
> take the first 13 principal components (more than 80% of the variability)
> and I carry out the clara() analysis again. Then, I call the clusplot()
> function again and voilà!, it works. The problem is that clusplot() only
> represents the first two components of my prcomp() analysis, which
> represent only 15% of the variability.
> So, my questions are 1) Is clara() a proper way to analyze such a large
> data set? and 2) Is there an appropriate method for graphic plotting of
> my data that takes into account the whole variability of my data, not
> just two principal components?
> Many thanks.
> Best,
>
> Dani
>
> -- 
> Daniel Valverde Saubí
>
> Grup de Biologia Molecular de Llevats
> Facultat de Veterinària de la Universitat Autònoma de Barcelona
> Edifici V, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
>
> Centro de Investigación Biomédica en Red
> en Bioingeniería, Biomateriales y
> Nanomedicina (CIBER-BBN)
>
> Grup d'Aplicacions Biomèdiques de la RMN
> Facultat de Biociències
> Universitat Autònoma de Barcelona
> Edifici Cs, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
> +34 93 5814126

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

