[R] R and Clusters

Gabor Csardi csardi at rmki.kfki.hu
Mon Jan 7 17:37:11 CET 2008


On Mon, Jan 07, 2008 at 05:25:44PM +0100, Lorenzo Isella wrote:
> Thanks for both the replies.
> I am now giving a try to the suggestion by Gabor since it looks easier
> (for me) to implement.
> I am testing it, but so far it does what I have in mind.
> I am going now through the documentation of the igraph package. I can
> count the cluster number, but I also want to make sure that I can
> retrieve the info about which components are in which cluster i.e.
> find out the cluster composition.

That's the 'clusters' function.

> And second question: at the moment, my simulations involve a
> relatively low particle number (of the order of the hundred), but this
> may increase by 1-2 orders of magnitude.
> Which alternative methods to the distance matrix can I use then to
> find the spacing between my particles?

Of course you still need to find out all the distances, but not 
necessarily for all the particles. One way is to sort the particles 
according to (say) the first coordinate and then only check pairs which are 
close enough according to that coordinate, as this is required to 
be close enough in Euclidean distance. E.g. if the sorted coordinates 
are 

x1
x2
x3
x4

then for finding the pairs of particle 1 you only need to check 
particle 2 if x2-x1<delta, etc. the idea is that as soon as the difference
is larger than delta you stop. How well the method works depends of 
the distribution of the particles. For 1000 particle i would guess that the 
regular matrix method works, but probably not for 10000 particles.

Btw. igraph contains C code for doing the sorted distance check trick,
in the igraph_grg_game function in games.c, it can be trivially modified 
to use supplied x and y vectors instead of uniform randomly generated ones.
If you are happy with particles in 2D.

Gabor

> Cheers
>
> Lorenzo
> 
> On 07/01/2008, Gabor Csardi <csardi at rmki.kfki.hu> wrote:
> > Lorenzo, why can't you actually generate the graph to find the
> > connection components? With the 'igraph' package this is something like:
> >
> > g <- graph.adjacency( DIST < 0.5, mode="undirected" )
> > g <- simplify(g)
> > no.clusters(g)
> >
> > assuming you have your distance matrix in 'DIST'. If N is too big
> > then you don't really want to create the distance matrix, but use
> > a more sophisticated approach to find the points that are close
> > to each other than your threshold; then create the graph.
> > So why using clustering algorithms?
> >
> > Btw. from your vector of points you can create the distance matrix
> > by using the 'outer' function.
> >
> > Btw2. your algorithm for graph generation is similar to geometric
> > random graphs, see function 'grg.game' in 'igraph'.
> >
> > Gabor
> >
> > On Mon, Jan 07, 2008 at 03:26:57PM +0100, Lorenzo Isella wrote:
> > > Dear All,
> > > I hope I am not asking a FAQ. I am dealing with a problem of graph
> > > theory [connected components in a non-directed graph] and I do not
> > > want to rediscover the wheel.
> > > I saw a large number of R packages dealing for instance with the
> > > k-means method or hierarchical clustering for spatially distributed
> > > data and I am basically facing a similar problem.
> > > I am given a set of data which are the positions of particles in 3
> > > dimensions; I define two particles A and B to be directly connected if
> > > their Euclidean distance is below a certain threshold d. If A and B
> > > are directly connected and B and C are directly connected, then A,B
> > > and C are connected components (physically it means that they are
> > > members of the same cluster).
> > > All my N particles then split into k disjointed clusters, each with a
> > > certain number of connected components, and this is what I want to
> > > investigate.
> > > I do not know a priori how many clusters I have (this is my problem
> > > with e.g. k-means since k is an output for me); the only input is the
> > > set of 3-dimensional particle positions and a threshold distance.
> > > The algorithm/package I am looking should return the number of
> > > clusters and the composition of each cluster, e.g. the fact that the
> > > second cluster is made up of particles {R,T,L}.
> > > Consider for instance:
> > >
> > > # a 2-dimensional example
> > > x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
> > >            matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
> > > colnames(x) <- c("x", "y")
> > >
> > > How can I then find out how many connected components I have when my
> > > threshold distance is d=0.5?
> > >
> > > Many thanks
> > >
> > > Lorenzo
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Csardi Gabor <csardi at rmki.kfki.hu>    MTA RMKI, ELTE TTK
> >

-- 
Csardi Gabor <csardi at rmki.kfki.hu>    MTA RMKI, ELTE TTK




More information about the R-help mailing list