[R] distance metrics

Jari Oksanen jarioksa at sun3.oulu.fi
Tue Mar 13 09:00:56 CET 2007


On Tue Mar 13 00:21:22 CET 2007 Gavin Simpson wrote:
> On Mon, 2007-03-12 at 16:02 -0700, Sender wrote:
> > Thanks for the suggestion Christian. I'm trying to avoid expanding the dist
> > object to a matrix, since I'm usually working with microarray data which
> > produces a distance matrix of size 5000 x 5000.
> > 
> > If I can keep it in its condensed form I think it will speed things up.
> > 
> > Is my thinking correct?
> 
> That will all depend on what you want to do with it...
> 
> A dist object of that size is c. 100 MB in memory, and c. 200 MB in size
> as the full dissimilarity matrix - values from object.size(). Of course,
> you'll need a reasonable amount of free memory over and above this to do
> anything useful with the matrix as copies may be required during
> analysis/processing etc.
> 
> Of course, a dist object is just a vector of observed distances with
> various attributes, so one can always use "[" for vectors, but I imagine
> that anything other than trivial operations will become fiddly,
> complicated and time consuming - if you have the memory, give the
> as.matrix option a try and see how it works for your specific problems.
> 
One such piece of fiddling could be a function that returns the index of a
given pair of observations in the dist vector:
 
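idx <- function(i, j, Size)
{
  # order the pair so that a < b (a dist object stores the lower triangle)
  a <- min(i, j)
  b <- max(i, j)
  # elements in the columns before a, plus the offset within column a
  Size*(a-1) - a*(a-1)/2 + b - a
}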

where i and j are the desired matrix indices and Size is the number of
observations, or the attribute "Size" of a 'dist' object. (The function
does no checking: it will silently return a wrong index if i==j or
any(c(i,j) > Size), and it does not guard against other potential abuse.)

If dis is your 'dist' object from 5000 observations, you can refer to an
individual distance as:

dis[idx(2417, 1105, 5000)]

This is slower than indexing a matrix, of course, but avoids expanding the
dist object to a full matrix.
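
As a quick sanity check of the index formula above, here is a minimal sketch
that compares it against as.matrix() on a small toy data set, where expanding
to a matrix is harmless. The data, the seed and the names x, m, n and ok are
made up for illustration only; idx() is repeated so that the snippet runs on
its own.

```r
# idx() as given above, repeated to keep the snippet self-contained
idx <- function(i, j, Size)
{
  a <- min(i, j)
  b <- max(i, j)
  Size*(a-1) - a*(a-1)/2 + b - a
}

set.seed(1)                          # arbitrary seed, toy data only
x <- matrix(rnorm(6 * 3), nrow = 6)  # 6 made-up observations, 3 variables
dis <- dist(x)
m <- as.matrix(dis)                  # fine here, the example is tiny
n <- attr(dis, "Size")

# compare every off-diagonal pair, in both argument orders
ok <- TRUE
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    same <- isTRUE(all.equal(dis[idx(i, j, n)], m[i, j])) &&
            idx(i, j, n) == idx(j, i, n)
    if (!same) ok <- FALSE
  }
}
ok   # TRUE if the formula agrees with as.matrix() for all pairs
```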

Perhaps a nicer and easier to use (but more opaque) way is to write the
function as:

getidx <- function(dist, i, j) 
{
    dist[idx(i, j, attr(dist, "Size"))]
}

which can be used with fewer bracket types: getidx(dis, 2417, 1105).
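
If you need many distances at once, one possible variation (not part of the
functions above, just a sketch) is to replace min/max with pmin/pmax so that
i and j can be vectors, and to add the argument checks that idx() leaves out.
The name idxv and the toy data are hypothetical.

```r
# Hypothetical vectorised variant of idx(); i and j are vectors of equal length
idxv <- function(i, j, Size)
{
  # refuse the cases where the plain formula would give a wrong index
  stopifnot(all(i != j), all(i >= 1), all(j >= 1),
            all(i <= Size), all(j <= Size))
  a <- pmin(i, j)   # elementwise smaller index
  b <- pmax(i, j)   # elementwise larger index
  Size*(a-1) - a*(a-1)/2 + b - a
}

# toy data: ten points 1, 2, ..., 10 on a line, so the distance is |i - j|
dis <- dist(matrix(1:10, ncol = 1))
res <- dis[idxv(c(2, 5, 10), c(1, 9, 3), attr(dis, "Size"))]
res   # 1 4 7
```

The checks cost a little time on each call, so drop them if you are sure of
your indices.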

cheers, jari oksanen


