[R] row by row similarity

Sun Apr 6 21:27:47 CEST 2008

Hi Grant,

Grant Gillis wrote:
> My problem is:
> 
> I have a data set for individuals (rows) and values for behaviours
> (columns).  I would like to know the proportion of shared behaviours for all
> possible pairs of individuals.  The sum of shared behaviours divided by the
> total.  There are zeros in the data that I would like treated as the
> behaviour does not exist.
> 
> example data format:
> 
> ind    B1  B2  B3  B4  B5  B6
> w       2    1    5    3    4    4
> x       1    2    3    4    5    6
> y       1    3    5    2    7    6
> z       2    3    2    4    2    6

I hope I understand correctly that the numbers label different
behaviours, hence e.g. individuals 'y' and 'z' have the same level of
behaviour, namely level '3', for the behaviour B2. You may want to look
at R's 'factor's, which allow you to give the levels descriptive names
instead of just numbers.
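
As a small sketch of what I mean by factors: the level names below are made up by me purely for illustration (I do not know what your behaviours are called), so please substitute your own.

```r
# Hypothetical labels for the three levels of behaviour B2;
# "rest", "forage" and "groom" are placeholders, not taken from your data.
b2 <- factor( c(1, 2, 3, 3), levels = 1:3,
              labels = c("rest", "forage", "groom") )

b2              # prints the descriptive names instead of 1, 2, 3
b2[3] == b2[4]  # individuals 'y' and 'z' share the same level
```

Comparisons with '==' work on factors just as they do on plain numbers.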

Let us first make a dataframe out of your example:

t <- data.frame(
    B1 = c(2,1,1,2),
    B2 = c(1,2,3,3),
    B3 = c(5,2,5,3),
    B4 = c(3,4,2,4),
    B5 = c(4,5,7,2),
    B6 = c(4,6,6,6) )
rownames(t) <- c("w","x","y","z")

> t
   B1 B2 B3 B4 B5 B6
w  2  1  5  3  4  4
x  1  2  2  4  5  6
y  1  3  5  2  7  6
z  2  3  3  4  2  6

If you now test two rows for equality, this happens element-wise:

> t["w",] == t["y",]
      B1    B2   B3    B4    B5    B6
w FALSE FALSE TRUE FALSE FALSE FALSE

You can call 'sum' on this output to get the number of TRUE values.

> sum( t["w",] == t["y",] )
[1] 1

As you want to do this with all pairings, we need a nested 'sapply':

> sapply( rownames(t), function(ind1)
+    sapply( rownames(t), function(ind2)
+       sum( t[ind1,] == t[ind2,] ) ) )
   w x y z
w 6 0 1 1
x 0 6 2 2
y 1 2 6 2
z 1 2 2 6

This table now contains the desired information. Of course, you have to
divide by the number of behaviours, i.e. by 6, and the format is a bit
different from your suggestion, but I hope that does not matter.

> Desired output:
> 
> w  x   0
> w  y   0.166667
> w  z   0
> x   y   0.33333
> x   z   0.33333
> etc.
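
If you do want this long format, one possible way (a sketch, not the only one) is to enumerate the pairs with 'combn'. I rebuild the example data under the name 'beh' here only so that the snippet runs on its own; the column names 'ind1', 'ind2' and 'shared' are my own choice.

```r
# Same example values as in the data frame above
beh <- data.frame(
    B1 = c(2,1,1,2),
    B2 = c(1,2,3,3),
    B3 = c(5,2,5,3),
    B4 = c(3,4,2,4),
    B5 = c(4,5,7,2),
    B6 = c(4,6,6,6),
    row.names = c("w","x","y","z") )

# All unordered pairs of individuals, one pair per column
prs <- combn( rownames(beh), 2 )

# For each pair: number of identical behaviours divided by the total
shared <- apply( prs, 2, function(p)
    sum( beh[p[1],] == beh[p[2],] ) / ncol(beh) )

result <- data.frame( ind1 = prs[1,], ind2 = prs[2,], shared = shared )
result
```

Note that with these example values 'w' and 'z' agree on B1, so this gives 1/6 for that pair rather than 0.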

To deal with the missing behaviour, it is better to use 'NA' instead of
0. R can then help you with it, as it treats NAs, i.e. values marked as
missing, in a special way.

Assume, for example, that you compare the rows

> r1 <- c( 2, 3, NA, 1, 5 )
> r2 <- c( 1, 3, 4, NA, 4 )

Calling '==' as above on such data yields:

> r1==r2
[1] FALSE  TRUE    NA    NA FALSE

As you can see, the result is NA wherever a behaviour is missing in
either row, because such values cannot be compared. To get the number of
TRUE values, use

> sum( r1==r2, na.rm=TRUE )
[1] 1

And to get the number of comparable observations, i.e. those without NA,
use e.g.

> length( na.omit( r1==r2 ) )
[1] 3
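
Putting the two pieces together might look like the sketch below. I am assuming here that you want to divide by the number of comparable behaviours rather than by the total; if not, change the denominator. The helper name 'shared_prop' and the small data frame 'raw' are made up for illustration.

```r
# Hypothetical helper: shared behaviours divided by the number of
# behaviours that are present (not NA) in both rows
shared_prop <- function(a, b) {
    eq <- a == b
    sum( eq, na.rm=TRUE ) / sum( !is.na(eq) )
}

r1 <- c( 2, 3, NA, 1, 5 )
r2 <- c( 1, 3, 4, NA, 4 )
shared_prop( r1, r2 )   # 1 shared out of 3 comparable

# If missing behaviours are still coded as 0, recode them to NA first.
# 'raw' holds made-up values, not your data.
raw <- data.frame( B1 = c(2,0,1), B2 = c(1,2,0), B3 = c(5,5,5),
                   row.names = c("a","b","c") )
raw[ raw == 0 ] <- NA

# Same nested 'sapply' as before, now with the NA-aware helper
m <- sapply( rownames(raw), function(ind1)
    sapply( rownames(raw), function(ind2)
        shared_prop( unlist(raw[ind1,]), unlist(raw[ind2,]) ) ) )
m
```

If two individuals have no comparable behaviour at all, the division is 0/0 and the result is NaN, which you may want to handle separately.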

I hope this helps you to work out your own solution. Otherwise, ask again.

Best
   Simon

+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| preferred (permanent) e-mail: sanders at fs.tum.de