[R] speed and looping issues; calculations on big datasets

Sun Jul 1 01:32:43 CEST 2007

dear r users,

i'm a little stuck with the following problem(s), hopefully somebody  
can offer some help:

i have data organized in a binary matrix, which can become quite big  
like 60 rows x 10^5 columns (they represent SNP genotypes, for some  
background info). what i need to do is the following:

let's suppose i have a matrix of size n x m. each of the m columns is
taken in turn as the "core". for each core, i want the counts of the
distinct row patterns obtained by extending from the core one column
at a time, separately for the rows with 0 and with 1 at the core, and
in both directions. maybe better explained with a little example.

data:

00 0 010
10 1 001
11 1 011
10 0 011
10 0 010

so the extended unique rows & counts taking e.g. column 3 as "core" are:

col 3 = 0:
right:
patterns / counts
00 / 3
001 / 3
0010, 0011 / 2,1

left (patterns written from the core outwards):
00 / 3
000,001 / 1,2
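
for concreteness, the counts above can be reproduced with a few lines of R. this is only a minimal sketch: the names `m`, `sub` and `pattern_counts` are made up for illustration, the patterns are pasted core first (as in the listing above), and it assumes the core is neither the first nor the last column.

```r
# toy data from the example: 5 sequences x 6 sites
m <- matrix(c(0,0,0,0,1,0,
              1,0,1,0,0,1,
              1,1,1,0,1,1,
              1,0,0,0,1,1,
              1,0,0,0,1,0),
            nrow = 5, byrow = TRUE)

core <- 3                                    # core column
sub  <- m[m[, core] == 0, , drop = FALSE]    # rows with 0 at the core

# hypothetical helper: counts of the distinct patterns over the given
# columns, pasted in the order the columns are listed (core first)
pattern_counts <- function(x, cols) {
  table(do.call(paste0, as.data.frame(x[, cols, drop = FALSE])))
}

# one table per extension step, to the right and to the left of the core
right <- lapply((core + 1):ncol(sub), function(j) pattern_counts(sub, core:j))
left  <- lapply((core - 1):1,         function(j) pattern_counts(sub, core:j))
```

`right[[3]]` then holds the counts for the patterns 0010 and 0011, and `left[[2]]` those for 000 and 001.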

and the same for the other subset (col 3 = 1), then the whole thing
again with the next column as "core". the reason i need these counts
is that i want to calculate the frequencies of the different extended
sequences, and from them the probability of drawing two identical
sequences (from the core up to a given extended position) from the
whole set of sequences.
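
to make the last step explicit, here is a small sketch of how such a probability could be computed from one set of counts. it rests on assumptions not fixed above: the two sequences are drawn without replacement, and the denominator is the subset that carries the given core value (using all n rows instead would only change `n`). the variable names are made up.

```r
# counts of the distinct extended patterns, e.g. 0010 x 2 and 0011 x 1
counts <- c(2, 1)
n <- sum(counts)

# chance that two sequences drawn without replacement are identical:
# number of identical pairs divided by the number of all pairs
p_identical <- sum(choose(counts, 2)) / choose(n, 2)

# drawing with replacement instead gives the sum of squared frequencies
p_identical_repl <- sum((counts / n)^2)
```

for the counts 2 and 1 there is one identical pair out of three possible pairs.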

my main problem is the speed of the calculations. i tried different
ways suggested here on the list of getting the counts of the unique
rows, all of them using the "table" function. both
table( do.call( paste, c( as.data.frame( mymatrix ) ) ) ) and
table( apply( mymatrix, 1, paste, collapse = "" ) ) work fine, but are
too slow for the bigger matrices that i want to analyse (at least in
my not very sophisticated function). then i found a great suggestion
here to do a matrix multiplication with the vector 2^(0:(ncol-1)) to
convert each row into a decimal number, and to run table on those.
this speeds things up quite nicely, but of course it stops working
once i extend over more than about 53 columns (doubles carry 53 bits
of precision), because the numbers get too large to accurately
distinguish between a 0 and a 1 at the smallest digit:

 > 2^60+2 == 2^60
[1] TRUE
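
the two ways of counting, and the point where the numeric encoding breaks down, can be put side by side as follows. a rough sketch on made-up random data; `x`, `tab_paste` and `tab_code` are illustrative names only.

```r
set.seed(1)
x <- matrix(rbinom(5 * 10, 1, 0.5), nrow = 5)   # 5 rows, 10 binary columns

# slow but safe: one string per row
tab_paste <- table(do.call(paste0, as.data.frame(x)))

# faster: one number per row; note the parentheses in 0:(ncol(x) - 1),
# since 0:ncol(x) - 1 would be evaluated as (0:ncol(x)) - 1
code <- as.vector(x %*% 2^(0:(ncol(x) - 1)))
tab_code <- table(code)

# both approaches should give the same set of counts
same_counts <- identical(sort(as.vector(tab_paste)),
                         sort(as.vector(tab_code)))

# doubles hold integers exactly only up to 2^53
2^53 + 1 == 2^53      # TRUE: the lowest bit is already lost
2^52 + 1 == 2^52      # FALSE
```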

another thing is that so far i could not come up with a way to do
this without the loops i am using (or find out whether that is
possible at all): one large loop over each column in turn as core,
and then another loop within that which extends the rows column by
column. since i am not the best of programmers (and still quite new
to R), i was hoping that somebody has some advice on doing these
calculations in a more elegant and, more importantly, faster way.
just to give an idea, the approach with the matrix multiplication
takes 20s for a 60 x 220 matrix on my macbook pro, which is obviously
not perfect, considering i would like to use this function for
matrices of size 10^2 x 10^5 or even more.
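
one possible direction, written down only as an untimed sketch and not as something taken from the list: instead of rebuilding the whole pattern at every extension step, keep one small integer group label per row and refine it with each new column. the labels never exceed the number of rows, so the precision problem does not arise. `extend_counts` and its arguments are hypothetical names, and the inner loop is still there.

```r
# toy data from the example above: 5 sequences x 6 sites
m <- matrix(c(0,0,0,0,1,0,
              1,0,1,0,0,1,
              1,1,1,0,1,1,
              1,0,0,0,1,1,
              1,0,0,0,1,0),
            nrow = 5, byrow = TRUE)

# hypothetical helper: for one core column and one core value, extend
# over `cols` one column at a time and return the counts at each step
extend_counts <- function(m, core, value, cols) {
  rows <- which(m[, core] == value)
  id   <- rep(1L, length(rows))        # all selected rows share the core value
  out  <- vector("list", length(cols))
  for (k in seq_along(cols)) {
    # combine the old group label with the new 0/1 site
    key <- 2L * id + as.integer(m[rows, cols[k]])
    id  <- match(key, unique(key))     # relabel groups as 1, 2, 3, ...
    out[[k]] <- tabulate(id)           # counts per group, no strings needed
  }
  out
}

right <- extend_counts(m, core = 3, value = 0, cols = 4:6)
left  <- extend_counts(m, core = 3, value = 0, cols = 2:1)
```

the counts come out in order of first appearance of each pattern, so `right[[3]]` and `left[[2]]` correspond to the last lines of the two listings in the example.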

so i would be very thankful for any ideas, suggestions etc to improve  
this

cheers
martin