[R] calculating DNA mismatch distributions for large populations

Allan Strand stranda at cofc.edu
Sat Oct 6 19:45:28 CEST 2001

Hi all,

I am interested in calculating and displaying the distributions of
pairwise comparisons between DNA sequences in populations.  The
comparisons are the number of nucleotide sites that differ between the
two sequences (mismatches).  My sequences are stored in a vector of
strings.  There is an additional vector of the same length that
provides the indices to the DNA sequences.  Finally, I have output
from table() that gives the distribution of indices in the population.
It looks something like this:

seqindex    1,2,3,4,5
sequences  "ATG","ATC","AAT","CTT","GTC"

table() output of
seqindex frequencies:
  1   2   3    4    5
  6   10  5    1    7

The sequences are usually around 100 characters long (rather than 3)
and there can be several hundred different unique sequences and the
freqencies of sequence indices may total many thousands (often

Right now, I am trying  the following approach:
1) I create a 'distance matrix' of mismatches.  I have been calculating
   the mismatches using the following approach (states() creates a
   list of two vectors: sl$aindex==indices and sl$state==sequences)

            for (i in 1:length(sl[[1]]))
              for (j in i:length(sl[[1]]))
                  if (i!=j)
   this works pretty well, although it is not blindingly fast.

2) I have then tried this approach to get the distribution of
   mismatches for the entire population: 

   I have tried to concatenate X copies of rows of rmat
   corresponding to a particular sequence index.  I'm using
   "rep(rmat[seqindex,],X)". Here, X is the the frequency of that
   sequence index (and sequence) in the population.  I try then to
   repeat this for every sequence index present in the pop.  My
   plan was to then run table() on the resulting vector.  As I
   suspected, trying this with real data brought my 450Mhz/128Meg
   RAM box to its knees and I never had the patience to wait for the
   finished product.  This effect is due to large amounts of memory

My question is: What is a more efficient (in terms of memory and/or
speed) way to solve this problem.  Clearly, this partially brain-dead
biologist has not found it.  I think this question really comes down
to what is the efficient way to solve part 2.  One possibility that
occurred to me is to use table() on each row of rmat, then multiply
the resulting table by the number of times that the sequence is found
in the population.  I'm not clear on how to combine the resulting
tables for all sequences however.  Nor am I clear on whether this is
an improvement on my first approach.  Certainly, it would seem that
this could improve memory usage.

Thanks for any suggestions,

platform i386-unknown-freebsd4.4
arch     i386                   
os       freebsd4.4             
system   i386, freebsd4.4       
major    1                      
minor    3.1                    
year     2001                   
month    08                     
day      31                     
language R

Allan Strand,   Biology    http://linum.cofc.edu
College of Charleston      Ph. (843) 953-8085
Charleston, SC 29424       Fax (843) 953-5453

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch

More information about the R-help mailing list