[BioC] Count differences between sequences

Martin Morgan mtmorgan at fhcrc.org
Fri Mar 26 14:50:15 CET 2010


On 03/25/2010 03:22 PM, Patrick Aboyoun wrote:
> Erik,
> Judging from your data, I would gather that you are not interested in 
> indels. Is that correct? You should look at the neditStartingAt 
> function. Something like the following may meet your needs:
> 
> N <- length(myStrings)
> myDists <- matrix(0, nrow = N, ncol = N)
> for (i in 1:(N-1))
>      for (j in (i+1):N)
>          myDists[i, j] <- myDists[j, i] <- 
> neditStartingAt(myStrings[[i]], myStrings[[j]])

Here's a psuedo dataset

  library(Biostrings)
  library(drosophila2probe)

  ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"]
  ch <- unlist(strsplit(ch0, ""))
  m0 <- matrix(ch, 500)
  dna <- DNAStringSet(apply(m0, 1, paste, collapse=""))

Suppose we take dna and convert it to a matrix m of M nucleotides x N
sequences. The distance between one sequence (column) of the matrix, say
elt = m[,1], and all the other sequences is

  func <- function(elt, m) colSums(elt != m)

so an inefficient (polynomial time, N^2) algorithm that works ok for
'dna' might be

dnaDist0 <-
    function(dna, func=function(elt, m) colSums(elt != m))
{

    l <- strsplit(as.character(dna), "")
    m <- matrix(unlist(l), unique(width(dna)))
    res0 <- lapply(l, func, m)
    matrix(unlist(res0), ncol=length(res0))
}

so

> system.time(res <- dnaDist0(dna))
   user  system elapsed
 68.624   0.048  68.828

It parallelizes very nicely on linux machines by preceding the above with

> library(multicore)
> lapply <- mclapply
> system.time(res <- dnaDist0(dna))
   user  system elapsed
 35.394   0.324  39.566

A generalization is to suppose a matrix of edit distances, e.g.,

  dist <- 1 - nucleotideSubstitutionMatrix()

and to use indicies into the vector representation of dist (here dist is
a 15 x 15 matrix; it is also a length 255 (= 15 * 15) vector, with
elements 1:15 the first column of dist, 16:30 the second column, etc.)
instead of elt != m as the distance metric. This variant is

  func1 <- function(elt, m, d, dim) {
      x <- d[m + elt]; dim(x) <- dim
      colSums(x)
  }

when the letters of the DNAStringSet in elt, m have been replaced by
their row / column indexes as

dnaDist1 <-
    function(dna, dist, func)
{
    l <- lapply(strsplit(as.character(dna), ""), match, rownames(dist))
    m <- (unlist(l) - 1L) * nrow(dist)
    dim <- c(unique(width(dna)), length(dna))
    res0 <- lapply(l, func, m, dist, dim)
    matrix(unlist(res0), ncol=length(res0))
}

dnaDist1(dna, dist, func1) is quite a bit slower (most of the time is in
func1), taking about 300s as a single thread on my laptop. Memory
consumption is about the same.

As mentioned, the algorithm is inefficient and polynomial in the number
of sequences, so if the problem had been transposed, say 8000 sequences
of 500 nucleotides each, the timing would have been unbearably slow. The
'obvious' optimization of only calculating the lower (or upper) triangle
of distances might cut the time in half, but that doesn't change the
polynomial time. I suspect there is a N * log(N) algorithm that could be
implemented in R, and that perhaps there is something clever with PDict.

Martin

> 
> 
> Patrick
> 
> 
> On 3/25/10 2:57 PM, erikwright at comcast.net wrote:
>> I have 500 DNAStrings, all of length 8000.  I need the entire N x N 
>> distance matrix.
>>
>> Thanks,
>> Erik
>>
>>
>>
>> ----- Original Message -----
>> From: "Patrick Aboyoun" <paboyoun at fhcrc.org>
>> To: erikwright at comcast.net
>> Cc: bioconductor at stat.math.ethz.ch
>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central
>> Subject: Re: [BioC] Count differences between sequences
>>
>> Erik,
>> Could you provide more details on your data? How long are each of the 
>> strings and how many strings do you have? Also, do you need the entire 
>> N x N distance matrix for downstream analysis or are you just looking 
>> for closest relatives?
>>
>>
>> Patrick
>>
>>
>>
>> On 3/25/10 2:29 PM, erikwright at comcast.net wrote:
>>> Hello all,
>>>
>>>
>>> I have a large DNAStringSet and I am trying to calculate its 
>> distance matrix. My DNAStrings are equal width and they are already 
>> aligned.
>>>
>>>
>>> I have tried using the stringDist() function, but it is very slow 
>> for large DNAStringSets. Is there a way to quickly calculate the 
>> number of differences between two DNAString instances?
>>>
>>>
>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I 
>> would like to know if their is a function other than stringDist() that 
>> will tell me the distance between them is 1.
>>>
>>>
>>> Thanks in advance for any help.
>>>
>>>
>>> - Erik
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list