[BioC] Count differences between sequences

Mon Mar 29 16:29:28 CEST 2010

On 03/29/2010 05:21 AM, Erik Wright wrote:
> Hi Martin,
> 
> Thanks for sharing your method of calculating a distance matrix.  I
> would never have thought of implementing the distance matrix in such
> a way.  Correct me if I am wrong, but any algorithm for computing a
> distance matrix will take at least polynomial time because it
> inherently requires comparing N to (N-1) strings.

Say you had 10 nucleotides with ids 0:9

0 1 2 3 4 5 6 7 8 9
A C A T G C A T T A

If you grouped them as

A: 0, 2, 6, 0
C: 1, 5
G: 4
T: 3, 7, 8

Then you'd know about distances (e.g., Hamming) in a useful way, e.g.,
you could draw a dendrogram with nodes labeled with nucleotide and
corresponding ids, or ask which groups of ids have distance 0 from each
other.

A naive approach for grouping nucleotides would be to look up each in a
'dictionary' of the four possible nucleotides. The dictionary might be
"A", "C", "G", "T", and the look-up would ask, is id 0, with identifier
A, equal to the first entry in the dictionary, "A"? If yes, then
classify id 0 as an "A", if not then look at the next entry in the
dictionary. The worst case scenario would be if the nucleotides were all
"T", then you'd have to make 4 lookups for each identifier before
classifying it. But that's only 10 x 4 comparisons, and not 10 x (10 - 1).

Although the method I mentioned below is still slow compared with a C
implementation, it is about twice as fast as an N x (N - 1) comparison.

> My problem has to do with the handling of gap characters in
> Biostrings methods.  I would like to treat the gap character as a
> wildcard.  So, in another sense, what I am looking for is a way to
> extend the IUPAC_CODE_MAP to include the gap character: 
> IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT")
> 
> That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-")
> would remain zero.  Perhaps the easiest way to do this would be to
> somehow replace all my gaps with the "N" character.

A C level solution will have to wait for Patrick or someone else to
implement it; an R solution might construct an appropriate matrix dist
that includes "-", and use that in f0d (checking that f0d actually works!).

Martin

> Thanks!,
> Erik
> 
> 
> On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote:
> 
>> On 03/28/2010 01:17 PM, Martin Morgan wrote:
>>> Hi Erik et al.,
>>>
>>> Not offering anything practical, but...
>>>
>>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>>>> Hi Patrick, Michael, HervÃ©, and Martin,
>>
>>> f0 <- function(dna)
>>> {
>>>    l <- unique(width(dna))
>>>    w <- length(dna)
>>>
>>>    dna0 <- factor(unlist(strsplit(as.character(dna), "")))
>>>    idx <- seq(1, l * w, l)  # 'column' offsets
>>>    i <- seq_len(w)
>>>
>>>    d <- matrix(l, w, w)
>>>    for (j in seq_len(l)-1L) {
>>>        grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
>>> ##         for (k in grp)
>>> ##             d[k, k] = d[k, k] - 1L
>>>    }
>>>    d
>>> }
>>>
>>> Internally, 'split' does its work in 2N comparisons. This important part
>>> is that comparisons are accomplished in linear rather than polynomial time.
>>
>>> same nucleotide (i.e., in k). This is unfortunately the slow part of the
>>> revised code, and it is polynomial time -- in the best case, for 4
>>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
>>> and in the worst case (all nucleotides identical) we do N^2 updates. We
>>> pay a heavy price for doing this at the R level, and basically get
>>> swamped by the update rather than sort cost.
>>
>>> But we shouldn't lose track of the fact that we can do the sorting much
>>> more efficiently. I've also looked a little at sort.list(,
>>> method="radix"), which orders a factor very quickly.
>>
>> dampening my enthusiasm a bit here, as even implemented in C the update
>> operation still dominates, and the algorithm remains polynomial.
>>
>> Martin
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793