[BioC] Count differences between sequences

Mon Mar 29 14:21:37 CEST 2010

Hi Martin,

Thanks for sharing your method of calculating a distance matrix.  I would never have thought of implementing the distance matrix in such a way.  Correct me if I am wrong, but any algorithm for computing a distance matrix will take at least polynomial time because it inherently requires comparing N to (N-1) strings.

My problem has to do with the handling of gap characters in Biostrings methods.  I would like to treat the gap character as a wildcard.  So, in another sense, what I am looking for is a way to extend the IUPAC_CODE_MAP to include the gap character:
IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT")

That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") would remain zero.  Perhaps the easiest way to do this would be to somehow replace all my gaps with the "N" character.

Thanks!,
Erik

On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote:

> On 03/28/2010 01:17 PM, Martin Morgan wrote:
>> Hi Erik et al.,
>> 
>> Not offering anything practical, but...
>> 
>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>>> Hi Patrick, Michael, HervÃ©, and Martin,
> 
>> f0 <- function(dna)
>> {
>>    l <- unique(width(dna))
>>    w <- length(dna)
>> 
>>    dna0 <- factor(unlist(strsplit(as.character(dna), "")))
>>    idx <- seq(1, l * w, l)  # 'column' offsets
>>    i <- seq_len(w)
>> 
>>    d <- matrix(l, w, w)
>>    for (j in seq_len(l)-1L) {
>>        grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
>> ##         for (k in grp)
>> ##             d[k, k] = d[k, k] - 1L
>>    }
>>    d
>> }
>> 
>> Internally, 'split' does its work in 2N comparisons. This important part
>> is that comparisons are accomplished in linear rather than polynomial time.
> 
>> same nucleotide (i.e., in k). This is unfortunately the slow part of the
>> revised code, and it is polynomial time -- in the best case, for 4
>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
>> and in the worst case (all nucleotides identical) we do N^2 updates. We
>> pay a heavy price for doing this at the R level, and basically get
>> swamped by the update rather than sort cost.
> 
>> But we shouldn't lose track of the fact that we can do the sorting much
>> more efficiently. I've also looked a little at sort.list(,
>> method="radix"), which orders a factor very quickly.
> 
> dampening my enthusiasm a bit here, as even implemented in C the update
> operation still dominates, and the algorithm remains polynomial.
> 
> Martin
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor