[BioC] Count differences between sequences

Hervé Pagès hpages at fhcrc.org
Tue Mar 30 02:09:26 CEST 2010



Erik Wright wrote:
> Hi Martin,
> 
> Thanks for sharing your method of calculating a distance matrix.  I would never have thought of implementing the distance matrix in such a way.  Correct me if I am wrong, but any algorithm for computing a distance matrix will take at least polynomial time because it inherently requires comparing N to (N-1) strings.
> 
> My problem has to do with the handling of gap characters in Biostrings methods.  I would like to treat the gap character as a wildcard.  So, in another sense, what I am looking for is a way to extend the IUPAC_CODE_MAP to include the gap character:
> IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT")
> 
> That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") would remain zero.  Perhaps the easiest way to do this would be to somehow replace all my gaps with the "N" character.

That's indeed the easy and natural way to do it. But I'm confused now 
because in a previous email you said you wanted the gap letter to *not*
match any other letter, not even itself...

H.

> 
> Thanks!,
> Erik
> 
> 
> On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote:
> 
>> On 03/28/2010 01:17 PM, Martin Morgan wrote:
>>> Hi Erik et al.,
>>>
>>> Not offering anything practical, but...
>>>
>>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>>>> Hi Patrick, Michael, Hervé, and Martin,
>>> f0 <- function(dna)
>>> {
>>>    l <- unique(width(dna))
>>>    w <- length(dna)
>>>
>>>    dna0 <- factor(unlist(strsplit(as.character(dna), "")))
>>>    idx <- seq(1, l * w, l)  # 'column' offsets
>>>    i <- seq_len(w)
>>>
>>>    d <- matrix(l, w, w)
>>>    for (j in seq_len(l)-1L) {
>>>        grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
>>> ##         for (k in grp)
>>> ##             d[k, k] = d[k, k] - 1L
>>>    }
>>>    d
>>> }
>>>
>>> Internally, 'split' does its work in 2N comparisons. This important part
>>> is that comparisons are accomplished in linear rather than polynomial time.
>>> same nucleotide (i.e., in k). This is unfortunately the slow part of the
>>> revised code, and it is polynomial time -- in the best case, for 4
>>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
>>> and in the worst case (all nucleotides identical) we do N^2 updates. We
>>> pay a heavy price for doing this at the R level, and basically get
>>> swamped by the update rather than sort cost.
>>> But we shouldn't lose track of the fact that we can do the sorting much
>>> more efficiently. I've also looked a little at sort.list(,
>>> method="radix"), which orders a factor very quickly.
>> dampening my enthusiasm a bit here, as even implemented in C the update
>> operation still dominates, and the algorithm remains polynomial.
>>
>> Martin
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list