[BioC] Distance btw sequences from different sets

Tue Mar 19 04:12:30 CET 2013

There is also a recent stringdist package on CRAN.  I have not used it though.

Kasper

On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Benilton,
>
>
> On 03/18/2013 05:20 PM, Benilton Carvalho wrote:
>>
>> Hi,
>>
>> I was wondering what would be the most efficient strategy to get distances
>> between sequences that belong to two different sets. So far, what I am
>> doing is:
>>
>> library(Biostrings)
>> set.seed(1)
>> alph = DNA_ALPHABET[1:4]
>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6)
>> set1 = apply(set1, 1, paste, collapse='')
>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4)
>> set2 = apply(set2, 1, paste, collapse='')
>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist))
>
>
> Note that by default, stringDist() computes the Levenshtein distance
> aka "the edit distance" between strings. This is controlled via the
> 'method' argument. With the toy data you generate above, where all
> the strings have the same length, it often makes sense to use the
> Hamming distance instead i.e. to count the nb of mismatches between
> strings. This is of course up to the user.
>
> However it's important to keep in mind that computing the Hamming
> distance is much faster than computing the Levenshtein distance.
> No surprise: the algo for computing the former is trivial, but the
> algo for computing the latter uses a complicated and expensive
> dynamic programming approach.
>
> That being said, and back to your question, if you don't mind
> calculating a little bit more distances than necessary (and thus
> also using more memory than necessary), you can call stringDist()
> on the big set formed by putting the 2 different sets together,
> and then extract only the part of the big matrix that corresponds
> to the outer product of the 2 original sets:
>
>   outerStringDists1 <- function(x, y, method="levenshtein")
>   {
>     d <- stringDist(c(x, y), method=method)
>     m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)]
>     if (storage.mode(m) != storage.mode(d))
>         storage.mode(m) <- storage.mode(d)
>     dimnames(m) <- NULL
>     t(m)
>   }
>
> With this solution, there are more loops (looping happens inside
> stringDist()) but the looping is much faster (it's done in C).
> In the best case, i.e. when the 2 sets have the same size, this
> will be hundred times faster than using outer() + apply(). Also,
> in that case, the big intermediate matrix will be "only" 4 times
> bigger than the final matrix, which is probably a price that is
> OK to pay for that kind of speed-up. Nonetheless the situation
> will quickly degrade if one set is much bigger than the other,
> e.g. an order of magnitude bigger or more. For example, it would
> not be efficient at all to use outerStringDists1() if 1 set had
> 100k strings and the other 100.
>
> If you are in this situation, and if you want to compute the Hamming
> distance, then here is a solution that uses sapply() over the smallest
> of the 2 sets:
>
>   outerStringDists2 <- function(x, y)
>   {
>     if (!is(x, "XStringSet"))
>         x <- as(x, "XStringSet")
>     if (!is(y, "XStringSet"))
>         y <- as(y, "XStringSet")
>     if (length(x) < length(y))
>         return(t(outerStringDists2(y, x)))
>     if (length(unique(c(width(x), width(y)))) != 1L)
>         stop("strings in 'x' and 'y' must have the same length")
>     sapply(seq_along(y), function(j) neditAt(y[[j]], x))
>   }
>
> Still faster than the outer() + apply() solution but can only compute
> the Hamming distance.
>
> Sounds like maybe we should have an outerStringDists() function in
> Biostrings?
>
> Cheers,
> H.
>
>
>
>>
>> Thanks a lot for any suggestion,
>> benilton
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor