[BioC] Distance btw sequences from different sets

Tue Mar 19 04:42:42 CET 2013

Thanks, Kasper... stringdist does exactly what I need. b

2013/3/19 Benilton Carvalho <beniltoncarvalho at gmail.com>:
> Thanks, Kasper, installing it now. b
>
> 2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at gmail.com>:
>> There is also a recent stringdist package on CRAN.  I have not used it though.
>>
>> Kasper
>>
>> On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>> Hi Benilton,
>>>
>>>
>>> On 03/18/2013 05:20 PM, Benilton Carvalho wrote:
>>>>
>>>> Hi,
>>>>
>>>> I was wondering what would be the most efficient strategy to get distances
>>>> between sequences that belong to two different sets. So far, what I am
>>>> doing is:
>>>>
>>>> library(Biostrings)
>>>> set.seed(1)
>>>> alph = DNA_ALPHABET[1:4]
>>>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6)
>>>> set1 = apply(set1, 1, paste, collapse='')
>>>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4)
>>>> set2 = apply(set2, 1, paste, collapse='')
>>>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist))
>>>
>>>
>>> Note that by default, stringDist() computes the Levenshtein distance
>>> aka "the edit distance" between strings. This is controlled via the
>>> 'method' argument. With the toy data you generate above, where all
>>> the strings have the same length, it often makes sense to use the
>>> Hamming distance instead i.e. to count the nb of mismatches between
>>> strings. This is of course up to the user.
>>>
>>> However it's important to keep in mind that computing the Hamming
>>> distance is much faster than computing the Levenshtein distance.
>>> No surprise: the algo for computing the former is trivial, but the
>>> algo for computing the latter uses a complicated and expensive
>>> dynamic programming approach.
>>>
>>> That being said, and back to your question, if you don't mind
>>> calculating a little bit more distances than necessary (and thus
>>> also using more memory than necessary), you can call stringDist()
>>> on the big set formed by putting the 2 different sets together,
>>> and then extract only the part of the big matrix that corresponds
>>> to the outer product of the 2 original sets:
>>>
>>>   outerStringDists1 <- function(x, y, method="levenshtein")
>>>   {
>>>     d <- stringDist(c(x, y), method=method)
>>>     m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)]
>>>     if (storage.mode(m) != storage.mode(d))
>>>         storage.mode(m) <- storage.mode(d)
>>>     dimnames(m) <- NULL
>>>     t(m)
>>>   }
>>>
>>> With this solution, there are more loops (looping happens inside
>>> stringDist()) but the looping is much faster (it's done in C).
>>> In the best case, i.e. when the 2 sets have the same size, this
>>> will be hundred times faster than using outer() + apply(). Also,
>>> in that case, the big intermediate matrix will be "only" 4 times
>>> bigger than the final matrix, which is probably a price that is
>>> OK to pay for that kind of speed-up. Nonetheless the situation
>>> will quickly degrade if one set is much bigger than the other,
>>> e.g. an order of magnitude bigger or more. For example, it would
>>> not be efficient at all to use outerStringDists1() if 1 set had
>>> 100k strings and the other 100.
>>>
>>> If you are in this situation, and if you want to compute the Hamming
>>> distance, then here is a solution that uses sapply() over the smallest
>>> of the 2 sets:
>>>
>>>   outerStringDists2 <- function(x, y)
>>>   {
>>>     if (!is(x, "XStringSet"))
>>>         x <- as(x, "XStringSet")
>>>     if (!is(y, "XStringSet"))
>>>         y <- as(y, "XStringSet")
>>>     if (length(x) < length(y))
>>>         return(t(outerStringDists2(y, x)))
>>>     if (length(unique(c(width(x), width(y)))) != 1L)
>>>         stop("strings in 'x' and 'y' must have the same length")
>>>     sapply(seq_along(y), function(j) neditAt(y[[j]], x))
>>>   }
>>>
>>> Still faster than the outer() + apply() solution but can only compute
>>> the Hamming distance.
>>>
>>> Sounds like maybe we should have an outerStringDists() function in
>>> Biostrings?
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>>
>>>>
>>>> Thanks a lot for any suggestion,
>>>> benilton
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor