[BioC] vmatchPattern

Hervé Pagès hpages at fhcrc.org
Tue Mar 18 00:44:05 CET 2014


Hi Richard,

On 03/17/2014 12:53 PM, Richard Dannebaum [guest] wrote:
> When will vmatchPattern support indels?
>
>
>
>   -- output of sessionInfo():
>
>> reads[1:5]
>    A DNAStringSet instance of length 5
>      width seq
> [1]    51 NTATTGGTAGCGAATTCCTGTACTGCTTCGACACACTGTACTCTTCAACTT
> [2]    51 NCTTGGGATGCGAATTCCTGTACTGCTTCGCGGTGTCGTACTCTTCAACTT
> [3]    51 NGAGGGTGTTCGAATTCCTGTACTGCTTCGAAATCTGGTACTCTTCAACTT
> [4]    51 NTAGCGTTGACGAATTCCTGTACTGCTTCGCTCCACCGTACTCTTCAACTT
> [5]    51 NCTGTTTGCCCGAATTCCTGTACTGCTTCGTACCGTTGTACTCTTCAACTT
>
>> constant <- vmatchPattern("CGAATTCCTGTACTGCTTCG", reads[1:5], max.mismatch=1, with.indels=TRUE)
>
> Error in .XStringSet.vmatchPattern(pattern, subject, max.mismatch, min.mismatch,  :
>    vmatchPattern() does not support indels yet

Yes that's something I've had on my list for a while. Recently I was
also asked off-list when matchPDict() and family would support indels.
But that's a much harder problem.

For vmatchPattern(), the only show stopper is the unability to store
matches of variable length in an MIndex object, because of the current
design of these objects. But the core algo implemented at the C level
already supports indels.

Here is a version of vmatchPattern() that returns an IRangesList
object instead of an MIndex. It supports indels:

   vmatchPattern2 <- function(pattern, subject,
                              max.mismatch=0, min.mismatch=0,
                              with.indels=FALSE, fixed=TRUE,
                              algorithm="auto")
   {
     if (!is(subject, "XStringSet"))
         subject <- Biostrings:::XStringSet(NULL, subject)
     algo <- Biostrings:::normargAlgorithm(algorithm)
     if (Biostrings:::isCharacterAlgo(algo))
         stop("'subject' must be a single (non-empty) string ",
              "for this algorithm")
     pattern <- Biostrings:::normargPattern(pattern, subject)
     max.mismatch <- Biostrings:::normargMaxMismatch(max.mismatch)
     min.mismatch <- Biostrings:::normargMinMismatch(min.mismatch, 
max.mismatch)
     with.indels <- Biostrings:::normargWithIndels(with.indels)
     fixed <- Biostrings:::normargFixed(fixed, subject)
     algo <- Biostrings:::selectAlgo(algo, pattern,
                                     max.mismatch, min.mismatch,
                                     with.indels, fixed)
     C_ans <- .Call2("XStringSet_vmatch_pattern", pattern, subject,
                     max.mismatch, min.mismatch,
                     with.indels, fixed, algo,
                     "MATCHES_AS_RANGES",
                     PACKAGE="Biostrings")
     unlisted_ans <- IRanges(start=unlist(C_ans[[1L]], use.names=FALSE),
                             width=unlist(C_ans[[2L]], use.names=FALSE))
     relist(unlisted_ans, C_ans[[1L]])
   }

You'll need Biostrings 2.31.15 for this code to work, which should
become available thru biocLite() for BioC devel users in the next 24
hours or so.

A clean fix in Biostrings would be to do this (i.e. change the type 
returned by vmatchPattern()) or to modify the MIndex internals to
support matches of variable length. Probably the former. Changing
the type returned by a function requires some precautions though
because it tends to break existing code.

Hopefully I manage to find some time to work on these things after
the BioC 2.14 release.

Cheers,
H.

>
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list