[BioC] questions about matchPattern and vmatchPattern

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Nov 1 19:44:11 CET 2012


Hi,

On Thu, Nov 1, 2012 at 2:07 PM, wang peter <wng.peter at gmail.com> wrote:
>> thx for your reply
>> i donot think the manual can answer my question
>>
>>> For example, see inline:
>>>
>>>> subject = "TGCATTT"
>>>> Rpattern = "TGCAATTT"
>>>> result <- matchPattern(Rpattern, subject, max.mismatch= 4, min.mismatch=0)
>>>> result
>>>>   Views on a 7-letter BString subject
>>>> subject: TGCATTT
>>>> views:
>>>>     start end width
>>>> [1]     0   7     8 [ TGCATTT]
>>>> [2]     1   8     8 [TGCATTT ]
>>
>> using my ass,i think it is position on the subject. but 0 and 8 are
>> out of border of subject

This is why, in general, it's a good idea not to think with your ass.

If you read the description of matchPattern in its help file, you see
right at the top:

"""
A set of functions for finding all the occurrences (aka "matches" or
"hits") of a given pattern (typically short) in a (typically long)
reference sequence or set of reference sequences (aka the subject)
"""

In this case, you are doing the reverse: searching for a pattern that is
longer than the subject. This isn't what the function was intended for,
but ... fine.

The result is telling you where the theoretical begin and end would be
given your constraints (subject, pattern, and max.mismatch values).
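
To make "theoretical" concrete, here is a rough base-R sketch (not how
Biostrings does it internally). It slides the pattern across the
subject, lets it hang off either end, and counts every overhanging
letter as a mismatch. That counting rule is my assumption, and
count_mismatches_at() is a made-up helper, but it does reproduce the
two starts in your output.

```r
subject <- "TGCATTT"
pattern <- "TGCAATTT"
max_mismatch <- 4

# Hypothetical helper: number of mismatches when the pattern is placed at
# `start` on the subject (1-based). Pattern letters that fall outside the
# subject are counted as mismatches (an assumption, see above).
count_mismatches_at <- function(start, pattern, subject) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  pos <- start + seq_along(p) - 1L        # subject position under each pattern letter
  inside <- pos >= 1L & pos <= length(s)
  sum(!inside) + sum(p[inside] != s[pos[inside]])
}

# Try every start at which the pattern overlaps the subject by at least one letter.
starts <- (2L - nchar(pattern)):nchar(subject)
n_mis <- vapply(starts, count_mismatches_at, integer(1),
                pattern = pattern, subject = subject)
hits <- starts[n_mis <= max_mismatch]
hits
# [1] 0 1
```

Only the placements starting at 0 and 1 stay within 4 mismatches, which
is exactly what the two views report.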

The fact that these results seem weird to you (one starts at 0 and has
a space in its first position, the other overhangs the end) should give
you an idea of what to look for if you expect such error conditions.
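
In practice, "what to look for" boils down to comparing each match's
start and end against the length of the subject. A minimal sketch with
the numbers from your output hard-coded; with a real result you would
take them from start(result) and end(result), and the variable names
here are only for illustration.

```r
# Values copied from the two views printed in the quoted output.
subject_len <- nchar("TGCATTT")   # 7
match_start <- c(0L, 1L)
match_end   <- c(7L, 8L)

# A match lies fully inside the subject only if it neither starts before
# position 1 nor ends after the last letter.
in_bounds <- match_start >= 1L & match_end <= subject_len
in_bounds
# [1] FALSE FALSE
```

Neither of your two hits lies entirely on the subject.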

Allowing so many mismatches (4, which is half the length of your
pattern) is, I guess, also what brought you to this place.

If your problem is how this "oddity" is reported, then I grant that
this might be something worth talking about, and you are free to raise
the issue if you have a better way to handle this.

FWIW, I think the current response is a reasonable result to return,
but I'd grant that it's worth adding a note in the docs for this
case. Perhaps you would like to provide a patch to the documentation
describing this scenario.

Out of curiosity, what should the function do if the pattern is 2x,
5x, or 10x longer than the subject? Anything? Nothing? `stop()`?

But this wasn't your question. Your (paraphrased) question was whether
the coordinates returned by matchPattern are for the pattern or the
subject ... and, as I suggested, if you read the docs and try some toy
examples, the answer is obvious: they are positions on the subject.

>>
>> absolutely i know what is MIndex object
>> but you never answer me
>> if i use
>>
>> startIndex(result)
>>
>> it will return all of hits of on your subject or just the first one????

I didn't answer you because I suggested that you should (1) read the
docs a bit more carefully; and (more importantly!) (2) do some
exploratory analysis for yourself before you bring your question back
to the list. Since you couldn't be bothered to do either, allow
me to stop what I'm doing so that I can do it for you instead:

R> library(Biostrings)
R> m <- vmatchPattern("GATACA",
+                    DNAStringSet(c("GGATACACCCGATACACC", "CCCCCCCCCGATACA")))
R> startIndex(m)
[[1]]
[1]  2 11

[[2]]
[1] 10
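
So you get one list element per subject sequence, holding every hit in
that sequence and not just the first. If you want to double-check those
numbers without Biostrings, base R's gregexpr() gives the same starts
for an exact pattern like this one. Treat it as a sanity check only:
gregexpr() reports non-overlapping matches and knows nothing about
mismatches, so it is not a substitute for vmatchPattern.

```r
subjects <- c("GGATACACCCGATACACC", "CCCCCCCCCGATACA")

# gregexpr() returns one integer vector of match starts per input string;
# as.integer() drops the extra attributes it attaches.
starts <- lapply(gregexpr("GATACA", subjects, fixed = TRUE), as.integer)
starts
# [[1]]
# [1]  2 11
#
# [[2]]
# [1] 10

# Number of hits in each subject.
lengths(starts)
# [1] 2 1
```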

Is that clear now?

Look: please try to read the docs and explore your "problems" a bit
more before posting to the list. Everyone is quite busy, but people
still try to help. When it's clear that a poster hasn't done "their
homework" before posting a question, it can become quite frustrating
(for me, at least).

I don't think anybody would mind suggested enhancements to the
documentation, so if you have those, feel free to share. For
example, your first question might have been avoided if this behavior
were noted more clearly. But if you read the docs and understand *the
intention* of the function, then take a moment to think about the
result you got, I think the results can be explained in a rather
intuitive/obvious way. Still, as I said, I think *well thought
out and written* suggestions to fix the documentation will generally
be received warmly.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


