[BioC] questions about matchPattern and vmatchPattern

Hervé Pagès hpages at fhcrc.org
Thu Nov 1 20:44:17 CET 2012


Hi,

The general comments/recommendations Steve is giving you are
worth reading and I hope they will help you improve how you
ask questions on this list.

I also wanted to mention that in Biostrings 2.27.6 (BioC devel)
I've improved the "show" method for MIndex objects so now they
are displayed like other RangesList objects (which makes them more
user-friendly). With Steve's example:

   > m <- vmatchPattern("GATACA",
   +          DNAStringSet(c("GGATACACCCGATACACC", "CCCCCCCCCGATACA")))
   > m
   MIndex object of length 2
   [[1]]
   IRanges of length 2
       start end width
   [1]     2   7     6
   [2]    11  16     6

   [[2]]
   IRanges of length 1
       start end width
   [1]    10  15     6

Note that, with matchPattern/vmatchPattern/matchPDict, overlapping
matches are reported (which is not the case with grep and family):

   > vmatchPattern("GAGA", DNAStringSet(c("CCGAGAGAT", "GACGATA")))
   MIndex object of length 2
   [[1]]
   IRanges of length 2
       start end width
   [1]     3   6     4
   [2]     5   8     4

   [[2]]
   IRanges of length 0
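
To make the contrast with the grep family concrete, here is a minimal
sketch in base R only (no Biostrings involved; the variable names are
just for illustration). A plain gregexpr() call resumes searching after
the end of each hit, so the second, overlapping hit is skipped. Wrapping
the pattern in a zero-width lookahead is one common workaround, not
something Biostrings does internally:

```r
subject <- "CCGAGAGAT"

# Plain search: after the hit at 3..6 the scan resumes at position 7,
# so the overlapping hit starting at 5 is never seen.
non_overlapping <- as.integer(gregexpr("GAGA", subject, fixed = TRUE)[[1]])
non_overlapping
# 3

# Zero-width lookahead (needs perl = TRUE): each match consumes no
# characters, so every start position is tested.
overlapping <- as.integer(gregexpr("(?=GAGA)", subject, perl = TRUE)[[1]])
overlapping
# 3 5
```

The second result agrees with the start column of the first list element
shown above.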

Historically, MIndex objects predate RangesList objects, which
explains the odd interface (startIndex etc.). They also lag
behind RangesList in terms of functionality. Modernizing them has
been on my list for a long time. Hopefully soon.

Cheers,
H.


On 11/01/2012 11:44 AM, Steve Lianoglou wrote:
> Hi,
>
> On Thu, Nov 1, 2012 at 2:07 PM, wang peter <wng.peter at gmail.com> wrote:
>>> thx for your reply
>>> i do not think the manual can answer my question
>>>
>>>> For example, see inline:
>>>>
>>>>> subject = "TGCATTT"
>>>>> Rpattern = "TGCAATTT"
>>>>> result <- matchPattern(Rpattern, subject, max.mismatch= 4, min.mismatch=0)
>>>>> result
>>>>>    Views on a 7-letter BString subject
>>>>> subject: TGCATTT
>>>>> views:
>>>>>      start end width
>>>>> [1]     0   7     8 [ TGCATTT]
>>>>> [2]     1   8     8 [TGCATTT ]
>>>
>>> using my ass,i think it is position on the subject. but 0 and 8 are
>>> out of border of subject
>
> This is why, in general, it's a good idea not to think with your ass.
>
> If you read the description of matchPattern in its help file, you see
> right at the top:
>
> """
> A set of functions for finding all the occurrences (aka "matches" or
> "hits") of a given pattern (typically short) in a (typically long)
> reference sequence or set of reference sequences (aka the subject)
> """
>
> In this case, you are doing the reverse -- searching for a longer pattern
> than the subject. This wasn't what it was intended for, but ... fine.
>
> The result is telling you where the theoretical begin and end would be
> given your constraints (subject, pattern, and max.mismatch values).
>
> The fact that these results seem weird to you -- one starts at 0 (and
> also has a space in its first position), and the other overhangs the
> end -- should give you an idea of what to look for if you expect such
> error conditions.
>
> The fact that you are allowing for (so many) mismatches (half the
> length of your pattern) I guess also brought you to this place.
>
> If your problem is how this "oddity" is reported, then I grant that
> this might be something worth talking about, and you are free to raise
> the issue if you have a better way to handle this.
>
> FWIW, I think the current response is a reasonable result to return,
> but I'd grant that it's worth adding a note in the docs for this
> case -- perhaps you would like to provide a patch to the documentation
> describing this scenario.
>
> Out of curiosity, what should the function do if the pattern is 2x,
> 5x, or 10x longer than the subject? Anything? Nothing? `stop()`?
>
> But, this wasn't your question. Your (paraphrased) question was
> wondering about the result of matchPattern and whether or not the
> coordinates returned are for the pattern or the subject ... and, as I
> suggested, by reading the docs and trying some toy examples, the
> answer is obvious.
>
>>>
>>> absolutely i know what is MIndex object
>>> but you never answer me
>>> if i use
>>>
>>> startIndex(result)
>>>
>>> it will return all of the hits on your subject or just the first one?
>
> I didn't answer you because I suggested that you should (1) read the
> docs a bit more carefully; and (more importantly!) (2) do some
> exploratory analysis for yourself before you bring your question back
> to the list, but since you couldn't be bothered to do either,  allow
> me to stop what I'm doing so that I can do it for you instead:
>
> R> m <- vmatchPattern("GATACA", DNAStringSet(c("GGATACACCCGATACACC",
> "CCCCCCCCCGATACA")))
> R> startIndex(m)
> [[1]]
> [1]  2 11
>
> [[2]]
> [1] 10
>
> Is that clear now?
>
> Look: please try and read the docs and explore your "problems" a bit
> more before posting to the list -- everyone is quite busy, but still
> try to help. When it's clear that the poster doesn't do "their
> homework" before posting a question, it can become quite frustrating
> (for me, at least).
>
> I don't think anybody would mind suggested enhancements to the
> documentation, so if you have those -- feel free to share. For
> example, your first question might have been avoided if it was noted
> more clearly -- but if you read the docs and understand *the
> intention* of the function, then take a moment to think about the
> result you got, I think the results can be explained in a rather
> intuitive/obvious way. But still -- as I said -- I think *well thought
> out and written* suggestions to fix the documentation will generally
> be received warmly.
>
> -steve
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


