[R] Regular expressions: offsets of groups

Titus von der Malsburg malsburg at gmail.com
Tue Sep 28 12:52:54 CEST 2010


On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
<michael.bedward at gmail.com> wrote:
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java.

Precisely.  Here's the description:

  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)

Gabor's lookbehind trick solves some special cases but it's not the
kind of general solution I'm looking for.  Let me explain what I'm
trying to achieve here.  I'm working on a package that provides tools
for processing and analyzing eye movements (we're doing reading
research).  In most situations, eye movements consist of fixations,
where the eyes are relatively stationary, and saccades, quick movements
between fixations.  A common way to represent eye movements is as
strings of symbols, where each symbol corresponds to a fixation on a
particular region.  AABC means two fixations on A followed by a fixation
on B and then one on C.  When people analyze eye movements it's often necessary
to find specific events in the eye movement record like: fixations on
the word C preceded by fixations on words D-F and followed by
fixations on words A-C.  This event can be specified using this
regular expression: "[D-F]+(C)[A-C]+".  The group (in parentheses)
marks the substring whose position in the overall string I'd like to
know.  Another application is the extraction of subsequences from a
sequence of fixations.  Note that in some situations people might need
several groups in their regular expressions, and that groups can be
nested.  In that case the user would have to indicate which group
he/she wants the offset of.  I'm not an expert on regex engines, but
I'm pretty sure the necessary information is available in the engine.
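
To make the request concrete, here is a minimal Java sketch of the
behaviour I mean, using Matcher.start(int) and Matcher.end(int) from the
page linked above.  The class name GroupOffsets, the helper groupSpans
and the example string are made up for illustration; offsets are 0-based
as in Java, so add 1 to get R-style positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOffsets {

    /**
     * Hypothetical helper: returns {start, length} of capturing group
     * `group` for every match of `regex` in `scanpath` (0-based offsets).
     */
    static List<int[]> groupSpans(String regex, String scanpath, int group) {
        Matcher m = Pattern.compile(regex).matcher(scanpath);
        List<int[]> spans = new ArrayList<>();
        while (m.find()) {
            // start(group) is -1 if the group took no part in this match
            int start = m.start(group);
            if (start >= 0) {
                spans.add(new int[] {start, m.end(group) - start});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String scanpath = "AADEFCABBDCA";  // made-up fixation sequence

        // Single group: where is the fixation on C?
        for (int[] s : groupSpans("[D-F]+(C)[A-C]+", scanpath, 1)) {
            System.out.println("C: start=" + s[0] + " length=" + s[1]);
        }
        // prints "C: start=5 length=1" and "C: start=10 length=1"

        // Nested groups: the caller picks the group by its index.
        for (int[] s : groupSpans("([D-F]+(C))[A-C]+", scanpath, 1)) {
            System.out.println("outer: start=" + s[0] + " length=" + s[1]);
        }
        // prints "outer: start=2 length=4" and "outer: start=9 length=2"
    }
}
```

Groups are numbered by the position of their opening parenthesis, so in
the nested pattern group 1 is the outer group and group 2 is (C).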

Gabor, I see you're the author of gsubfn (fantastic package!).  Do you
see a relatively simple way to expose information about group offsets
and their corresponding match lengths?  I think this could be useful
for other applications as well.  At least it seems Michael could use
it, too.  We can cook up something for ourselves but a general
solution would benefit the larger community.

   Titus


