[R] grepping and splitting (with R 2.1.1)

Stefan Th. Gries stgries_lists at arcor.de
Mon Sep 12 16:24:30 CEST 2005

Hi R experts

I have the following regular expression problem. I am writing a basic corpus retrieval program, i.e. a concordancer/function where a user enters
- a set or a directory of text files to search;
- a regular expression to search for in these files.

I want to provide an output in which the matches of the regular expression are listed in one central column and the neighboring columns given the words before and after the matching word. For example, a concordance of the word "the" for the previous sentence with a user-defined span of 3 would look like this:
-3	-2	-1	0	1	2	3
output	in	which	the	matches	of	the
the	matches	of	the	regular	expression	are
central	column	and	the	neighboring	columns	given
neighboring	columns	given	the	words	before	and
before	and	after	the	matching	word	.
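
For the plain-text case, the table above could be produced roughly as follows. This is only a minimal sketch: the function name concordance() and the punctuation-splitting step are assumptions made for illustration, not the actual program.

```r
# Hypothetical helper: one row per hit, columns -span ... span,
# with "" where the context runs past the start or end of the tokens.
concordance <- function(tokens, pattern, span = 3) {
  offsets <- -span:span
  hits <- grep(pattern, tokens, perl = TRUE)
  out <- matrix("", nrow = length(hits), ncol = length(offsets),
                dimnames = list(NULL, as.character(offsets)))
  for (k in seq_along(hits)) {
    idx <- hits[k] + offsets
    ok <- idx >= 1 & idx <= length(tokens)   # positions that exist in the line
    out[k, ok] <- tokens[idx[ok]]
  }
  out
}

sentence <- paste("I want to provide an output in which the matches of the",
                  "regular expression are listed in one central column and",
                  "the neighboring columns given the words before and after",
                  "the matching word.")
# Assumption: punctuation marks count as separate tokens, as in the table above.
tokens <- strsplit(gsub("([.,;:!?])", " \\1", sentence), " +")[[1]]
kwic <- concordance(tokens, "^the$", span = 3)
```

The result can be written out tab-separated with write.table(kwic, sep = "\t", quote = FALSE, row.names = FALSE).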

As you can see, there may be multiple hits per line. This all works perfectly fine in cases where the regular expression matches just one of the kind of elements that are separated into columns in the table. 'Unfortunately', apart from 'normal' text files, I also have text files in which every word is preceded by a tag giving its word class, for example

a<-c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
     "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")

Now, as long as the regular expression entered by the user is something like
   b<-"<w TO0>to"
or even
   b<-"(?Ui)<w VVN>[^<]*<"
this works fine: I identify hits using grep(b, a, perl=T), split up the line using strsplit, and provide as many words before and after my search string as are necessary (and available in the line).
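
The single-element case can be sketched like this. It is an illustration under assumptions, not the original code: in particular, splitting at the space in front of each tag (instead of at "<w ") is a choice made here so that the tags stay attached to their words.

```r
a <- c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
       "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")
b <- "<w TO0>to"

# Lines that contain at least one hit
hit_lines <- grep(b, a, perl = TRUE)

# Assumption: split at every space that is directly followed by a tag,
# so each element is one tag plus its word, e.g. "<w VV1>find"
tokens <- strsplit(a, " (?=<)", perl = TRUE)

# Positions of the hits among the elements of each matching line
hits_per_line <- lapply(tokens[hit_lines], function(tk) grep(b, tk, perl = TRUE))
```

With the hit positions known per line, the context columns are just the neighboring elements of tokens.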

But if the regular expression entered by a user (when prompted by scan(nmax=1, what="char")) is
   b<-"(?Ui)(<w TO0>to <w VV.>[^<]*<)"
I run into several related problems. As you all know, grep only tells me which lines contain a hit and regexpr only gives me the first hit per line, which is how I identified the lines in the first place, but for the desired output I need all the hits per line together with their context. Obviously, when I split up the line using strsplit with "<w " as a separator so that I can get all hits and all words for the columns -3 to -1 and 1 to 3, the expression matched by the search string b is split up as well and can no longer be put into one tab-separated central column, and I don't seem to be able to extract all hits, store them, and insert them again at a later stage ...

Basically, I need to split up each element of the vector that contains at least one match into x parts, where x is the number of hits plus the number of elements that result when the surrounding material is split up, so that I can generate this kind of display (I leave aside the issue of spaces for now and transpose the above kind of display for expository reasons):

(the first hit in a[1])
0	<w TO0>to <w VV1>find
1	<w VVN>expected
2	<w TO0>to
3	<w VV2>skivvy

and the next line of the output would be the second hit in a[1]:

-3	<w TO0>to
-2	<w VV1>find
-1	<w VVN>expected
0	<w TO0>to <w VV2>skivvy

and the next line would be the only hit in a[2]. The short question after this long intro now is, is there any way of splitting up the elements containing matches in such a way?
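
One possible way to get such a split is sketched below, under loud assumptions: it relies on gregexpr(), which is available in newer versions of R but, as far as I can tell, not in R 2.1.1; the helper names split_tagged(), pad() and concord_tagged() are made up for illustration; and a trailing "<" that the pattern consumes from the next tag is simply stripped from the hit.

```r
a <- c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
       "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")
b <- "(?Ui)(<w TO0>to <w VV.>[^<]*<)"

# Hypothetical helper: split context material into tag+word elements
split_tagged <- function(x) {
  x <- gsub("^\\s+|\\s+$", "", x, perl = TRUE)
  if (nchar(x) == 0) character(0) else strsplit(x, " (?=<)", perl = TRUE)[[1]]
}

# Hypothetical helper: pad a context vector with "" up to length n
pad <- function(x, n, left = TRUE) {
  fill <- rep("", n - length(x))
  if (left) c(fill, x) else c(x, fill)
}

concord_tagged <- function(lines, pattern, span = 3) {
  rows <- list()
  all_hits <- gregexpr(pattern, lines, perl = TRUE)  # every hit, not just the first
  for (i in seq_along(lines)) {
    starts <- all_hits[[i]]
    if (starts[1] == -1) next                        # no hit in this line
    lens <- attr(starts, "match.length")
    for (j in seq_along(starts)) {
      hit <- substring(lines[i], starts[j], starts[j] + lens[j] - 1)
      # Assumption: drop the "<" (and space) borrowed from the following tag
      hit <- sub("\\s*<?$", "", hit, perl = TRUE)
      before <- split_tagged(substring(lines[i], 1, starts[j] - 1))
      after  <- split_tagged(substring(lines[i], starts[j] + nchar(hit)))
      rows[[length(rows) + 1]] <- c(pad(tail(before, span), span, left = TRUE),
                                    hit,
                                    pad(head(after, span), span, left = FALSE))
    }
  }
  if (length(rows) == 0) return(matrix("", nrow = 0, ncol = 2 * span + 1))
  out <- do.call(rbind, rows)
  colnames(out) <- as.character(-span:span)
  out
}

kwic <- concord_tagged(a, b, span = 3)
```

Each hit stays in one piece in the central column, while only the material to its left and right is split. Note that because the pattern consumes the "<" of the following tag, two directly adjacent hits would not both be found with this sketch.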

I use R 2.1.1 on a Windows XP Pro SP2 machine (with Perl 5.8.7 in case that matters for PCRE). Thanks,
