[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 17:29:55 CEST 2009

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Mark Heckmann
> Sent: Tuesday, June 09, 2009 4:45 AM
> To: r-help at r-project.org
> Cc: Waclaw.Marcin.Kusnierczyk at idi.ntnu.no; marc_schwartz at me.com
> Subject: Re: [R] using regular expressions to retrieve a 
> digit-digit-dot structure from a string
> 
> Hey all,
> 
> Thanks for your help. Your answers solved the problem I posted, and that
> is just when I noticed that I misspecified the problem ;)
> My problem is to split German texts into sentences. Unfortunately I
> haven't found an R package doing this kind of text separation in German,
> so I try it "manually".
> 
> Just using the dot as separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
> 
> Here I want the algorithm to separate the string only at the positions
> where the dot is not preceded by a digit. The R snippets posted pick out
> "1." and "19."
> 
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
> > gregexpr('(?<=[0-9])[.]',txt, perl=T)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
> 
> But I just need it the other way round. So I tried:
> 
> > strsplit(txt, "[[:alpha:]]\\." , perl=T)
> [[1]]
> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
> 
> But this erases the last letter from each sentence. Does someone know a
> solution?

In S+, strsplit() has an argument called subpattern that lets you
specify which parenthesized part of the regular expression
to use as the split point.  It is akin to the \\<digit> used in the
replacement argument of sub and gsub.  E.g., to split the string
at the sequence of spaces after a period, but not after a period preceded
by a digit, do:
   > txt <- "One January 1. I saw Rick. He was born in the 19. century."
   > strsplit(txt, "[^[:digit:]]\\.([[:space:]]+)", subpattern=1)
   [[1]]:
   [1] "One January 1. I saw Rick."      "He was born in the 19. century."
subpattern=0, the default, means text matched by the entire regular
expression.  regexpr has the same argument.  Would such an argument
solve your problem?
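
For comparison, R's strsplit() has no subpattern argument, so a rough
equivalent has to be built differently.  The following is only a sketch
and not an S+ feature: it uses the \\1 backreference in gsub() to keep the
"non-digit plus period" and turn the whitespace after it into a newline,
then splits on that newline.  The helper name split_sentences is made up
for illustration, and the text is assumed to contain no newlines itself.

```r
# Hypothetical helper: split after a period unless a digit precedes it.
split_sentences <- function(txt) {
  # Keep the captured "non-digit + period" (\\1) and replace the
  # whitespace that follows it with a newline marker.
  marked <- gsub("([^[:digit:]]\\.)[[:space:]]+", "\\1\n", txt)
  # Split on the marker; assumes txt contains no newlines of its own.
  strsplit(marked, "\n", fixed = TRUE)[[1]]
}

txt <- "One January 1. I saw Rick. He was born in the 19. century."
sentences <- split_sentences(txt)
print(sentences)
```

A Perl-style lookbehind, e.g.
strsplit(txt, "(?<=[^0-9]\\.)[[:space:]]+", perl=TRUE), should behave the
same on this example, since the lookbehind has a fixed width of two
characters.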

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 
 
> TIA
> Mark
> 
> -------------------------------
> 
> Mark Heckmann
> + 49 (0) 421 - 1614618
> www.markheckmann.de
> R-Blog: http://ryouready.wordpress.com
> 
> 
> 
> 
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
> Sent: Tuesday, June 9, 2009 12:48
> To: Wacek Kusnierczyk
> Cc: Mark Heckmann; r-help at r-project.org
> Subject: Re: [R] using regular expressions to retrieve a
> digit-digit-dot structure from a string
> 
> On Tue, Jun 9, 2009 at 3:04 AM, Wacek
> Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> > Gabor Grothendieck wrote:
> >> On Mon, Jun 8, 2009 at 7:18 PM, Wacek
> >> Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> >>
> >>> Gabor Grothendieck wrote:
> >>>
> >>>> Try this.  See ?regex for more.
> >>>>
> >>>>
> >>>>
> >>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
> >>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
> >>>>>
> >>>>>
> >>>> [1] 24
> >>>> attr(,"match.length")
> >>>> [1] 1
> >>>>
> >>>>
> >>> yes, but
> >>>
> >>>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >>>    # 2 5 9
> >>>
> >>
> >> Yes, it should be:
> >>
> >>
> >>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
> >>>
> >> [[1]]
> >> [1] 5 9
> >> attr(,"match.length")
> >> [1] 1 1
> >>
> >> which displays the position of every dot that is preceded
> >> immediately by a digit.  Or just replace gregexpr with regexpr
> >> if it's intended that it match only one.
> >>
> >
> > i guess what was needed was something like
> >
> >    gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >    # 5
> >
> > which won't work, however, because pcre does not support
> > variable-width lookbehinds.
> 
> No, what I wrote was what I intended.  I don't think we are discussing
> the answer at this point but just the interpretation of what was
> intended.  You are including the word boundary in the question and I am
> not.  I think it's also possible that regexpr is what is wanted, not
> gregexpr, but at this point I think the poster has enough answers that
> he can complete it himself by considering what he wants and using one
> of ours or a suitable modification.
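
One possible modification for the word-boundary reading, given only as a
sketch: a capture group avoids the fixed-width lookbehind limit, because
with perl=TRUE gregexpr() reports where each parenthesized group starts
in the "capture.start" attribute.  This assumes an R version recent
enough to return that attribute; the helper name dot_after_number is made
up for illustration.

```r
# Hypothetical helper: positions of dots that directly follow a
# standalone run of digits (a word boundary before the digits).
dot_after_number <- function(x) {
  m <- gregexpr("\\b[0-9]+([.])", x, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with -1.
  if (m[1] == -1) return(integer(0))
  # One row per match, one column per capture group; group 1 is the dot.
  as.vector(attr(m, "capture.start"))
}

pos <- dot_after_number("a. 1. a1.")
print(pos)
```

The whole match still starts at the first digit; only the captured group
gives the position of the dot itself.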
> 