[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 16:40:49 CEST 2009

Thanks,

Now it works great. I modified it a bit so that sentences are also split
at other punctuation marks (. ? ! :).

strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]

e.g.

> strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]
[1] "One January 1. I saw Rick?"      "He was born in the 19. century."
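
The call above could be wrapped in a small helper along these lines. This is
only a sketch: the function name split_sentences and the marker argument are
made up for illustration, and the marker is assumed to contain no regex
metacharacters. The check guards against text that already contains the
marker, which would otherwise produce spurious splits.

```r
# Hypothetical helper (not from any package): mark sentence ends, then split.
split_sentences <- function(txt, marker = "<EOS>") {
  # Assumption: the marker never occurs in the text and has no regex specials.
  stopifnot(!any(grepl(marker, txt, fixed = TRUE)))
  # Append the marker after a letter followed by one of . ? ! :
  marked <- gsub("([[:alpha:]][.?!:])", paste0("\\1", marker), txt)
  # Split on the marker plus any spaces that follow it.
  strsplit(marked, paste0(marker, " *"))
}

txt <- "One January 1. I saw Rick? He was born in the 19. century."
res <- split_sentences(txt)[[1]]
print(res)
# [1] "One January 1. I saw Rick?"      "He was born in the 19. century."
```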

-------------------------------

Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at me.com]
Sent: Tuesday, June 9, 2009 14:17
To: Mark Heckmann
Cc: r-help at r-project.org; 'Gabor Grothendieck';
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Subject: Re: AW: [R] using regular expressions to retrieve a digit-digit-dot
structure from a string

On Jun 9, 2009, at 6:44 AM, Mark Heckmann wrote:

> Hey all,
>
> Thanks for your help. Your answers solved the problem I posted, and that is
> just when I noticed that I had misspecified the problem ;)
> My problem is to split German texts into sentences. Unfortunately, I
> haven't found an R package that does this kind of text splitting for German,
> so I am trying it "manually".
>
> Just using the dot as a separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>
> Here I want the algorithm to split the string only at the positions where
> the dot is not preceded by a digit. The R snippets posted pick out "1." and
> "19."
>
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>> gregexpr('(?<=[0-9])[.]',txt, perl=T)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
>
> But I just need it the other way round. So I tried:
>
>> strsplit(txt, "[[:alpha:]]\\." , perl=T)
> [[1]]
> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
>
> But this erases the last letter from each sentence. Does someone know a
> solution?
>
> TIA
> Mark

<snip>

This is one of those rare? times where it might be nice for strsplit()  
to have an option to retain the split regex at the end of each parsed  
segment, rather than removing it.

There may be a better way, but trying to both avoid a loop over vector  
indices and trying to stay with R functions that use .Internal() for  
speed, you may be able to use something like this:

 > strsplit(gsub("([[:alpha:]]\\.)", "\\1*", txt), "\\* *")
[[1]]
[1] "One January 1. I saw Rick."      "He was born in the 19. century."

What I am essentially doing is to add an "*" to the ending of each  
sentence (you can use other characters) such that strsplit() can split  
on that character without affecting the rest of the sentence.  So as  
an intermediate result, you get:

 > gsub("([[:alpha:]]\\.)", "\\1*", txt)
[1] "One January 1. I saw Rick.* He was born in the 19. century.*"

which then makes the strsplit() parsing a bit easier. Since both
strsplit() and gsub() use .Internal()s, hopefully this would still be
reasonably fast. Note that I have strsplit() split on the "*" possibly
followed by one or more " ", which is required for mid-line splits.
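
As a possible alternative to the marker trick, a Perl-style lookbehind lets
strsplit() split on the whitespace after the sentence end while leaving the
letter and the dot in place. This is a sketch rather than a drop-in
replacement: it assumes that sentences are separated by whitespace, so it
will not split where the next sentence follows the dot directly.

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Split on whitespace that follows a letter plus a full stop.
# (?<=...) is a PCRE lookbehind: it has to match, but it is not consumed,
# so the letter and the dot stay attached to the sentence.
res <- strsplit(txt, "(?<=[[:alpha:]]\\.)\\s+", perl = TRUE)[[1]]
print(res)
# [1] "One January 1. I saw Rick."      "He was born in the 19. century."
```

Because no marker character is inserted, a literal "*" in the text cannot
cause an unwanted split here.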

HTH,

Marc Schwartz