[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 16:00:58 CEST 2009

Marc Schwartz wrote:
> On Jun 9, 2009, at 6:44 AM, Mark Heckmann wrote:
>
>> Hey all,
>>
>> Thanks for your help. Your answers solved the problem I posted, and
>> that is just when I noticed that I had misspecified the problem ;)
>> My problem is to split German text into sentences. Unfortunately I
>> haven't found an R package that does this kind of text separation
>> for German, so I am trying it "manually".
>>
>> Just using the dot as a separator fails in cases like:
>> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>>
>> Here I want the algorithm to split the string only at the positions
>> where the dot is not preceded by a digit. The R snippets posted pick
>> out "1." and "19.":
>>
>> > txt <- "One January 1. I saw Rick. He was born in the 19. century."
>> > gregexpr('(?<=[0-9])[.]', txt, perl=TRUE)
>> [[1]]
>> [1] 14 49
>> attr(,"match.length")
>> [1] 1 1
>>
>> But I just need it the other way round. So I tried:
>>
>> > strsplit(txt, "[[:alpha:]]\\.", perl=TRUE)
>> [[1]]
>> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
>>
>> But this erases the last letter from each sentence. Does someone know a
>> solution?

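Try a negative lookbehind. It asserts that the character before the dot
is not a digit, but does not consume that character, so no letter is lost: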

    strsplit(txt, '(?<![0-9])[.]', perl=TRUE)
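For reference, here is a minimal sketch of what that call returns on the example string, plus one possible variant that keeps the sentence-final dots. The variable names `parts`, `sentences` and `with_dots` are only illustrative, and the second pattern is an assumption of mine rather than part of the suggestion above; it splits on the whitespace that follows a non-digit and a dot.

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Split at every dot that is NOT preceded by a digit (negative lookbehind).
# The dot itself is consumed; the digit check is zero-width.
parts <- strsplit(txt, "(?<![0-9])[.]", perl = TRUE)[[1]]

# Remove the leading blank that remains at the start of the second piece.
sentences <- sub("^\\s+", "", parts)
print(sentences)
# [1] "One January 1. I saw Rick"      "He was born in the 19. century"

# Variant (assumption): keep the final dots by splitting on the whitespace
# that follows "<non-digit><dot>" instead of splitting on the dot.
with_dots <- strsplit(txt, "(?<=[^0-9][.])\\s+", perl = TRUE)[[1]]
print(with_dots)
# [1] "One January 1. I saw Rick."      "He was born in the 19. century."
```

Neither pattern can tell an ordinal such as "19." from a sentence that really ends in a number, so text like "It happened in 1999. Then ..." would not be split there.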

vQ