[R] Need some help with regular expression

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Thu Dec 15 19:55:11 CET 2016


Actually, the issue of mail formatting is discussed in the Posting Guide, but some key points are:

1) Only a very few types of file attachments are allowed.

2) The mailing list automatically strips HTML from email, which often leaves us looking at something full of wacko characters and with no formatting. As the PG says, this is a plain text mailing list, so don't send HTML email to it. Fortunately R is plain text, so if you focus on short, reproducible examples then your meaning will come through just fine.
-- 
Sent from my phone. Please excuse my brevity.

On December 15, 2016 8:46:55 AM PST, Steven Nagy <nstefi at gmail.com> wrote:
>I tried to send this email, but it didn't go through. I guess pictures are
>not allowed in HTML-formatted emails?
>I'm re-sending it without the picture, with just a comment there instead as
>a placeholder.
>
>Thanks,
>Steven
>
>
>From: Steven Nagy [mailto:nstefi at gmail.com] 
>Sent: Monday, December 12, 2016 10:50 PM
>To: 'Bert Gunter' <bgunter.4567 at gmail.com>
>Cc: 'R-help' <r-help at r-project.org>
>Subject: RE: [R] Need some help with regular expression
>
>Hi Bert and all,
>
>Sorry I was too busy at work and didn't have much time to continue this
>until now.
>So I studied "?regexp" and I can understand your regular expression now:
>sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x)
>
>But I also wanted to split these results into 2 columns. Your previous
>command gives me this result:
>[1] "NMA -> STU" "STU -> REG" "-> STU"
>
>and I wanted to further split them up to show this:
>From	To
>NMA	STU
>STU	REG
>	STU
>
>I still don’t quite understand backreferences. How could I have 2
>backreferences, one for the left side of the “->” sign and one for the
>right side?
>
>So it seems like I need to apply the “sub” function twice, similar to how I
>used the “strapply” function twice in my original post:
>strapply(strapply(a, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>         perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
>
>Or maybe there is a simpler way, using only 1 “sub” call and
>2 backreferences?
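
A hedged sketch of one way to do this in base R: a single pattern with two capture groups, where "\\1" in the replacement refers to the first parenthesized group and "\\2" to the second. The vector `pieces` simply retypes the result quoted above; the names `pieces`, `pat` and `changes` are made up for illustration.

```r
# The three strings returned by the earlier sub() call in this thread.
pieces <- c("NMA -> STU", "STU -> REG", "-> STU")

# Group 1 captures what is left of "->", group 2 what is right of it.
# Either side may be empty because "*" allows zero characters.
pat <- "^([[:alnum:]]*) *-> *([[:alnum:]]*)$"

changes <- data.frame(
  From = sub(pat, "\\1", pieces),  # keep only group 1
  To   = sub(pat, "\\2", pieces),  # keep only group 2
  stringsAsFactors = FALSE
)
changes
```

A single sub() call returns one string per input, so the same pattern is applied twice here, once per backreference.
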
>
>Also, I’m not sure what to do after I get the data. How could I represent
>the member type changes graphically? We need to analyze the behavior of
>switching from STU to another type or from another type to STU.
>Google Analytics has a nice chart under Behavior Flow, or Users Flow, and it
>looks like this:
><here was my picture from Google Analytics - it's from Behavior Flow or
>Users Flow, showing flows from one category to another one and further to
>another one>
>
>
>
>Is there any graphical representation in R that is similar to this?
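
As a possible first step before any plotting (a sketch, not a recommendation of a particular chart): the From/To pairs can be counted with base R's table(), which gives the size of each flow. The data frame below just retypes the three example rows from this message, and "(blank)" is an arbitrary label chosen here for an empty source type.

```r
# Example rows taken from the From/To listing above.
changes <- data.frame(
  From = c("NMA", "STU", ""),
  To   = c("STU", "REG", "STU"),
  stringsAsFactors = FALSE
)

# Give the empty source type a visible label (the label is an assumption).
changes$From[changes$From == ""] <- "(blank)"

# Cross-tabulate: each cell counts how often one type changed into another.
flows <- table(From = changes$From, To = changes$To)
flows
```

Charts of the Google Analytics kind are usually called Sankey or alluvial diagrams. Contributed packages for them exist on CRAN, and a count table like this is the kind of input they typically need.
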
>
>Thanks a lot,
>Steven
>
>-----Original Message-----
>From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
>Sent: Sunday, November 20, 2016 10:05 PM
>To: Aliz Csonka <lyzae.ro at gmail.com>
>Cc: R-help <r-help at r-project.org>
>Subject: Re: [R] Need some help with regular expression
>
>Although others may respond, I think you will do much better studying
>?regexp, which will answer all your questions. I believe the effort you will
>make figuring it out will pay dividends for your future R/regular expression
>usage that you cannot gain from my direct explanation.
>
>Good luck.
>
>Best,
>Bert
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along and
>sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>On Sun, Nov 20, 2016 at 6:40 PM, Steven Nagy <nstefi at gmail.com> wrote:
>> Thanks a lot Bert. That's amazing. I am very new to both R and regular
>> expressions. I don't really understand the regular expression that you
>> used below.
>> And it looks like I don't even need any special library, like
>> "gsubfn" for the strapply function.
>> I was trying to use the regexr.com website to analyze your regular
>> expression, but it doesn't seem to match any text there.
>> Can you explain the regular expression that you used?
>> ".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*"
>> So the dot at the front means any character, and the star after it
>> means that it can repeat 0 or more times, right?
>> Then it is followed by a colon character ":" and a space, and what is the
>> next star after that? Does it mean that the sequence before it can again
>> repeat 0 or more times?
>> And what are the double square brackets?
>> Is ":alnum:" specific to R? I don't think "regexr.com" understands
>> that. Or maybe that site is for regular expressions in JavaScript, and
>> the syntax is different in R?
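
A small sketch that may help with these questions, using only base R. In a pattern, a star applies to the single item just before it, so " *" means "zero or more spaces". The outer brackets in [[:alnum:]] form an ordinary bracket expression, and [:alnum:] inside it is a POSIX character class (letters and digits) described in ?regex. JavaScript regular expressions do not support POSIX classes, which would explain why a JavaScript-based tester does not match. The input `x` is a shortened version of the example in this thread, and the variable names are made up for illustration.

```r
# Shortened version of the first example string from this thread.
x <- "Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1"

# [[:alnum:]]+ matches runs of letters/digits only.
is_alnum <- grepl("^[[:alnum:]]+$", c("STU", "L5N1H9", "-> STU"))
is_alnum

# " *" is a literal space followed by a star: zero or more spaces.
spaces <- sub("a *b", "X", c("ab", "a b", "a    b"))
spaces

# The full pattern: everything up to a colon, optional spaces, then the
# captured "something -> STU" or "STU -> something", then the rest.
pat <- ".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*"
res <- sub(pat, "\\1", x)
res
```
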
>>
>> Thank you,
>> Steven
>>
>> -----Original Message-----
>> From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
>> Sent: Sunday, November 20, 2016 2:15 PM
>> To: Steven Nagy <nstefi at gmail.com>
>> Cc: R-help <r-help at r-project.org>
>> Subject: Re: [R] Need some help with regular expression
>>
>> If I understand you correctly, I think you are making it more complex
>> than necessary. Using your example (thanks!!), the following should
>> get you started:
>>
>>
>>> x<- c("Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN ->
>>> ; MEMBER_STATUS:  -> N", "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1
>>> ->","Name.MEMBER_TYPE: -> STU")
>>>
>>> x
>> [1] "Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN ->
>> ; MEMBER_STATUS:  -> N"
>>
>> [2] "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1 ->"
>> [3] "Name.MEMBER_TYPE: -> STU"
>>>
>>> sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x)
>> [1] "NMA -> STU" "STU -> REG" "-> STU"
>>
>>
>> I am sure that you can get things to the form you desire in one go
>> with some fiddling of the above, but it was easier for me to write the
>> regex to pick out the pieces you wanted and leave the rest to you.
>> Others may have slicker ways to do it, of course.
>>
>> HTH
>>
>> Cheers,
>> Bert
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Sat, Nov 19, 2016 at 8:06 PM, Steven Nagy <nstefi at gmail.com> wrote:
>>> I tried out a regular expression on this website:
>>>
>>> http://regexr.com/3en1m
>>>
>>>
>>>
>>> So the input text is:
>>>
>>> "Name.MEMBER_TYPE:  -> STU"
>>>
>>>
>>>
>>> The regular expression is: ((?:\w+|\s) -> STU|STU -> (?:\w+|\s))
>>>
>>> And it returns:
>>>
>>> "  -> STU"
>>>
>>>
>>>
>>> but when I use in R, it doesn't return the same result:
>>>
>>> strapply(c, "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))", c,
>>>          backref = -1, perl = TRUE)
>>>
>>> returns:
>>> "Name.MEMBER_TYPE: -> STU"
>>>
>>>
>>>
>>>
>>>
>>> Here is what I was trying to do:
>>>
>>>
>>>
>>> I need to extract some values from a log table, and I created a 
>>> regular expression that helps me with that.
>>>
>>> The log table has cells with values like:
>>>
>>> a = "Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN ->
>>> ; MEMBER_STATUS:  -> N"
>>>
>>> or
>>> b = "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1 ->"
>>>
>>> so I needed to extract the values that a STU member type is changing
>>> from and to, so I needed NMA, STU in the 1st case or STU, REG in the
>>> 2nd case.
>>>
>>> I came up with this expression which worked in both cases:
>>>
>>> strapply(strapply(a, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>>>          perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
>>>
>>>
>>>
>>> But I had a 3rd case when the source member type was blank:
>>>
>>> c = "Name.MEMBER_TYPE: -> STU"
>>>
>>> and in that case it returned an error:
>>>
>>> strapply(strapply(c, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>>>          perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
>>>
>>> Error: is.character(x) is not TRUE
>>>
>>>
>>>
>>> I found that the error is because this returns NULL:
>>>
>>> strapply(c, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1, perl = TRUE)
>>>
>>>
>>>
>>>
>>>
>>> So I tried to modify the regular expression to match any word or
>>> blank space:
>>>
>>> strapply(c, "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))", c,
>>>          backref = -1, perl = TRUE)
>>>
>>>
>>>
>>> but this returned the whole value of "c":
>>>
>>> "Name.MEMBER_TYPE:  -> STU"
>>>
>>> and I only needed "  -> STU", as it shows on the regexr.com website
>>>
>>>
>>>
>>> Is the result wrong on the regexr.com website, or does strapply return
>>> the wrong result?
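
One way to check the pattern independently of gsubfn is base R's regexpr() and regmatches() with perl = TRUE. This is only a sketch: the names `two_spaces` and `one_space` are made up here, and it assumes the inputs differ only in the number of spaces after the colon (the examples in this thread show both forms).

```r
pat <- "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))"

two_spaces <- "Name.MEMBER_TYPE:  -> STU"
one_space  <- "Name.MEMBER_TYPE: -> STU"

# \\s consumes the first space and the literal space in the pattern
# consumes the second, so the match starts right after the colon.
m2 <- regmatches(two_spaces, regexpr(pat, two_spaces, perl = TRUE))
m2

# With a single space nothing is left for the literal space, so
# regexpr() finds no match and regmatches() returns an empty vector.
m1 <- regmatches(one_space, regexpr(pat, one_space, perl = TRUE))
length(m1)
```

Separately, once a character variable named c exists, passing c as an argument hands over that string rather than the base function c(). That may matter for the third argument of strapply, so a different variable name could be worth trying.
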
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Steven
>>>
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>


