[R] regex - optional part isn't considered in replacement with gsub

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Mon Aug 28 07:37:33 CEST 2017


Omar, please remember that this is R-help, not R-do-my-work-for-me... you have already been given several hints as to how you can refine your patterns yourself. These skills are key to real-world data science, so you need to work at being able to take hints and expand on them if you are to be successful in these kinds of tasks. Also, if you cannot learn to make reproducible examples ([1][2][3]) to illustrate your problems, then we have about reached the limit of our ability to help you.
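
To expand on one of those hints with a minimal sketch (the vector `x` and the simplified pattern `sku` are my own stand-ins built from your two example strings, not your full data): a greedy leading `(.*)` takes as much of the string as it can, so an optional prefix that is allowed to match nothing ends up matching nothing. Making the leading group lazy, which needs `perl = TRUE`, is one way to see the difference:

```r
# Two example strings from the original question (assumed representative).
x <- c("SMART TV UHD 49'' CURVO 49MU6300",
       "SMART TV HD 32'' LE32S5970")

# SKU body with an optional prefix of zero to two letters.
sku <- "[a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4}"

# Greedy leading group: ".*" swallows "LE", leaving the optional prefix empty.
greedy <- sub(paste0("^(.*)(", sku, ")(.*)$"), "\\2", x, perl = TRUE)

# Lazy leading group: ".*?" stops as early as possible, so "LE" is kept.
lazy <- sub(paste0("^(.*?)(", sku, ")(.*)$"), "\\2", x, perl = TRUE)

print(greedy)
print(lazy)
```

This only illustrates the mechanism; it does not address the overlapping patterns Bert pointed out.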

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

[2] http://adv-r.had.co.nz/Reproducibility.html

[3] https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)
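
As a rough illustration of what a small reproducible example could look like (the third row and the helper name `extract_sku` are made up for illustration, and the pattern covers only the two shapes in your first message, not your full list), anchoring on delimiters with `\\b` and pulling out the match with `regexpr()` and `regmatches()` avoids the `(.*)` groups altogether:

```r
# Small self-contained data frame; the third row is a made-up "no SKU" case.
ecommerce <- data.frame(
  a = 1:3,
  producto = c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970",
               "SMART TV HD 32''"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first delimited match, or "no SKU".
extract_sku <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where nothing matches
  out <- rep("no SKU", length(x))
  out[m > 0] <- regmatches(x, m)          # fill in only the matched rows
  out
}

# Word boundaries stand in for the blank / end-of-string delimiters.
pattern <- "\\b[A-Za-z]{0,2}[0-9]{2}[A-Za-z]{1,2}[0-9]{2,4}\\b"

ecommerce$sku <- extract_sku(ecommerce$producto, pattern)
print(ecommerce$sku)
```

Other SKU shapes, including hyphenated ones such as "PG-9021", would each need their own alternative in the pattern, which is why a complete list matters.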
-- 
Sent from my phone. Please excuse my brevity.

On August 27, 2017 10:15:25 PM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>"Please, consider that some SKUs have "-"
>in the middle, for example: "PG-9021".
>
>Then you need to include these in the list of patterns you gave. Try it
>again -- this time with a **complete** list.
>
>-- Bert
>
>
>
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along
>and sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
>On Sun, Aug 27, 2017 at 10:01 PM, Omar André Gonzáles Díaz <
>oma.gonzales at gmail.com> wrote:
>
>> Hi Bert,
>>
>> I would say that the delimiter is "blank"; every other row with "-" as
>> delimiter should be ignored. Please consider that some SKUs have "-"
>> in the middle, for example: "PG-9021".
>>
>> As for the <end of character string>, it's now corrected. There
>> shouldn't be any case of this (if there are, just ignore them).
>>
>> I've tried to apply different gsub operations to capture different
>> cases, for example:
>>
>> ecommerce$sku <-
>> gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>> ecommerce$producto)
>>
>>
>> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
>> "\\2", ecommerce$sku)
>>
>>
>> ecommerce$sku <-
>> gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{1}[a-zA-Z]{1})(.*)", "\\2",
>> ecommerce$sku)
>>
>> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2})(.*)",
>> "\\2", ecommerce$sku)
>>
>>
>> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{3,4})(.*)", "\\2",
>> ecommerce$sku)
>>
>>
>> I don't know if that is the best approach, but I couldn't capture the
>> case in the initial question. And as I've said, the important thing is
>> to capture as many SKUs as possible.
>>
>> Thank you for your time, Sir.
>>
>>
>>
>>
>> 2017-08-27 18:01 GMT-05:00 Bert Gunter <bgunter.4567 at gmail.com>:
>> > Omar:
>> >
>> > I don't think this can work. For example number-letter patterns 4),
>> > 5), and 6) would all be matched by pattern 6).
>> >
>> > As Jeff indicated, you need to provide the delimiters -- what
>> > characters come before and after the SKU patterns -- to be able to
>> > recognize them. In a quick look at the text file you attached, the
>> > delimiters appeared to be either "-" or " " (blank) and perhaps <end
>> > of character string>. If that is correct or if you can tell us how to
>> > make it correct, then it's straightforward to proceed. Otherwise, I am
>> > unable to help. Maybe someone else can.
>> >
>> > Cheers,
>> > Bert
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz
>> > <oma.gonzales at gmail.com> wrote:
>> >> Hi Jeff, Bert, thank you for your input.
>> >>
>> >> I'm attaching a sample of the data, feel free to explore it.
>> >>
>> >> As I said, I need to extract the SKUs of the products (a key that
>> >> identifies every product). Not every product (row) has a SKU; in this
>> >> case "no SKU" should be the output.
>> >>
>> >> I've identified these patterns so far:
>> >>
>> >> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
>> >> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
>> >> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
>> >> 4.- LH5000: 2 letters, 4 numbers.
>> >> 5.- B8500: 1 letter, 4 numbers.
>> >> 6.- E310: 1 letter, 3 numbers.
>> >> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
>> >>
>> >>
>> >> I think those cover the majority of SKUs. So I would appreciate
>> >> guidance on how to extract all those different patterns.
>> >>
>> >> Related, but not the question asked: the idea is that after extracting
>> >> the SKUs, there should be SKUs repeated across the different ecommerce
>> >> sites. Those SKUs would permit us to compare the products and their
>> >> prices.
>> >>
>> >> Thank you in advance.
>> >>
>> >> 2017-08-27 12:10 GMT-05:00 Bert Gunter <bgunter.4567 at gmail.com>:
>> >>> You may have to provide us more detail on **exactly** the sorts of
>> >>> patterns you wish to "capture" -- including exactly what you mean by
>> >>> "capture" (what value do you wish to return?) -- as the "obvious"
>> >>> answer is probably not sufficient:
>> >>>
>> >>> ## using your example -- thank you
>> >>>
>> >>> > gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
>> >>> [1] "49MU6300"  "LE32S5970"
>> >>>
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>> Bert Gunter
>> >>>
>> >>> "The trouble with having an open mind is that people keep coming along
>> >>> and sticking things into it."
>> >>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>> >>>
>> >>>
>> >>> On Sun, Aug 27, 2017 at 9:18 AM, Omar André Gonzáles Díaz
>> >>> <oma.gonzales at gmail.com> wrote:
>> >>>> Hello, I need some help with regex.
>> >>>>
>> >>>> I have these two sentences. I need to extract both "49MU6300" and
>> >>>> "LE32S5970" and put them in a new column "SKU".
>> >>>>
>> >>>> A) SMART TV UHD 49'' CURVO 49MU6300
>> >>>> B) SMART TV HD 32'' LE32S5970
>> >>>>
>> >>>> DataFrame for testing:
>> >>>>
>> >>>> ecommerce <- data.frame(a = c(1,2),
>> >>>>                         producto = c("SMART TV UHD 49'' CURVO 49MU6300",
>> >>>>                                      "SMART TV HD 32'' LE32S5970"))
>> >>>>
>> >>>>
>> >>>> I'm using gsub like this:
>> >>>>
>> >>>> 1.- This would capture A as intended but only "32S5970" from B
>> >>>> (missing "LE").
>> >>>>
>> >>>> ecommerce$sku <-
>> >>>>   gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>> >>>>        ecommerce$producto)
>> >>>>
>> >>>>
>> >>>> 2.- This would capture "LE32S5970" but not "49MU6300".
>> >>>>
>> >>>> ecommerce$sku <-
>> >>>>   gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>> >>>>        ecommerce$producto)
>> >>>>
>> >>>>
>> >>>> 3.- If I make the first 2 letters optional with:
>> >>>>
>> >>>> ecommerce$sku <-
>> >>>>   gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>> >>>>        ecommerce$producto)
>> >>>>
>> >>>>
>> >>>> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").
>> >>>>
>> >>>>
>> >>>> What should I do? How would you approach it?
>> >>>>
>> >>>>         [[alternative HTML version deleted]]
>> >>>>
>> >>>> ______________________________________________
>> >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>> PLEASE do read the posting guide
>> >>>> http://www.R-project.org/posting-guide.html
>> >>>> and provide commented, minimal, self-contained, reproducible code.
>>


