[R] regex - optional part isn't considered in replacement with gsub

Bert Gunter bgunter.4567 at gmail.com
Mon Aug 28 07:15:25 CEST 2017


"Please, consider that some SKUs have "-"
in the middle, for example: "PG-9021".

Then you need to include these in the list of patterns you gave. Try it
again -- this time with a **complete** list.
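
For what it's worth, here is a minimal sketch of how a delimiter-based extraction might look once the list is complete. It assumes every SKU is surrounded by a blank or by the start/end of the string, and that SKUs are upper-case. Only two shapes are listed, and the "IMPRESORA PG-9021" and "CABLE HDMI" rows are made up for illustration, not taken from the attached file:

```r
# Illustrative rows; only the first two come from this thread.
producto <- c("SMART TV UHD 49'' CURVO 49MU6300",
              "SMART TV HD 32'' LE32S5970",
              "IMPRESORA PG-9021",
              "CABLE HDMI")

# One alternative per SKU shape (an assumed, incomplete list).
shapes <- c("[A-Z]{0,2}[0-9]{2}[A-Z]{1,2}[0-9]{2,4}",  # 49MU6300, LE32S5970
            "[A-Z]{2}-[0-9]{4}")                        # PG-9021

# Require a blank or the start/end of the string on both sides of the SKU.
pattern <- paste0("(^| )(", paste(shapes, collapse = "|"), ")( |$)")

hit <- regexpr(pattern, producto)          # -1 where nothing matches
sku <- rep("no SKU", length(producto))
sku[hit > 0] <- trimws(regmatches(producto, hit))  # drop the delimiting blanks

print(sku)
```

Each further shape would go into the "shapes" vector as one more alternative; rows that match none of them keep "no SKU".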

-- Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Aug 27, 2017 at 10:01 PM, Omar André Gonzáles Díaz <
oma.gonzales at gmail.com> wrote:

> Hi Bert,
>
> I would say that the delimiter is "blank"; every other row with "-" as
> delimiter should be ignored. Please, consider that some SKUs have "-"
> in the middle, for example: "PG-9021".
>
> As for the <end of character string>, it's now corrected. There
> shouldn't be any case of this (if there are, just ignore them).
>
> I've tried to apply different gsub operations to capture different
> cases, for example:
>
> ecommerce$sku <-
> gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> ecommerce$producto)
>
>
> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
> "\\2", ecommerce$sku)
>
>
> ecommerce$sku <-
> gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{1}[a-zA-Z]{1})(.*)", "\\2",
> ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2})(.*)",
> "\\2", ecommerce$sku)
>
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{3,4})(.*)", "\\2",
> ecommerce$sku)
>
>
> I don't know if that is the best approach, but I couldn't capture the
> case in the initial question. And as I've said, the important thing is
> to capture as many SKUs as possible.
>
> Thank you for your time, Sir.
>
>
>
>
> 2017-08-27 18:01 GMT-05:00 Bert Gunter <bgunter.4567 at gmail.com>:
> > Omar:
> >
> > I don't think this can work. For example number-letter patterns 4),
> > 5), and 6) would all be matched by pattern 6).
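
A small sketch of that overlap, using the example SKUs given for patterns 4), 5) and 6) in the list quoted further down. The variable names are only for illustration:

```r
skus <- c("LH5000", "B8500", "E310")   # examples for patterns 4), 5), 6)
p6 <- "[a-zA-Z][0-9]{3}"               # pattern 6): 1 letter, 3 numbers

# Unanchored, pattern 6) is found inside all three strings.
unanchored <- grepl(p6, skus)

# Requiring the whole string to match separates them again.
anchored <- grepl(paste0("^", p6, "$"), skus)

print(unanchored)  # TRUE TRUE TRUE
print(anchored)    # FALSE FALSE TRUE
```

In a full product description the "^" and "$" would have to be replaced by whatever characters actually delimit the SKU, which is why the delimiters matter.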
> >
> > As Jeff indicated, you need to provide the delimiters -- what
> > characters come before and after the SKU patterns -- to be able to
> > recognize them. In a quick look at the text file you attached, the
> > delimiters appeared to be either "-" or " " (blank) and perhaps <end
> > of character string>. If that is correct or if you can tell us how to
> > make it correct, then it's straightforward to proceed. Otherwise, I am
> > unable to help. Maybe someone else can.
> >
> > Cheers,
> > Bert
> >
> > On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz
> > <oma.gonzales at gmail.com> wrote:
> >> Hi Jeff, Bert, thank you for your input.
> >>
> >> I'm attaching a sample of the data, feel free to explore it.
> >>
> >> As I said, I need to extract the SKUs of the products (a key that
> >> identifies every product). Not every product (row) has a SKU; in that
> >> case "no SKU" should be the output.
> >>
> >> I've identified these patterns so far:
> >>
> >> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
> >> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
> >> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
> >> 4.- LH5000: 2 letters, 4 numbers.
> >> 5.- B8500: 1 letter, 4 numbers.
> >> 6.- E310: 1 letter, 3 numbers.
> >> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
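
One way to write the seven shapes above as regular expressions, as a sketch only. It assumes upper-case letters and checks each example SKU on its own, anchored with "^" and "$", rather than inside a product description:

```r
# Example SKUs (names) and their shapes (values), transcribed from the list above.
shapes <- c("75Q8C"     = "[0-9]{2}[A-Z][0-9][A-Z]",
            "OLED65E7P" = "[A-Z]{4}[0-9]{2}[A-Z][0-9][A-Z]",
            "MT48AF"    = "[A-Z]{2}[0-9]{2}[A-Z]{2}",
            "LH5000"    = "[A-Z]{2}[0-9]{4}",
            "B8500"     = "[A-Z][0-9]{4}",
            "E310"      = "[A-Z][0-9]{3}",
            "X541UJ"    = "[A-Z][0-9]{3}[A-Z]{2}")

# Anchor each shape so it has to match the whole string.
anchored <- paste0("^", shapes, "$")

# Rows are example SKUs, columns are shapes 1 to 7.
hits <- sapply(anchored, grepl, x = names(shapes))
dimnames(hits) <- list(sku = names(shapes),
                       shape = as.character(seq_along(shapes)))

print(hits)
```

When anchored, each example matches only its own shape; without anchors several of the shapes overlap.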
> >>
> >>
> >> I think those cover the majority of SKUs, so I would appreciate some
> >> guidance on how to extract all those different patterns.
> >>
> >> Related, but not the question asked: the idea is that, after extracting
> >> the SKUs, some of them should be repeated across the different ecommerce
> >> sites. Those SKUs would let us compare the products and their prices.
> >>
> >>
> >> Thank you in advance.
> >>
> >> 2017-08-27 12:10 GMT-05:00 Bert Gunter <bgunter.4567 at gmail.com>:
> >>> You may have to provide us more detail on **exactly** the sorts of
> >>> patterns you wish to "capture" -- including exactly what you mean by
> >>> "capture" (what value do you wish to return?) -- as the "obvious"
> >>> answer is probably not sufficient:
> >>>
> >>> ## using your example -- thank you
> >>>
> >>>> gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
> >>> [1] "49MU6300"  "LE32S5970"
> >>>
> >>>
> >>> Cheers,
> >>> Bert
> >>>
> >>> On Sun, Aug 27, 2017 at 9:18 AM, Omar André Gonzáles Díaz
> >>> <oma.gonzales at gmail.com> wrote:
> >>>> Hello, I need some help with regex.
> >>>>
> >>>> I have these two sentences. I need to extract both "49MU6300" and
> >>>> "LE32S5970" and put them in a new column "SKU".
> >>>>
> >>>> A) SMART TV UHD 49'' CURVO 49MU6300
> >>>> B) SMART TV HD 32'' LE32S5970
> >>>>
> >>>> DataFrame for testing:
> >>>>
> >>>> ecommerce <- data.frame(a = c(1,2),
> >>>>                         producto = c("SMART TV UHD 49'' CURVO 49MU6300",
> >>>>                                      "SMART TV HD 32'' LE32S5970"))
> >>>>
> >>>>
> >>>> I'm using gsub like this:
> >>>>
> >>>> 1.- This would capture A as intended but only "32S5970" from B
> >>>> (missing "LE").
> >>>>
> >>>> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
> >>>> "\\2", ecommerce$producto)
> >>>>
> >>>>
> >>>> 2.- This would capture "LE32S5970" but not "49MU6300".
> >>>>
> >>>> ecommerce$sku <-
> >>>> gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> >>>> ecommerce$producto)
> >>>>
> >>>>
> >>>> 3.- If I make the first 2 letters optional with:
> >>>>
> >>>> ecommerce$sku <-
> >>>> gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> >>>> ecommerce$producto)
> >>>>
> >>>>
> >>>> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").
> >>>>
> >>>>
> >>>> What should I do? How would you approach it?
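
A sketch of why the optional letters get lost in attempt 3, and of one possible way around it. The leading "(.*)" is greedy, so it takes "LE" for itself and the optional part then matches nothing. The example below writes the optional letters as {0,2} instead of ?{2} and uses perl = TRUE so that a lazy "(.*?)" is available; both choices are assumptions made for this illustration, not the only way to do it:

```r
producto <- c("SMART TV UHD 49'' CURVO 49MU6300",
              "SMART TV HD 32'' LE32S5970")

# SKU shape with up to two optional leading letters.
sku <- "[a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4}"

# Greedy prefix: "(.*)" keeps "LE", so the optional letters match nothing.
greedy <- sub(paste0("(.*)(", sku, ")(.*)"), "\\2", producto, perl = TRUE)

# Lazy prefix: "(.*?)" stops as early as it can, leaving "LE" to the SKU group.
lazy <- sub(paste0("(.*?)(", sku, ")(.*)"), "\\2", producto, perl = TRUE)

print(greedy)  # "49MU6300" "32S5970"
print(lazy)    # "49MU6300" "LE32S5970"
```

Requiring a blank immediately before the SKU group would be another option, if the data always has one there.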
