[R] regex - optional part isn't considered in replacement with gsub

Stefan Evert stefanML at collocations.de
Tue Aug 29 18:54:46 CEST 2017


> On 27 Aug 2017, at 18:18, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
> 
> 3.- If I make the first 2 letters optional with:
> 
> ecommerce$sku <-
> gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> ecommerce$producto)
> 
> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").

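Quantifiers in regular expressions are greedy by default and matching proceeds from left to right, i.e. the first (.*) will consume as many characters as possible and give back only as many as the rest of the pattern needs in order to match. Because the first two letters are optional in the following subexpression, the engine has no reason to give them back, so "LE" stays in the first group and only "32S5970" ends up in \\2.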

If you make the first group non-greedy, (.*?), and write the optional letters as [a-zA-Z]{0,2}, this works for me:

	ecommerce$sku <- gsub("(.*?)([a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)
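To see the difference between the greedy and the non-greedy first group side by side, here is a small sketch. The two product strings are made up for illustration (the actual contents of ecommerce$producto are not shown in this message), and perl = TRUE is an assumption added here so that matching follows Perl-style backtracking rules; the call above uses R's default regex engine.

```r
# Hypothetical product descriptions, not the poster's real data
producto <- c("LED TV LG 49MU6300 4K", "Samsung LE32S5970 Full HD")

# The SKU subexpression from the message, with up to two optional leading letters
sku_part <- "([a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})"

# Greedy first group: (.*) keeps as much as it can, including the optional letters
greedy <- gsub(paste0("(.*)", sku_part, "(.*)"), "\\2", producto, perl = TRUE)

# Non-greedy first group: (.*?) stops at the earliest point where the SKU can match
lazy <- gsub(paste0("(.*?)", sku_part, "(.*)"), "\\2", producto, perl = TRUE)

print(greedy)  # "49MU6300" "32S5970"
print(lazy)    # "49MU6300" "LE32S5970"
```

Note that gsub() returns the input unchanged when the pattern does not match at all, so rows without a SKU would keep their full description.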

But as others have pointed out, you might want to explore more robust approaches (take a look at \\b to match a word boundary, for instance).
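One possible way to use \\b, sketched under assumptions: instead of replacing everything around the SKU, anchor the SKU pattern on word boundaries and extract the match directly with regexpr() and regmatches(). The sample strings are hypothetical, perl = TRUE is a choice made for this sketch, and returning NA for rows without a match is a design decision of the example rather than something from the thread.

```r
# Hypothetical product descriptions; the third one has no SKU
producto <- c("LED TV LG 49MU6300 4K",
              "Samsung LE32S5970 Full HD",
              "No model number here")

# Same SKU subexpression, delimited by word boundaries on both sides
pattern <- "\\b[a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4}\\b"

# regexpr() gives the position of the first match per element, or -1 if none
m <- regexpr(pattern, producto, perl = TRUE)

# Start with NA everywhere, then fill in the elements that matched
sku <- rep(NA_character_, length(producto))
sku[m != -1] <- regmatches(producto, m)

print(sku)  # "49MU6300" "LE32S5970" NA
```

Because the match has to start at a word boundary, no leading (.*) group is needed, so the greedy versus non-greedy question does not arise here.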

Best,
Stefan


