[R] gsub: replacing a.*a if no occurrence of b in .*

Gabor Grothendieck ggrothendieck at gmail.com
Sat Feb 24 17:07:37 CET 2007


I assume <tag> is known.

This removes the leading </tag>.* from any occurrence of
</tag>.*</tag> where .* does not contain <tag> or </tag>,
leaving a single </tag>.

The regular expression, re, matches </tag>, then does a non-greedy
match (?U) of anything up to the next </tag>, but uses a zero
width lookahead subexpression (?=...) for that second </tag>
so that it is not consumed and can begin the next match.  gsubfn
in package gsubfn is like the usual gsub except that instead of
replacing the match with a string it passes the match
to function f and then replaces the match with the output
of f.  Here f returns the match unchanged if it contains <tag>
and the empty string otherwise.  See the gsubfn home page:
  http://code.google.com/p/gsubfn/
and vignette.
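
To see which pieces the pattern picks out before any replacement
happens, here is a minimal base-R sketch.  It uses gregexpr() and
regmatches() rather than gsubfn, and the names xml_txt, pat and
hits are made up for this illustration only.

```r
# Illustration only: list the substrings the pattern matches.
# xml_txt, pat and hits are hypothetical names, not from the post.
xml_txt <- paste0("<tag>value1</tag><tag>value2</tag>some ",
                  "garbage</tag></tag><tag>value3</tag>")

# </tag>, then a non-greedy run, stopping just before the next </tag>
pat <- "</tag>((?U).*(?=</tag>))"

hits <- regmatches(xml_txt, gregexpr(pat, xml_txt, perl = TRUE))[[1]]
print(hits)

# Pieces without an opening <tag> are the ones to drop
print(grepl("<tag>", hits, fixed = TRUE))
```

Each match starts at a closing tag and stops just short of the next
one, so the closing tag left over by the lookahead is free to start
the following match.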


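library(gsubfn)

# sample input; paste() joins the two pieces with a single space
text <- paste("<tag>value1</tag><tag>value2</tag>some",
"garbage</tag></tag><tag>value3</tag>")

# </tag>, then a non-greedy .* up to (but not including) the next </tag>
re <- "</tag>((?U).*(?=</tag>))"
# keep the match if it contains <tag>, otherwise replace it with ""
f <- function(x) if (regexpr("<tag>", x) > 0) x else ""

# backref = 0 passes the entire match to f
gsubfn(re, f, text, backref = 0, perl = TRUE)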
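
If gsubfn is not available, roughly the same effect might be had
with plain gsub() by moving the "no <tag> in between" test into the
pattern itself.  This is a sketch of an alternative, not the
approach above: it assumes perl = TRUE and uses a negative lookahead
so that the skipped text cannot contain <tag> or </tag>.  The names
xml_txt, pat2 and cleaned are made up for the example.

```r
# Alternative sketch using only base R (hypothetical names).
xml_txt <- paste0("<tag>value1</tag><tag>value2</tag>some ",
                  "garbage</tag></tag><tag>value3</tag>")

# </tag>, then any characters that do not start <tag> or </tag>,
# provided another </tag> follows immediately (lookahead, not consumed)
pat2 <- "</tag>(?:(?!</?tag>).)*(?=</tag>)"

cleaned <- gsub(pat2, "", xml_txt, perl = TRUE)
print(cleaned)
```

Because the lookahead leaves the second </tag> in place, a run of
several stray closing tags collapses down to one in a single pass.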


On 2/24/07, Ulrich Keller <ulrich.keller at emacs.lu> wrote:
> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
>
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
>
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
>
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
>
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in ".*".
>
> Thanks in advance for your ideas,
>
> Uli