[R] gsub: replacing a.*a if no occurrence of b in .*

Duncan Temple Lang duncan at wald.ucdavis.edu
Sat Feb 24 19:01:19 CET 2007


Ulrich Keller wrote:
> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
> 
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
> 
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
> 
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
> 
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in ".*".
> 
> Thanks in advance for your ideas,
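
For the regular expression itself, one possible approach is a negative
lookahead with perl = TRUE. The following is only a sketch: it assumes the
literal tag name "tag" from the example, and the variable names are
illustrative. It matches a closing tag, then any run of characters in which
no "<tag>" begins, then another closing tag, and keeps only the first
closing tag.

```r
# Example input from the question: stray text and repeated closing tags.
x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

# "(?:(?!<tag>).)*" consumes one character at a time, but only while the
# text at that position does not start a new "<tag>".
pattern <- "(</tag>)(?:(?!<tag>).)*</tag>"

# Keep the first closing tag (group 1), drop everything else that matched.
cleaned <- gsub(pattern, "\\1", x, perl = TRUE)
print(cleaned)
```

Note that "." does not match a newline here, so input spanning several
lines would need "(?s)" at the start of the pattern. The pattern would also
damage legitimately nested elements of the same name, such as
<tag><tag>a</tag></tag>, so it is only suitable if that cannot occur.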


Assuming you cannot get the mechanism that generates the XML fixed, you
might try htmlTreeParse() instead of xmlTreeParse(), which really expects
well-formed XML.
While the name suggests it is only for HTML, it is really a "relaxed"
XML parser that is capable of handling malformed XML.  Malformed markup
is typical of HTML, hence the name.
Of course, since the XML is malformed, the results will be hard to predict,
as it is hard to make sense of "non-sense".
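
As a rough sketch of what that might look like on the example from the
question (assuming the XML package is installed; the variable names are
illustrative, and what the relaxed parser does with the stray text and
closing tags should be checked by inspecting the result):

```r
# Malformed example input from the question.
txt <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

if (requireNamespace("XML", quietly = TRUE)) {
  # asText = TRUE: 'txt' is the content itself, not a file name.
  # useInternalNodes = TRUE: return a document that XPath can query.
  doc <- XML::htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

  # Pull out the text of every <tag> element the parser recovered.
  vals <- XML::xpathSApply(doc, "//tag", XML::xmlValue)
  print(vals)
}
```

Because the HTML parser wraps the content in its own html and body nodes,
the query uses "//tag" rather than a path from the root.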


If xmlTreeParse() is actually causing R to exit (i.e. what some people
refer to as crashing), then, as Jeff (Horner) said, we would like to
fix that. We will need the actual text/file passed to
xmlTreeParse(), the versions of the operating system, R and the XML
package, and any locale information.  However, if by crashing you mean
that it generates an error, then that is expected on malformed XML inputs.
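
A sketch of how that information could be collected, and how an ordinary
error can be told apart from a crash (the input string below is a made-up
malformed example, not the actual file; the XML package is assumed to be
installed):

```r
# Information to include in a report.
print(sessionInfo())     # R version, operating system, attached packages
print(Sys.getlocale())   # locale settings

if (requireNamespace("XML", quietly = TRUE)) {
  print(packageVersion("XML"))

  bad <- "<tag>value1</tag>some garbage</tag></tag>"  # malformed on purpose

  # An ordinary parse error is caught here and returned as a condition
  # object. A real crash would terminate the R process before tryCatch()
  # could return anything.
  res <- tryCatch(
    XML::xmlTreeParse(bad, asText = TRUE),
    error = function(e) e
  )
  print(class(res))
}
```

If R is still running after the call and res holds a condition, the parser
reported an error rather than crashing.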

HTH,
 D.

> 
> Uli
> 

-- 
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA


