[R] parse an HTML page with verbose error message (using XML)

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Mar 12 05:43:40 CET 2010


Hi Yihui

  It took me a moment to see the error messages, as the latest
development version of the XML package suppresses them by default
for htmlParse().

 You can provide your own function via the error parameter.
If you just want to see more detailed error messages on the console,
you can use a function like the following:


fullInfoErrorHandler =
function(msg, code, domain, line, col, level, file)
{
   # level tells how significant the error is
   #   These are 0, 1, 2, 3 for NONE, WARNING, ERROR, FATAL
   #     meaning no error, simple warning, recoverable error and fatal/unrecoverable error.
   #  See XML:::xmlErrorLevel
   #
   # code is an error code, See the values in XML:::xmlParserErrors
   #  XML_HTML_UNKNOWN_TAG, XML_ERR_DOCUMENT_EMPTY
   #
   # domain tells what part of the library raised this error.
   #  See XML:::xmlErrorDomain

  codeMsg = switch(level, "warning", "recoverable error", "fatal error")
  cat("There was a", codeMsg, "in the", file, "at line", line, "column",
         col, "\n", msg, "\n")
}

doc = htmlParse("~/htmlErrors.html", error = fullInfoErrorHandler)

And of course you can mimic xmlErrorCumulator() to form a closure that
collects the different details of each message into an object.  If you
look in the error.R and xmlErrorEnums.R files within the R code of the
XML package, you'll find some additional functions that give us further
support for working with errors in the XML/HTML parsers.
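
As a rough illustration of such a closure, here is a minimal sketch. The
names makeErrorCollector, handler and errors are hypothetical (they are
not part of the XML package), and the check for an empty msg is an
assumption based on how xmlErrorCumulator() treats a zero-length message.

```r
# Hypothetical helper, loosely modelled on xmlErrorCumulator():
# returns a handler plus an accessor for everything the handler has seen.
makeErrorCollector =
function()
{
  errors = list()   # one element per message, filled in by handler()

  handler = function(msg, code = NA, domain = NA, line = NA, col = NA,
                     level = NA, file = NA)
  {
    # Assumption: the handler may be called with a zero-length message
    # as an end-of-parse signal; there is nothing to record in that case.
    if(length(msg) == 0)
      return(invisible(NULL))

    # Append to the list in the enclosing environment, dropping any
    # trailing newline from the message.
    errors[[length(errors) + 1L]] <<-
      list(msg = sub("\n$", "", msg), code = code, domain = domain,
           line = line, col = col, level = level, file = file)
    invisible(NULL)
  }

  list(handler = handler, errors = function() errors)
}
```

You would then pass collector$handler as the error argument, e.g.
collector = makeErrorCollector(); doc = htmlParse("~/htmlErrors.html",
error = collector$handler), and inspect collector$errors() afterwards.
Parsing continues, and the details of each message remain available
as a list.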

 Best,
   D.

Yihui Xie wrote:
> I'm using the function htmlParse() in the XML package, and I need a
> little bit of help on error handling while parsing an HTML page. So far
> I can use either the default way:
> 
> # error = xmlErrorCumulator(), by default
> library(XML)
> doc = htmlParse("http://www.public.iastate.edu/~pdixon/stat500/")
> # the error message is:
> # htmlParseStartTag: invalid element name
> 
> or the tryCatch() approach:
> 
> # error = NULL, errors to be caught by tryCatch()
> tryCatch({
>     doc = htmlParse("http://www.public.iastate.edu/~pdixon/stat500/",
>         error = NULL)
> }, XMLError = function(e) {
>     cat("There was an error in the XML at line", e$line, "column",
>         e$col, "\n", e$message, "\n")
> })
> # verbose error message as:
> # There was an error in the XML at line 90 column 2
> # htmlParseStartTag: invalid element name
> 
> I wish to get the verbose error messages without really stopping the
> parsing process; the first approach cannot return detailed error
> messages, while the second one will stop the program...
> 
> Thanks!
> 
> Regards,
> Yihui
> --
> Yihui Xie <xieyihui at gmail.com>
> Phone: 515-294-6609 Web: http://yihui.name
> Department of Statistics, Iowa State University
> 3211 Snedecor Hall, Ames, IA
> 


