[R] XML htmlTreeParse fails with no obvious error

Nicolas Delhomme nicolas.delhomme at plantphys.umu.se
Fri Jun 8 14:34:07 CEST 2012


Hi all,

Sorry for the rather uninformative subject, but the error I get is not very informative either.

When using the XML and RCurl packages to retrieve the content of an HTML page, htmlTreeParse fails and prints the beginning of the HTML in the error message:

Error in htmlTreeParse(getURL(url)) : 
  File   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>Deutsches Krebsforschungszentrum</title>
      <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
      <meta http-equiv="Content-Style-Type" content="text/css" />
      <meta http-equiv="imagetoolbar" content="no" />
      <meta name="MSSmartTagsPreventParsing" content="true" />
      <meta name="revisit-after" content="5 days" />
      <meta name="language" content="de" />
      <meta lang="de" content="" xml:lang="de" name="keywords">
      <meta lang="de" xml:lang="de" name="description" content="Das Deutsche Krebsforschungszentrum hat die Aufgabe, die Mechanismen der Krebsentstehung systematisch zu erforschen und Risikofaktoren für Krebserkrankungen zu erfassen. Aus den Ergebnissen dieser grundlegenden Arbeiten sollen neue Ans

This code reproduces the error:

library(RCurl)
library(XML)
url <- "www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
htmlTreeParse(getURL(url))

The issue seems to originate in htmlTreeParse, as getURL alone works and returns the expected content. I also checked whether this could be an encoding issue, and as far as I can tell it is not.
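A minimal sketch of how the string returned by getURL could be inspected before it is handed to the parser. The `page` value below is a made-up stand-in for the real response (an assumption, so that the snippet runs without network access); with the real page one would use `page <- getURL(url)` instead.

```r
# Hypothetical stand-in for the result of getURL(url); NOT the real page.
page <- "\n<!DOCTYPE html>\n<html><head><title>t</title></head><body></body></html>"

# First character and first few raw bytes: anything that precedes the
# opening "<" (whitespace, a byte-order mark, ...) shows up here.
first_char  <- substring(page, 1, 1)
first_bytes <- charToRaw(substring(page, 1, 4))

# Encoding mark that R has attached to the string
# ("unknown", "latin1", "UTF-8" or "bytes").
enc <- Encoding(page)

print(first_char == "<")
print(first_bytes)
print(enc)
```

This only shows what the string looks like at the byte level; it does not by itself establish why the parser rejects it.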

Moreover, using htmlParse(paste("http://",url,sep="")) works. Note that htmlTreeParse(getURL(paste("http://",url,sep=""))) fails too; the "http://" prefix matters only for htmlParse, which needs it to recognize the string as a URL.

This issue is rather new. Since I have been using the same versions of XML and RCurl throughout, I suppose it has to do with some of the website's content having been updated, but given the error message, I can't quite figure out what is raising it.

Although it works on that simple example, using htmlParse is not really a workaround, as I need to pass additional arguments to the getURL call (such as userpwd), which I can't provide to htmlParse.
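One option that may be worth trying, sketched here under assumptions rather than verified against the page in question: the XML parsing functions accept an asText argument, which states explicitly that the first argument is markup and not a file name or URL. That would keep the getURL call, and therefore its extra options, in place. The `page` string and the credentials mentioned in the comments are hypothetical placeholders.

```r
library(XML)

# Hypothetical stand-in for getURL(url, userpwd = "user:password");
# both the credentials and this content are placeholders.
page <- paste0(
  "\n<!DOCTYPE html>\n",
  "<html><head><title>Example page</title></head>",
  "<body><p>Hello</p></body></html>"
)

# asText = TRUE tells the parser to treat `page` as HTML content
# instead of trying to interpret it as a file name.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Extract the <title> text to confirm that the document was parsed.
page_title <- xpathSApply(doc, "//title", xmlValue)
print(page_title)
```

With the real page, the equivalent call would presumably be htmlTreeParse(getURL(url, userpwd = "user:password"), asText = TRUE); whether that resolves the error above is an assumption.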

Any hints would be greatly appreciated,

Cheers,

Nico

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] XML_3.9-4      RCurl_1.91-1   bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.15.0

---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------


