[R] R hangs after htmlTreeParse

Duncan Temple Lang duncan at wald.ucdavis.edu
Thu Aug 25 23:18:22 CEST 2011


Hi Simon

 I tried this on OS X, Linux and Windows and it works without any problem.
So there must be some strange interaction with your configuration.
Below are some things to try in order to get more information about the problem.

It would be more informative to give us the explicit version information
about the packages, e.g. use sessionInfo().  Details are very important
in cases like this.

In addition to the versions of the packages, it is also important to identify the
version of libxml via the libxmlVersion() function.
(Mine is 2.07.03. Yours may still be in the 2.6.16 region. I can't recall the defaults on OS X 10.6.)
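
As a rough sketch, the version details asked for above can be collected in one go. This assumes only that the XML package is installed; nothing here is specific to your setup.

```r
library(XML)

# R version, platform, and the versions of all attached packages
print(sessionInfo())

# Version of libxml2 that the XML package was compiled against,
# returned as a list with components major, minor and patch
v <- libxmlVersion()
print(v)

# Version of the XML package itself
print(packageVersion("XML"))
```

Pasting the full output into your reply is usually more useful than summarising it.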

Are you doing this in a GUI or at the command-line? If the former, try the
latter, i.e. run the commands in a terminal and see if that changes anything,
e.g. whether particular characters in the output are causing problems.

Since you are seeing some of the HTML document appear on the console, the problem is
in the implicit call to print() that follows the call to htmlTreeParse().
The problem is likely to be delayed if you assign the result of htmlTreeParse()
to a variable and do not induce this call to print().
Then you can explore the tree and see if it is corrupted in some way.
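
Something along these lines, as a minimal sketch. The small inline HTML string is a made-up stand-in for the page you download with getURL(), so that the example runs without network access; with your data you would pass .x instead.

```r
library(XML)

# Hypothetical stand-in for the downloaded page content
html <- "<html><head><title>Search</title></head><body><p>First</p><p>Second</p></body></html>"

# Assigning the result suppresses the implicit print() of the whole tree
doc <- htmlTreeParse(html, asText = TRUE)

# Explore the tree piece by piece instead of printing all of it
root <- xmlRoot(doc)
xmlName(root)            # name of the top-level node
names(root)              # names of its children
body <- root[["body"]]
xmlSize(body)            # number of child nodes under <body>
xmlValue(body[[1]])      # text content of the first child
```

If one of these steps hangs or fails, that narrows down which part of the tree is causing the trouble.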

Furthermore, you might use htmlParse(). It returns the tree in a very different
form, but one that can be manipulated with the same R functions and also with XPath queries.
I "very rarely" (i.e. never) use htmlTreeParse() anymore.
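
For illustration, a small sketch of htmlParse() combined with XPath. The HTML string and the "//a" query are invented for the example and are not taken from the page in question; with your data you would call htmlParse(.x, asText = TRUE).

```r
library(XML)

# Hypothetical document standing in for the downloaded page
html <- "<html><body><a href='/a1'>One</a><a href='/a2'>Two</a></body></html>"

# htmlParse() returns an internal (C-level) document rather than R lists
doc <- htmlParse(html, asText = TRUE)

# XPath queries: the href attribute and the text of every <a> node
links  <- xpathSApply(doc, "//a", xmlGetAttr, "href")
titles <- xpathSApply(doc, "//a", xmlValue)

# The usual accessor functions still apply
xmlName(xmlRoot(doc))
```

Here links and titles are plain character vectors, so nothing large is printed unless you ask for it.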

 D.



On 8/25/11 8:41 AM, Simon Kiss wrote:
> Dear colleagues,
> I'm trying to parse the html content from this webpage:
> http://timesofindia.indiatimes.com/searchresult.cms?sortorder=score&searchtype=2&maxrow=10&startdate=2001-01-01&enddate=2011-08-25&article=2&pagenumber=1&isphrase=no&query=IIM&searchfield=&section=&kdaterange=30&date1mm=01&date1dd=01&date1yyyy=2001&date2mm=08&date2dd=25&date2yyyy=2011
> 
> Using the following code
> library(RCurl)
> library(XML)
> myurl<-c("http://timesofindia.indiatimes.com/searchresult.cms?sortorder=score&searchtype=2&maxrow=10&startdate=2001-01-01&enddate=2011-08-25&article=2&pagenumber=1&isphrase=no&query=IIM&searchfield=&section=&kdaterange=30&date1mm=01&date1dd=01&date1yyyy=2001&date2mm=08&date2dd=25&date2yyyy=2011")
> 
> .x<-getURL(myurl)
> htmlTreeParse(.x, asText=T)
> 
> This prints approximately 15 lines of the output from the html document and then mysteriously stops. The command line prompt does not reappear and force quit is the only option. 
> I'm running R 2.13 on Mac os 10.6 and the latest versions of XML and RCURL are installed.
> Yours, Simon Kiss
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


