[R] XML package example code?

Duncan Temple Lang duncan at wald.ucdavis.edu
Wed Nov 25 18:25:27 CET 2009



Peng Yu wrote:
> On Wed, Nov 25, 2009 at 12:19 AM, cls59 <chuck at sharpsteen.net> wrote:
>>
>> Peng Yu wrote:
>>> I'm interested in parsing an HTML page. I should use XML, right? Could
>>> somebody show me some example code? Is there a tutorial for this
>>> package?
>>>
>> Did you try looking through the help pages for the XML package or browsing
>> the Omegahat website?
>>
>> Look at:
>>
>>  library(XML)
>>  ?htmlTreeParse
>>
>> And the relevant web page for documentation and examples is:
>>
>>  http://www.omegahat.org/RSXML/
> 
> 
> http://www.omegahat.org/RSXML/shortIntro.html
> 
> I'm trying the example on the above webpage, but I'm not sure why I
> got the following error. Could you take a look?
> 
> 
> $ Rscript main.R
>> library(XML)
>>
>> download.file('http://www.omegahat.org/RSXML/index.html','index.html')
> trying URL 'http://www.omegahat.org/RSXML/index.html'
> Content type 'text/html; charset=ISO-8859-1' length 3021 bytes
> opened URL
> ==================================================
> downloaded 3021 bytes
> 
>> doc = xmlInternalTreeParse("index.html")


You are trying to parse an HTML document as if it were XML.
But HTML is often not well-formed.  So use htmlParse()
for a more forgiving parser.
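
A minimal sketch of the difference, assuming the XML package is installed. The HTML string below is made up for illustration (it leaves its <li> and <html> tags unclosed); it is not the content of index.html.

```r
library(XML)

# Hypothetical, deliberately malformed HTML: <li> and <html> are never closed.
html <- "<html><body><ul><li>first<li>second</ul></body>"

# The strict XML parser is expected to reject this input; catch the error
# so the script keeps running.
strict <- tryCatch(xmlParse(html, asText = TRUE),
                   error = function(e) NULL)

# The HTML parser repairs the structure while it reads.
doc <- htmlParse(html, asText = TRUE)

# The repaired tree can then be queried with XPath as usual.
items <- xpathSApply(doc, "//li", xmlValue)
print(items)
```

For a file on disk, the same call takes the file name instead, e.g. htmlParse("index.html").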

Or use the RTidyHTML package (www.omegahat.org/RTidyHTML), an
interface to libtidy, to make the HTML well-formed before passing
it to xmlTreeParse() (or xmlInternalTreeParse(), which is
xmlTreeParse() with useInternalNodes = TRUE).

 D.


> Opening and ending tag mismatch: dd line 68 and dl
> Opening and ending tag mismatch: li line 67 and body
> Opening and ending tag mismatch: dt line 66 and html
> Premature end of data in tag dd line 64
> Premature end of data in tag li line 63
> Premature end of data in tag dt line 62
> Premature end of data in tag dl line 61
> Premature end of data in tag body line 5
> Premature end of data in tag html line 1
> Error: 1: Opening and ending tag mismatch: dd line 68 and dl
> 2: Opening and ending tag mismatch: li line 67 and body
> 3: Opening and ending tag mismatch: dt line 66 and html
> 4: Premature end of data in tag dd line 64
> 5: Premature end of data in tag li line 63
> 6: Premature end of data in tag dt line 62
> 7: Premature end of data in tag dl line 61
> 8: Premature end of data in tag body line 5
> 9: Premature end of data in tag html line 1
> Execution halted
> 



