[R] How to parse XML

Fri May 2 21:57:14 CEST 2008

Martin,

I can't thank you enough for taking the time to help and providing the
detailed examples of how to get started.  Now I know exactly how to
proceed.

Thanks again,

Roger 

-----Original Message-----
From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
Sent: Friday, May 02, 2008 12:02 PM
To: Bos, Roger
Cc: r-help at r-project.org
Subject: Re: [R] How to parse XML

Hi Roger --

"Bos, Roger" <roger.bos at us.rothschild.com> writes:

> I would like to learn how to parse a mixed text/xml document I 
> downloaded from the sec.gov website (see example below).  I would like

I'm not sure of a more robust way to extract the XML, but from
inspection I wrote

> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> txt <- readLines(ftp)
> xmlInside <- grep("</*XML", txt)
> xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]

so that xmlTxt contains the part of the message that is XML
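
The same extraction can be tried without touching the network. This is
only a sketch: the txt vector below is a made-up miniature of a filing
(not the real SEC file), and the anchored pattern "^</?XML>" is my
assumption that the wrapper tags sit alone at the start of a line.

```r
# Hypothetical miniature of a mixed text/XML filing (not the real SEC file)
txt <- c(
  "<SEC-DOCUMENT>example header text",
  "<XML>",
  "<?xml version=\"1.0\"?>",
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "</ownershipDocument>",
  "</XML>",
  "</SEC-DOCUMENT>"
)

# Lines holding the opening <XML> and closing </XML> wrapper tags
xmlInside <- grep("^</?XML>", txt)

# Stop early if the file does not contain exactly one embedded XML block
stopifnot(length(xmlInside) == 2)

# Keep only the lines strictly between the two wrapper tags
xmlTxt <- txt[seq(xmlInside[1] + 1, xmlInside[2] - 1)]
print(xmlTxt)
```

The stopifnot() check is there because seq() would silently produce
nonsense if the wrapper tags were missing or appeared more than once.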

> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where to

There are several ways to proceed. I personally like the XPath query
language. To do this, one might

> library(XML)
> xml <- xmlTreeParse(xmlTxt, useInternalNodes=TRUE)
> head(unlist(xpathApply(xml, "//*", xmlName)))

[1] "ownershipDocument"     "schemaVersion"         "documentType"
[4] "periodOfReport"        "notSubjectToSection16" "issuer"

xpathApply takes an XML document and performs a query. The query '//*'
says find all element nodes with any name (that's the *) that are
located anywhere (that's the //) below the current (in this case root)
node. This gives a list of nodes; xmlName extracts the name of each
node. If I wanted all nodes not subject to section 16 (sounds
ominous) I'd extract all the nodes (a list)

> node <- xpathApply(xml, "//notSubjectToSection16")

and then do something with them, e.g., look at them

> lapply(node, saveXML)
[[1]]
[1] "<notSubjectToSection16>0</notSubjectToSection16>"

(not so bad, looks like nothing is not subject to section 16, that's a
relief) and extract their value

> lapply(node, xmlValue)

In one step:

> xpathApply(xml, "//notSubjectToSection16", xmlValue)
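
If a plain character vector is more convenient than a list, xpathSApply
from the same package simplifies the result. A small self-contained
sketch follows; the XML fragment is invented for illustration (only the
element names ownershipDocument, documentType, notSubjectToSection16 and
issuer come from the output above; issuerName and its value are
hypothetical).

```r
library(XML)  # provides xmlParse, xpathSApply, xmlName, xmlValue

# Hypothetical fragment standing in for the extracted xmlTxt
xmlTxt <- c(
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "  <issuer><issuerName>Example Corp</issuerName></issuer>",
  "</ownershipDocument>"
)

# Parse the text into an internal (C-level) document that XPath can query
xml <- xmlParse(paste(xmlTxt, collapse = "\n"), asText = TRUE)

# Names of every element anywhere in the document
allNames <- xpathSApply(xml, "//*", xmlName)

# Value of one named element, as a character vector rather than a list
flag <- xpathSApply(xml, "//notSubjectToSection16", xmlValue)

# A nested element addressed by its full path from the root
issuer <- xpathSApply(xml, "/ownershipDocument/issuer/issuerName", xmlValue)

print(allNames)
print(flag)
print(issuer)
```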

?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath,
especially

http://www.w3.org/TR/xpath#path-abbrev
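
To give a flavour of the abbreviated syntax described there, here is a
rough sketch on a toy document. The filing/owner/name elements and the
rank attribute are all made up for the example and have nothing to do
with the SEC schema.

```r
library(XML)

# Hypothetical toy document, used only to exercise the abbreviations
doc <- xmlParse('<filing id="f1">
  <owner rank="1"><name>A</name></owner>
  <owner rank="2"><name>B</name></owner>
</filing>', asText = TRUE)

# Plain child steps from the root: every name under every owner
allOwners <- xpathSApply(doc, "/filing/owner/name", xmlValue)

# [2] is short for [position()=2]: the second owner child
secondOwner <- xpathSApply(doc, "/filing/owner[2]/name", xmlValue)

# @rank is short for attribute::rank, used here inside a predicate
rankOne <- xpathSApply(doc, "//owner[@rank='1']/name", xmlValue)

# .. steps up to the parent; extra arguments are passed on to xmlGetAttr
parentRank <- xpathSApply(doc, "//name[.='B']/..", xmlGetAttr, "rank")

print(allOwners)
print(secondOwner)
print(rankOne)
print(parentRank)
```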

Martin

> start debugging the errors I am getting in this example code.  Can 
> anyone help me get started?
>
> Thanks, Roger
>
> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")
>
>
> **********************************************************************

> * This message is for the named person's use only. It may contain 
> confidential, proprietary or legally privileged information. No right 
> to confidential or privileged treatment of this message is waived or 
> lost by any error in transmission. If you have received this message 
> in error, please immediately notify the sender by e-mail, delete the 
> message and all copies from your system and destroy any hard copies. 
> You must not, directly or indirectly, use, disclose, distribute, print

> or copy any part of this message if you are not the intended 
> recipient.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
