[R] Parse XML

Martin Morgan mtmorgan at fhcrc.org
Tue Jun 10 17:52:27 CEST 2008


Hi Pratt --

ppatel3026 <pratik.patel at us.rothschild.com> writes:

> Could someone provide a link or examples of parsing XML document in R? Few
> specific questions below:

Always helpful to know what software you're using; here's mine

> library(XML)
> sessionInfo()
R version 2.8.0 Under development (unstable) (2008-06-09 r45889) 
x86_64-unknown-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  utils     datasets  grDevices methods   base     

other attached packages:
[1] XML_1.95-2

loaded via a namespace (and not attached):
[1] tools_2.8.0

> For instance I can retrieve specific nodes using this:  
> node <- xpathApply(xml, "//" %+% xtag, xmlValue)
>
> 1) I want to be able to retrieve parent node for this node, how can I do
> this? getParentNode() does not seem to cut it. 

I've found it easier to use xpath and the 'internal' representation

> library(XML)
> f <- system.file("exampleData", "mtcars.xml", package="XML")
> xml <- xmlTreeParse(f, useInternalNodes=TRUE)
> q <- "//record[@id='AMC Javelin']"
> nodes <- xpathApply(xml, q)

nodes is a list of length one. Here's the parent of the first (and
only) element

> parent <- xmlParent(nodes[[1]])

or

> xpathApply(xml, paste(q, "/..", sep=""))[[1]]
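
A quick way to convince yourself that the two approaches agree is to try them on a small inline document. This is only a sketch: the document text and the names txt, doc, p1 and p2 are made up for illustration, and getNodeSet stands in for xpathApply simply because no function is applied to the nodes.

```r
library(XML)

## Hypothetical miniature document, loosely modelled on mtcars.xml
txt <- "<dataset><record id='A'>21</record><record id='B'>22</record></dataset>"
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

q <- "//record[@id='B']"
nodes <- getNodeSet(doc, q)

## Parent obtained from the node itself ...
p1 <- xmlParent(nodes[[1]])
## ... and parent obtained with XPath's parent step
p2 <- getNodeSet(doc, paste(q, "/..", sep = ""))[[1]]

## Both should name the same enclosing element
xmlName(p1)
xmlName(p2)
```

Both calls should report the name of the element that encloses the selected record.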


> 2) How can I retrieve children nodes for a particular node? 

> xmlChildren(parent)

or, for a parent identified by the XPath expression pq <- "dataset",

> xpathApply(xml, paste(pq, "/*", sep=""))
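
Once you have the children you will usually want their names or attributes. A minimal sketch, again on a made-up inline document (txt, doc, kids, child_names and ids are illustrative names, not anything from mtcars.xml):

```r
library(XML)

## Hypothetical miniature document, loosely modelled on mtcars.xml
txt <- paste("<dataset>",
             "<variables><variable>mpg</variable><variable>cyl</variable></variables>",
             "<record id='A'>21 6</record>",
             "<record id='B'>22 4</record>",
             "</dataset>", sep = "")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

## Children of the root node, as a list of nodes
kids <- xmlChildren(xmlRoot(doc))
child_names <- unname(sapply(kids, xmlName))

## The same idea through XPath: apply a function to each matching child
ids <- xpathSApply(doc, "/dataset/record", xmlGetAttr, "id")

child_names
ids
```

xmlChildren gives you every child regardless of name, whereas the XPath form lets you filter (here, to record elements only) in the same step.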

> 3) How can I create an iterator to iterate through the whole tree?

For true event parsing I think you want xmlEventParse, which reads
through the document and invokes the functions in its 'handlers'
argument as it encounters each node. 'handlers' is a named list of
functions; each name signifies either a general type of position in
the tree (e.g., 'startElement') or the name of a node (e.g.,
'record'). So

> handler <- list(startElement=function(name, atts, ...) {
+     cat("starting", name, "\n")
+ })
> xmlEventParse(f, handler)
starting dataset 
starting variables 
starting variable 
[etc]
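
To see how start and end events pair up, here is a small sketch that adds an endElement handler and parses an inline string instead of a file. The document text and the names txt and handlers are assumptions for illustration; asText=TRUE tells xmlEventParse that its first argument is the XML itself rather than a file name.

```r
library(XML)

## Hypothetical miniature document
txt <- "<dataset><variables><variable>mpg</variable></variables><record id='A'>21</record></dataset>"

handlers <- list(
    ## called each time an opening tag is seen
    startElement = function(name, atts, ...) {
        cat("start:", name, "\n")
    },
    ## called each time a closing tag is seen
    endElement = function(name, ...) {
        cat("end:", name, "\n")
    })

## invisible() only suppresses printing of the returned handler list
invisible(xmlEventParse(txt, handlers, asText = TRUE))
```

Each element should produce one start line and one end line, nested in document order.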

The usual 'trick' is to use R's lexical scope to provide a context
where results can be stored, e.g., defining a factory to produce
handlers

handlerFactory <- function() {
    ## 'local' store visible to functions defined inside
    ## handlerFactory
    counts <- new.env(parent=emptyenv())
    ## return value -- list of functions
    list(startElement=function(name, atts, ...) {
        ## 'counts' is an environment, modified in place, so <- is
        ## enough here; a plain variable would need <<- instead
        if (!exists(name, counts))
            counts[[name]] <- 1
        else
            counts[[name]] <- counts[[name]] + 1
    }, getCounts=function() {
        ## for retrieving results
        as.list(counts)
    })
}

Then invoke xmlEventParse with an instance of the handler.
xmlEventParse returns the handler list, whose 'counts' environment
has been updated by the time parsing finishes. We access the results
by invoking our getCounts function.

> xmlEventParse(f, handlerFactory())$getCounts()
$record
[1] 32

$variable
[1] 11
[etc]
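
For comparison, here is a sketch of a factory that keeps its state in plain variables rather than an environment, which is the case where <<- is needed. It tracks the deepest nesting level seen. The names depthFactory, getMaxDepth, txt and h are hypothetical, and the document is a made-up inline string.

```r
library(XML)

depthFactory <- function() {
    ## plain variables in the factory's scope, shared by the handlers
    depth <- 0L
    maxDepth <- 0L
    list(startElement = function(name, atts, ...) {
        ## <<- assigns to 'depth' in the enclosing scope, not a local copy
        depth <<- depth + 1L
        if (depth > maxDepth)
            maxDepth <<- depth
    }, endElement = function(name, ...) {
        depth <<- depth - 1L
    }, getMaxDepth = function() {
        ## for retrieving the result
        maxDepth
    })
}

## Hypothetical miniature document: dataset > variables > variable
txt <- "<dataset><variables><variable>mpg</variable></variables><record id='A'>21</record></dataset>"

## Keep our own reference to the handlers so the result can be read
## back afterwards
h <- depthFactory()
invisible(xmlEventParse(txt, h, asText = TRUE))
h$getMaxDepth()
```

With <- instead of <<- each handler would update a fresh local variable on every call and the stored depth would never change.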

If the use of lexical scope is a bit mysterious, there is a 'bank
account' example in the manual An Introduction to R (section 10.7)
and a paper by Ross Ihaka and Robert Gentleman on lexical scope
(referenced at http://www.r-project.org/doc/bib/R-other.html) that
might help.
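
The idea can be sketched in a few lines of base R. This is written in the spirit of the bank account example rather than copied from the manual; the function and account names are illustrative.

```r
open.account <- function(total) {
    ## 'total' lives in the environment created by this call and is
    ## shared by the three functions returned below
    list(deposit = function(amount) {
        total <<- total + amount
        invisible(total)
    }, withdraw = function(amount) {
        if (amount > total)
            stop("insufficient funds")
        total <<- total - amount
        invisible(total)
    }, balance = function() {
        total
    })
}

## Each call creates an independent account with its own 'total'
ross <- open.account(100)
robert <- open.account(200)

ross$withdraw(30)
robert$deposit(50)

ross$balance()
robert$balance()
```

The two accounts do not interfere with each other, just as two calls to handlerFactory() would give two independent sets of counts.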

I don't usually use event parsing, so the above may not be accurate.

Martin

> Thank you,
> Pratt

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


