[R] Importing huge XML-Files

Martin Morgan mtmorgan at fhcrc.org
Sun Sep 2 21:23:06 CEST 2007


Hi Alex --

Try xmlTreeParse with useInternalNodes=TRUE. This will likely allow
you to read the document in easily, even with limited memory (33MB
does not sound too 'huge').

To extract just the node sets you're interested in, use getNodeSet or
xpathApply. This is easier (with a little extra learning, e.g.,
http://www.w3.org/TR/xpath20/#abbrev) and more memory efficient than
creating a DOM parser. Something like (WARNING: not tested!!)

> doc <- xmlTreeParse("doc.xml", useInternalNodes=TRUE)
> res <- getNodeSet(doc, "//MRType1")

will return (pointers to) all the MRType1 nodes. You could then use
sapply and xmlValue, or more directly

> dates <- xpathApply(doc, "//MRType1/date", xmlValue)

and the like to extract specific values.  The examples and help page
for xpathApply are useful to get you going.
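
A slightly fuller sketch along the same lines, building the kind of
table you describe. The inline sample document, the helper name
'field' and the column names are my assumptions, based only on the
fragment you posted; I have not run this against your real file. The
decimal comma in <value> is converted by hand.

```r
library(XML)

# Small stand-in for the real file, modelled on the posted fragment (assumption).
txt <- '<Data><dataSets noOfDataSets="2"><dataSet number="1"><measurements>
  <measurement number="1"><MRType1>
    <date>21.04.2005</date><time>10:00</time><plotCode>1</plotCode>
    <collarCode /><value>2,33</value><depth />
  </MRType1></measurement>
  <measurement number="2"><MRType1>
    <date>22.04.2005</date><time>11:00</time><plotCode>2</plotCode>
    <collarCode /><value>3,10</value><depth />
  </MRType1></measurement>
</measurements></dataSet></dataSets></Data>'

doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
nodes <- getNodeSet(doc, "//measurement/MRType1")

# Hypothetical helper: text of the child element 'name' for every node,
# with NA where the child is missing or empty (e.g. <depth />).
field <- function(name) {
  sapply(nodes, function(n) {
    child <- xmlChildren(n)[[name]]
    v <- if (is.null(child)) character(0) else xmlValue(child)
    if (length(v) == 0 || is.na(v) || !nzchar(v)) NA_character_ else v
  })
}

tab <- data.frame(
  No         = seq_along(nodes),
  date       = field("date"),
  time       = field("time"),
  plotCode   = field("plotCode"),
  collarCode = field("collarCode"),
  # "2,33" -> 2.33: replace the decimal comma before converting
  value      = as.numeric(sub(",", ".", field("value"), fixed = TRUE)),
  depth      = field("depth"),
  stringsAsFactors = FALSE
)
```

For the real file, replace the asText call with the file name. Going
through the node set once per column is not the fastest approach, but
it keeps each step easy to check.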

Watch out for memory use, and the need to 'free' the original 'doc'
and perhaps other intermediate objects.
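
A minimal sketch of what I mean by 'free', using a toy document as a
stand-in for yours. Copy the values you need into ordinary R vectors
first; node pointers obtained from getNodeSet should not be used once
the document has been freed. Depending on the version of the package,
I believe some of this clean-up may happen automatically.

```r
library(XML)

# Toy document standing in for the real file (assumption).
doc <- xmlTreeParse("<a><b>1</b><b>2</b></a>", asText = TRUE,
                    useInternalNodes = TRUE)

# Copy what is needed into a plain character vector first ...
vals <- unlist(xpathApply(doc, "//b", xmlValue))

# ... then release the C-level document and drop the R reference to it.
free(doc)
rm(doc)
invisible(gc())
```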

Martin

Alexander Heidrich <alexander.heidrich at uni-jena.de> writes:

> Dear all,
>
> for my diploma thesis I have to import huge XML files into R for
> statistical processing - huge means a size of about 33 MB.
>
> I'm using the XML-Package version 1.9
>
> Since reading the complete file into R via xmlTreeParse doesn't
> work or is too slow, I'm trying to use xmlEventParse, but I am
> completely stuck.
>
> I have many different types of nodes
>
> + <configuration>
>
> - <Data>
>   - <dataSets noOfDataSets="50000">
>    - <dataSet number="1">
>     - <measurements>
>      - <measurement number="1">
>       - <MRType1>
>          <date>21.04.2005</date>
>          <time>10:00</time>
>          <plotCode>1</plotCode>
>          <collarCode />
>          <value>2,33</value>
>          <depth />
>         </MRType1>
>        </measurement>
>      - <measurement number="2">
>       - <MRType1>
>          <date>21.04.2005</date>
>          <time>10:00</time>
>          <plotCode>1</plotCode>
>          <collarCode />
>          <value>2,33</value>
>          <depth />
>         </MRType1>
>        <MRType2>
> 	...
> + <personData>
> + <siteData>
>
> I only need the measurement/MRType1 nodes - how can I do this?  
> Currently I am trying the following code:
>
> xmlEventParse("/input.xml", list(startElement=xtract.startElement,  
> text=xtract.text), useTagName=TRUE, addContext = FALSE)
>
> xtract.startElement <- function(name,attr){
> 	startElement.name <<- c(startElement.name,name)
> 	}
>
> xtract.text <- function(text) {
> 	startElement.value <<- c(startElement.value,text)
> 	}
>
> This only gives me two lists, one with all the node names (even the
> ones I don't need) and one with the values (again including the
> ones I don't need), but I can't put things together this way.
>
> What I want is:
>
> No. 	Date	Time	Plotcode	collarcode	value	depth
> 1	...	...	...		...		...	...
> 2	...	...	...		...		...	...
>
> Any help is really, really appreciated. I have tried all week,
> starting with xmlTreeParse, which works fine for files with 200
> entries, but for files with 50000 entries it keeps crashing my
> Core 2 Duo, 2.4 GHz machine.
>
> Thanks so much in advance! If you need any further information, code
> snippets or XML file details, please do not hesitate to mail!
>
> Alex

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org


