[R] Importing huge XML-Files

Alexander Heidrich alexander.heidrich at uni-jena.de
Sat Sep 1 21:34:00 CEST 2007


Dear all,

for my diploma thesis I have to import huge XML files into R for
statistical processing. "Huge" here means a file size of about 33 MB.

I'm using the XML package, version 1.9.

Since reading the complete file into R via xmlTreeParse either doesn't
work or is too slow, I'm trying to use xmlEventParse instead, but I am
completely stuck.

The file contains many different types of nodes:

+ <configuration>

- <Data>
  - <dataSets noOfDataSets="50000">
   - <dataSet number="1">
    - <measurements>
     - <measurement number="1">
      - <MRType1>
         <date>21.04.2005</date>
         <time>10:00</time>
         <plotCode>1</plotCode>
         <collarCode />
         <value>2,33</value>
         <depth />
        </MRType1>
       </measurement>
     - <measurement number="2">
      - <MRType1>
         <date>21.04.2005</date>
         <time>10:00</time>
         <plotCode>1</plotCode>
         <collarCode />
         <value>2,33</value>
         <depth />
        </MRType1>
       <MRType2>
	...
+ <personData>
+ <siteData>

I only need the measurement/MRType1 nodes - how can I do this?  
Currently I am trying the following code:

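library(XML)

# Accumulators filled by the handlers; they must exist before parsing
startElement.name <- character(0)
startElement.value <- character(0)

# Called for every opening tag: record the element name
xtract.startElement <- function(name, attr) {
	startElement.name <<- c(startElement.name, name)
}

# Called for every text node: record the text content
xtract.text <- function(text) {
	startElement.value <<- c(startElement.value, text)
}

xmlEventParse("/input.xml", list(startElement = xtract.startElement,
text = xtract.text), useTagName = TRUE, addContext = FALSE)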

This only gives me two lists: one with all the node names (including
the ones I don't need) and one with all the values (again including
the ones I don't need). I can't put the two together this way.
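One reason the two lists do not line up, as far as I can tell, is that empty elements such as <collarCode /> trigger the startElement handler but never the text handler. The following is only a minimal sketch of that mismatch: the XML string is a shortened, made-up fragment of my file passed via asText = TRUE, and the names element.names and text.values are hypothetical.

```r
library(XML)

# Hypothetical miniature of the input, passed as a string instead of a file
doc <- paste0(
  '<measurement number="1"><MRType1>',
  '<date>21.04.2005</date><collarCode /><value>2,33</value>',
  '</MRType1></measurement>'
)

element.names <- character(0)  # one entry per opening tag
text.values   <- character(0)  # one entry per text node

xmlEventParse(
  doc,
  list(
    startElement = function(name, attr, ...) {
      element.names <<- c(element.names, name)
    },
    text = function(text, ...) {
      text.values <<- c(text.values, text)
    }
  ),
  asText = TRUE
)

# Container and empty elements add names without adding values,
# so the two vectors end up with different lengths
length(element.names)
length(text.values)
```

Because the lengths differ, matching names and values by position cannot work without tracking which element is currently open.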

What I want is:

No.  Date  Time  plotCode  collarCode  value  depth
1    ...   ...   ...       ...         ...    ...
2    ...   ...   ...       ...         ...    ...
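To make the target concrete, here is a sketch of the same table as an R data frame, filled with the two sample measurements from the XML excerpt. The column types are my assumptions: empty elements become NA, the date is parsed as day.month.year, and the decimal comma in <value> is converted to a decimal point.

```r
# Sample values copied from the two <measurement> nodes in the excerpt;
# empty elements (<collarCode />, <depth />) are assumed to map to NA.
target <- data.frame(
  number     = c(1L, 2L),
  date       = as.Date(c("21.04.2005", "21.04.2005"), format = "%d.%m.%Y"),
  time       = c("10:00", "10:00"),
  plotCode   = c(1L, 1L),
  collarCode = c(NA_character_, NA_character_),
  # "2,33" uses a decimal comma, so replace it before converting
  value      = as.numeric(sub(",", ".", c("2,33", "2,33"), fixed = TRUE)),
  depth      = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

str(target)
```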

Any help is really appreciated. I have been trying all week, starting
with xmlTreeParse, which works fine for files with 200 entries, but
for files with 50000 entries it keeps crashing my Core 2 Duo 2.4 GHz
machine.

Thanks so much in advance! If you need any further information, code
snippets or XML file details, please do not hesitate to mail me!

Alex


