[R] xmlToDataFrame very slow
Duncan Temple Lang
dtemplelang at ucdavis.edu
Wed Jul 31 19:53:39 CEST 2013
xmlToDataFrame() is very generic and so doesn't know anything
about the particulars of the XML it is processing. If you know
something about the structure of the XML, you should be able to leverage that
to read the data more quickly. xmlToDataFrame() is also not optimized, as it is
just a convenience routine for people who want to work with XML without much effort.
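As a rough illustration of what leveraging the structure could look like, here
is a minimal sketch. It assumes a flat layout in which every record is a <row>
element under a <rows> root and every record contains the same child fields
exactly once. The element names, the sample data, and the getColumn() helper are
all hypothetical, since I have not seen the actual file. It uses one XPath query
per column via xpathSApply() rather than processing the document node by node.

```r
library(XML)

# Hypothetical sample document: each <row> has the same three child elements.
txt <- '<rows>
  <row><id>1</id><name>a</name><score>3.5</score></row>
  <row><id>2</id><name>b</name><score>4.0</score></row>
  <row><id>3</id><name>c</name><score>2.25</score></row>
</rows>'

doc <- xmlParse(txt, asText = TRUE)

# Hypothetical helper: one XPath query per column, returning a character vector
# with one entry per matching node.
getColumn <- function(doc, field)
    xpathSApply(doc, sprintf("/rows/row/%s", field), xmlValue)

fields <- c("id", "name", "score")
cols <- lapply(fields, function(f) getColumn(doc, f))
names(cols) <- fields

# Assemble the columns once, rather than growing a data frame row by row.
df <- as.data.frame(cols, stringsAsFactors = FALSE)

# Everything comes back as character; convert the columns known to be numeric.
df$id <- as.integer(df$id)
df$score <- as.numeric(df$score)
```

If some records omit a field, the per-column vectors will have different
lengths and no longer line up, so this shortcut only applies when the layout
really is regular.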
If you send me the file and the code you are using to read the file, I'll take a
look at it.
On 7/30/13 11:10 AM, Stavros Macrakis wrote:
> I have a modest-size XML file (52MB) in a format suited to xmlToDataFrame (package XML).
> I have successfully read it into R by splitting the file 10 ways then running xmlToDataFrame on each part, then
> rbind.fill (package plyr) on the result. This takes about 530 s total, and results in a data.frame with 71k rows and
> object.size of 21MB.
> But trying to run xmlToDataFrame on the whole file takes forever (> 10000 s so far). xmlParse of this file takes only 0.8 s.
> I tried running xmlToDataFrame on the first 10% of the file, then the first 10% repeated twice, then three times (with
> the outer tags adjusted of course). Timings:
> 1 copy:   111 s = 111 s per copy
> 2 copies: 311 s = 155 s per copy
> 3 copies: 626 s = 209 s per copy
> The runtime is superlinear. What is going on here? Is there a better approach?