[R] memory leak using XML readHTMLTable

Duncan Temple Lang duncan at wald.ucdavis.edu
Mon Sep 17 18:16:47 CEST 2012


Hi James

  Unfortunately, I am not certain whether the "latest version"
of the XML package has garbage collection activated for the nodes.
It is quite complicated, and that feature was turned off in some versions
of the package.  I suggest that you install the version of the package on GitHub:

  git at github-omg:omegahat/XML.git

I believe that will handle the garbage collection of nodes, and I'd like
to know if it doesn't.
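
Independently of the package version, it may help to parse the document
yourself and release it explicitly. In the code quoted below, free() is
called on the list of data frames returned by readHTMLTable(), after that
list has already been removed with rm(), so it most likely never reaches the
underlying parsed document. The following is only a minimal sketch of the
alternative, not a verified fix for the leak: the helper name
scrapeFirstTable and the inline sample page are made up for illustration,
and it assumes the first table has a name column followed by a value column,
as in the original function.

```r
library(XML)

# Hypothetical helper (not part of the XML package): parse once, extract
# the tables, then release the C-level document explicitly rather than
# relying on the garbage collector.
scrapeFirstTable <- function(html) {
  # html is a character vector of HTML text, e.g. from readLines(URL)
  doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
  on.exit(free(doc), add = TRUE)  # free() is applied to the parsed document

  tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
  first <- tables[[1]]

  data.frame(Value = first[, 2],
             row.names = make.unique(first[, 1]),
             stringsAsFactors = FALSE)
}

# Small inline page (invented sample data) so the sketch runs offline
page <- c("<html><body><table>",
          "<tr><th>Name</th><th>Value</th></tr>",
          "<tr><td>a</td><td>1</td></tr>",
          "<tr><td>a</td><td>2</td></tr>",
          "</table></body></html>")

result <- scrapeFirstTable(page)
print(result)
```

In the original loop this would be called as
lapply(myURLs, function(u) scrapeFirstTable(readLines(u))), so that each
parsed document is freed before the next URL is read.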

   Best,
    D.

On 9/16/12 8:30 PM, J Toll wrote:
> Hi,
> 
> I'm using the XML package to scrape data and I'm trying to figure out
> how to eliminate the memory leak I'm currently experiencing.  In the
> searches I've done, it sounds like the existence of the leak is fairly
> well known.  What isn't as clear is exactly how to solve it.  The
> general process I'm using is this:
> 
> require(XML)
> 
> myFunction <- function(URL) {
> 
>   html <- readLines(URL)
> 
>   tables <- readHTMLTable(html, stringsAsFactors = FALSE)
> 
>   myData <- data.frame(Value = tables[[1]][, 2],
>                                 row.names = make.unique(tables[[1]][, 1]),
>                                 stringsAsFactors = FALSE)
> 
>  rm(list = c("html", "tables"))   # here, and
>  free(tables)                     # here, my attempt to solve the memory leak
> 
>   return(myData)
> 
> }
> 
> x <- lapply(myURLs, myFunction)
> 
> 
> I've tried using rm() and free() to try to free up the memory each
> time the function is called, but it hasn't worked as far as I can
> tell.  By the time lapply is finished working through my list of URLs,
> I'm swapping about 3GB of memory.
> 
> I've also tried using gc(), but that seems to also have no effect on
> the problem.
> 
> I'm running RStudio 0.96.330 and the latest version of XML.
> R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> 
> Any suggestions on how to solve this memory issue?  Thanks.
> 
> 
> James