[Rd] Error handling with frozen RCurl function calls + Identification of frozen R processes

Janko Thyson janko.thyson.rstuff at googlemail.com
Wed Jan 26 20:34:02 CET 2011

Dear list,

I'm tackling an empirical research problem that requires me to address a whole
bunch of conceptual and technical details at the same time, which leaves
little time for the nitty-gritty of the individual "components" involved.
As a result, I currently lack the time to dive deeply into parallel computing
and HTTP requests via RCurl, and I hope you can help me out with one or two
pressing issues of my crawler/scraper:

Once a day, I'm running 'RCurl::getURIAsynchronous(x=URL.frontier.sub,
multiHandle=my.multi.handle)' within an lapply() construct in order to read
chunks of deterministically composed URLs from a host. I have implemented
courtesy delays between the individual HTTP requests (5 times the time the
last request to this host took) so that I'm not clogging the host. In total
I'm causing about 15 minutes of traffic per day. The problem is that
'getURIAsynchronous()' sometimes simply freezes, and I have no idea why. I
also can't reproduce the problem, as it occurs completely erratically.
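
The request loop described above can be sketched roughly as follows. This is
an illustration only, not the actual crawler code: 'crawl_politely',
'fetch_chunk', 'delay_factor' and 'sleep' are made-up names, and
'fetch_chunk' stands in for the real RCurl call so that the delay logic can
be tried without touching the network.

```r
# Sketch only. 'fetch_chunk' is a placeholder for the real request, e.g.
#   function(urls) RCurl::getURIAsynchronous(urls, multiHandle = my.multi.handle)
crawl_politely <- function(url_chunks, fetch_chunk, delay_factor = 5,
                           sleep = Sys.sleep) {
  lapply(url_chunks, function(urls) {
    started <- proc.time()[["elapsed"]]
    result  <- fetch_chunk(urls)
    elapsed <- proc.time()[["elapsed"]] - started
    # Courtesy delay: wait delay_factor times as long as the request took
    sleep(delay_factor * elapsed)
    result
  })
}
```

Passing 'sleep' as an argument is only a convenience for testing; in a real
run the default Sys.sleep() would be used.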

I tried wrapping the call in a try() or tryCatch() construct, to no avail.
I've also experimented with a couple of Curl's timeout options, but honestly
didn't understand all their implications; none has worked so far. It seems
that, upon an error, 'getURIAsynchronous()' simply does not give control
back to the R process. Additionally, due to my lack of in-depth knowledge of
parallel computing, the program is scripted to run a bunch of independent R
processes. "Communication" between them takes place via variables that they
read from and write to disk in order to have some sort of "shared
environment" (horrible, I know ;-)).
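
A minimal sketch of the combination mentioned above, libcurl timeouts plus
tryCatch(), might look like this. The helper name 'safe_fetch' and its
arguments are hypothetical. 'connecttimeout' and 'timeout' are standard curl
options (in seconds) that getURL() accepts via '.opts'; whether
getURIAsynchronous() honours them in the same way is an assumption that
would need checking.

```r
# Hypothetical helper: applies libcurl timeouts and turns any R-level error
# into a structured result instead of aborting the calling loop.
safe_fetch <- function(urls, fetch = RCurl::getURL,
                       connect_timeout = 10, total_timeout = 60) {
  opts <- list(connecttimeout = connect_timeout,  # max seconds to connect
               timeout        = total_timeout)    # max seconds per transfer
  tryCatch(
    list(ok = TRUE, content = fetch(urls, .opts = opts), error = NULL),
    error = function(e) {
      # Keep the condition class so the caller can tell error types apart
      list(ok = FALSE, content = NULL,
           error = list(class = class(e), message = conditionMessage(e)))
    }
  )
}
```

Note that tryCatch() can only handle conditions that are actually signalled
at the R level. A call that blocks inside compiled code never signals
anything, so in that case it is the timeout options that would have to hand
control back.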

So here are my specific questions:
1) Is it possible to catch connection or timeout errors in RCurl functions
so that I can implement my own customized error handling? If so, could you
point me to some examples, please?
2) Can I somehow identify "frozen" Rterm or Rscript processes (e.g. via
Sys.getpid()) in order to shut them down and reinitialize them?
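
One possible approach to question 2, sketched under the assumption that the
file-based "shared environment" is kept: each worker periodically writes a
heartbeat file named after its Sys.getpid(), and a separate watchdog process
treats workers whose file has not been updated for a while as frozen. All
function names, the file naming scheme and the 300-second threshold are
hypothetical.

```r
# Worker side: call this before/after each request to signal "still alive".
write_heartbeat <- function(dir, pid = Sys.getpid()) {
  path <- file.path(dir, paste("worker_", pid, ".heartbeat", sep = ""))
  writeLines(format(Sys.time()), path)  # rewriting updates the file's mtime
  invisible(path)
}

# Watchdog side: return the PIDs whose heartbeat file is older than the limit.
find_stale_workers <- function(dir, max_age_secs = 300, now = Sys.time()) {
  files <- list.files(dir, pattern = "^worker_[0-9]+\\.heartbeat$",
                      full.names = TRUE)
  if (length(files) == 0) return(integer(0))
  age   <- as.numeric(difftime(now, file.info(files)$mtime, units = "secs"))
  stale <- files[age > max_age_secs]
  as.integer(sub("^worker_([0-9]+)\\.heartbeat$", "\\1", basename(stale)))
}

# Watchdog loop body (not run here; assumes tools::pskill() is available):
# for (pid in find_stale_workers("heartbeats")) tools::pskill(pid)
```

A worker that hangs inside a request stops updating its file, so the
watchdog can terminate that PID, delete the heartbeat file and start a fresh
process.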

You'll find my session info below.

Thanks for any hints or advice! 

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] tcltk     tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] RCurl_1.5-0.1    bitops_1.0-4.1   XML_3.2-0.2      RMySQL_0.7-5    
 [5] filehash_2.1-1   hash_2.0.1       timeDate_2130.91 RODBC_1.3-2     
 [9] MiscPsycho_1.6   statmod_1.4.8    debug_1.2.4      mvbutils_2.5.4  
[13] DBI_0.2-5        cwhmisc_2.1      lattice_0.19-13 

loaded via a namespace (and not attached):
[1] grid_2.12.1

