[R] Rapache ( was Developing a web crawler )

Matt Shotwell matt at biostatmatt.com
Sun Mar 6 19:51:53 CET 2011


On Sun, 2011-03-06 at 08:06 -0500, Mike Marchywka wrote: 
> 
> ----------------------------------------
> > Date: Thu, 3 Mar 2011 13:04:11 -0600
> > From: Matt.Shotwell at vanderbilt.edu
> > To: r-help at r-project.org
> > Subject: Re: [R] Developing a web crawler / R "webkit" or something similar? [off topic]
> >
> > On 03/03/2011 08:07 AM, Mike Marchywka wrote:
> > >
> > >> Date: Thu, 3 Mar 2011 01:22:44 -0800
> > >> From: antujsrv at gmail.com
> > >> To: r-help at r-project.org
> > >> Subject: [R] Developing a web crawler
> > >>
> > >> Hi,
> > >>
> > >> I wish to develop a web crawler in R. I have been using the functionalities
> > >> available under the RCurl package.
> > >> I am able to extract the html content of the site but i don't know how to go
> > >
> > > In general this can be a big effort but there may be things in
> > > text processing packages you could adapt to execute html and javascript.
> > > However, I guess what I'd be looking for is something like a "webkit"
> > > package or other open source browser with or without an "R" interface.
> > > This actually may be an ideal solution for a lot of things as you get
> > > all the content handlers of at least some browser.
> > >
> > >
> > > Now that you mention it, I wonder if there are browser plugins to handle
> > > "R" content (I'd have to give this some thought: put a script up as
> > > a web page with MIME type "text/R" and have the browser execute it in R).
> >
> > There are server-side solutions for this sort of thing. See
> > http://rapache.net/ . Also, there was a string of messages on R-devel
> > some years ago addressing the mime type issue; beginning here:
> > http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html . Though I don't
> > know whether there was a resolution. Some suggestions were text/x-R,
> > text/x-Rd, application/x-RData.
> >
> The RApache demo looks like something I could use right away,
> but I haven't looked into the handlers yet. I have now installed RApache
> on my Debian system (I still have config issues, but I did get apache2 to restart, LOL).
> Before I plow into this too far, how would this compare/compete with something
> like a PHP library for Rserve? That is the approach I had been pursuing.
> 
> Thanks. 

Hi Mike, 

If you've built and configured RApache, then the difficult "plowing" is
over :). RApache operates at the HTTP (application) layer, at the top of
the OSI stack, whereas Rserve speaks its own protocol directly over
TCP/IP sockets, closer to the transport layer. Hence, the
scope of Rserve applications is far more general. Extending Rserve to
operate at the HTTP layer (via PHP) will mean more work.

RApache offers high-level functionality; for example, it can replace PHP
with R in web pages. No interface code is necessary. Here's a simple
"What's The Time?" webpage using RApache and yarr [1] to handle the
code:

<< setContentType("text/html\n\n") >>
<html>
<head><title>What's The Time?</title></head>
<body><pre><</= cat(format(Sys.time(), usetz=TRUE)) >></pre></body>
</html>

Here's a live version: [2]. Interfacing PHP with Rserve in this context
would be useful if installation of R and/or RApache on the web host were
prohibited. A PHP/Rserve framework might also be useful in other
contexts, for example, to extend PHP applications (e.g. WordPress,
MediaWiki).
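
As a rough illustration of what the template above computes, the same page
could be assembled in plain R along these lines. This is only a sketch: the
function name time_page is made up for the example and is not part of yarr
or RApache, and it says nothing about how yarr processes templates
internally.

```r
# Hypothetical helper (not part of yarr or RApache): builds the same
# "What's The Time?" page as a single character string.
time_page <- function(now = Sys.time()) {
  # format() with usetz = TRUE appends the time zone abbreviation
  stamp <- format(now, usetz = TRUE)
  paste0(
    "<html>\n",
    "<head><title>What's The Time?</title></head>\n",
    "<body><pre>", stamp, "</pre></body>\n",
    "</html>\n"
  )
}

# Write the page to standard output
cat(time_page())
```

Passing a fixed POSIXct value as the argument makes the output reproducible,
which is handy when checking the markup by hand.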

Best,
Matt

[1] http://biostatmatt.com/archives/1000
[2] http://biostatmatt.com/yarr/time.yarr

> 
> > -Matt
> >
> > >
> 


