[R] scraping with session cookies

Duncan Temple Lang dtemplelang at ucdavis.edu
Wed Sep 19 17:49:04 CEST 2012


 You don't need getHTMLFormDescription() and createFunction();
you can call postForm() directly instead. The getHTMLFormDescription()
route is more general, though, and it needs the very latest version of the
RHTMLForms package to deal with degenerate forms that have no inputs
(other than button clicks).

 You can get the latest version of the RHTMLForms package
 from github

      git clone git@github.com:omegahat/RHTMLForms.git

 and that has the fixes for handling the degenerate forms with
 no arguments.

   D.

On 9/19/12 7:51 AM, CPV wrote:
> Thank you for your help Duncan,
> 
> I have been trying what you suggested; however, I am getting an error when
> trying to create the function with fun <- createFunction(forms[[1]]).
> It says: Error in isHidden | hasDefault :
> operations are possible only for numeric, logical or complex types
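
For reference, that message is the one base R raises when a logical operator such as | is applied to a value that is not numeric, logical or complex. The sketch below only reproduces the message with a character value; what RHTMLForms actually passes to | for a degenerate form is not shown in this thread, so the character input is an assumption made for illustration.

```r
# Applying `|` to a character value raises the error quoted above.
# ("yes" is an illustrative stand-in, not the value RHTMLForms uses.)
msg <- tryCatch("yes" | TRUE, error = function(e) conditionMessage(e))
print(msg)

# Converting to a logical value first lets `|` work as intended.
ok <- as.logical("TRUE") | FALSE
print(ok)
```
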
> 
> On Wed, Sep 19, 2012 at 12:15 AM, Duncan Temple Lang <
> dtemplelang at ucdavis.edu> wrote:
> 
>> Hi ?
>>
>> The key is that you want to use the same curl handle
>> for both the postForm() and for getting the data document.
>>
>> site = u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
>>
>> library(RCurl)
>> curl = getCurlHandle(cookiefile = "", verbose = TRUE)
>>
>> postForm(site, disclaimer_action = "I Agree", curl = curl)
>>
>> Now we have the cookie in the curl handle, so we can use that same
>> curl handle to request the data document:
>>
>> txt = getURLContent(u, curl = curl)
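
The two requests above can be wrapped in one helper so the shared handle cannot be forgotten. This is a minimal sketch, assuming the form needs only the single field used in this thread; the function name fetch_after_disclaimer and its field and value arguments are hypothetical, not part of RCurl.

```r
# Sketch: accept a disclaimer form, then fetch the page in the same session.
# `field` and `value` default to the form field used in this thread.
fetch_after_disclaimer <- function(url,
                                   field = "disclaimer_action",
                                   value = "I Agree") {
  # cookiefile = "" turns on libcurl's in-memory cookie handling
  curl <- RCurl::getCurlHandle(cookiefile = "")

  # POST the agreement; any session cookie is kept in `curl`
  params <- setNames(list(value), field)
  RCurl::postForm(url, .params = params, curl = curl)

  # Reuse the same handle so the cookie is sent with this request
  RCurl::getURLContent(url, curl = curl)
}
```

Calling fetch_after_disclaimer(u) would then return the page content as text, which can be handed to readHTMLTable().
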
>>
>> Now we can use readHTMLTable() on the local document content:
>>
>> library(XML)
>> tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
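
To check the table-parsing step without touching the network, the same idea can be tried on a small HTML string. This is a sketch only: the table contents below are made up, and it parses the text with htmlParse() first, which should be equivalent to passing asText = TRUE as above.

```r
library(XML)

# Made-up stand-in for the downloaded page content
txt <- paste0(
  "<html><body><table>",
  "<tr><th>Date</th><th>Level</th></tr>",
  "<tr><td>2012-09-15</td><td>1.23</td></tr>",
  "<tr><td>2012-09-16</td><td>1.25</td></tr>",
  "</table></body></html>"
)

# Parse the in-memory text, then pull out the first table as a data frame
doc <- htmlParse(txt, asText = TRUE)
tt <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
str(tt)
```
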
>>
>>
>>
>> Rather than knowing how to post the form, I like to read
>> the form programmatically and generate an R function to do the submission
>> for me. The RHTMLForms package can do this.
>>
>> library(RHTMLForms)
>> forms = getHTMLFormDescription(u, FALSE)
>> fun = createFunction(forms[[1]])
>>
>> Then we can use
>>
>>  fun(.curl = curl)
>>
>> instead of
>>
>>   postForm(site, disclaimer_action = "I Agree", curl = curl)
>>
>> This helps to abstract the details of the form.
>>
>>   D.
>>
>> On 9/18/12 5:57 PM, CPV wrote:
>>> Hi, I am starting to code in R, and one of the things I want to do is
>>> scrape some data from the web.
>>> The problem I am having is that I cannot get past the disclaimer
>>> page (which produces a session cookie). I have been able to collect some
>>> ideas and combine them in the code below, but I don't get past the
>>> disclaimer page.
>>> I am trying to agree to the disclaimer with postForm() and write the
>>> cookie to a file, but I cannot do it successfully.
>>> The webpage cookies are written to the file, but the value is FALSE. So,
>>> any ideas about what I should do or what I am doing wrong?
>>> Thank you for your help,
>>>
>>> library(RCurl)
>>> library(XML)
>>>
>>> site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
>>>
>>> postForm(site, disclaimer_action="I Agree")
>>>
>>> cf <- "cookies.txt"
>>>
>>> no_cookie <- function() {
>>>         curlHandle <- getCurlHandle(cookiefile=cf, cookiejar=cf)
>>>         getURL(site, curl=curlHandle)
>>>
>>>         rm(curlHandle)
>>>         gc()
>>> }
>>>
>>> if ( file.exists(cf) == TRUE ) {
>>>         file.create(cf)
>>>         no_cookie()
>>> }
>>> allTables <- readHTMLTable(site)
>>> allTables
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.



