[R] RCurl unable to download a particular web page -- what is so special about this web page?

Duncan Temple Lang duncan at wald.ucdavis.edu
Tue Jan 27 14:52:33 CET 2009



clair.crossupton at googlemail.com wrote:
> Thank you Duncan.
> 
> I remember seeing in your documentation that you have used this
> 'verbose = TRUE' argument in functions before when trying to see what
> is going on. This is good. However, I have not been able to get it to
> work for me. Does the output appear in R, or do you use some other
> external window (e.g. an MS DOS window)?
> 

The libcurl code typically defaults to printing its messages on the console,
so in the Windows GUI they will not show up. Running R from
a shell (an MS DOS window or a Unix-like shell) should
cause the output to be displayed.

A more general approach, however, is to use the debugfunction
option:

library(RCurl)

## collect libcurl's debug output in an R object rather than on the console
d <- debugGatherer()

getURL("http://uk.youtube.com",
       debugfunction = d$update, verbose = TRUE)

When this completes,

  d$value()

returns the entire contents that would otherwise have been displayed on
the console.
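
For instance, building on the example above (a small sketch; cat() simply
makes the gathered headers easier to read than the raw character vector,
and the gatherer should also provide a reset() function if you want to
clear it before another request):

  ## print the collected debug text with its line breaks intact
  cat(d$value())

  ## discard what has been gathered so far before the next request
  d$reset()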


  D.



>> library(RCurl)
>> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
>> getURL(my.url, verbose = TRUE)
> [1] ""
> 
> 
> I am having a problem with a new web page (http://uk.youtube.com/), but
> if I can get this verbose output to work, then I think I will be able to
> google the right action to take based on the information it gives.
> 
> Many thanks for your time,
> C.C.
> 
> 
> On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>> clair.crossup... at googlemail.com wrote:
>>> Dear R-help,
>>> There seems to be a web page I am unable to download using RCurl. I
>>> don't understand why it won't download:
>>>> library(RCurl)
>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>> getURL(my.url)
>>> [1] ""
>>   I like the irony that RCurl seems to have difficulties downloading an
>> article about R. Good thing it is just a matter of additional arguments
>> to getURL(), or it would be bad news.
>>
>> The followlocation parameter defaults to FALSE, so
>>
>>    getURL(my.url, followlocation = TRUE)
>>
>> gets what you want.
>>
>> The way I found this is
>>
>>   getURL(my.url, verbose = TRUE)
>>
>> and then taking a look at the information sent by R
>> and received back from the server.
>>
>> This gives
>>
>> * About to connect() to www.nytimes.com port 80 (#0)
>> *   Trying 199.239.136.200... * connected
>> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
>>  > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
>> Host: www.nytimes.com
>> Accept: */*
>>
>> < HTTP/1.1 301 Moved Permanently
>> < Server: Sun-ONE-Web-Server/6.1
>> < Date: Mon, 26 Jan 2009 16:10:51 GMT
>> < Content-length: 0
>> < Content-type: text/html
>> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
>> <
>>
>> And the 301 is the critical thing here.
>>
>>   D.
>>
>>
>>
>>> Other web pages are OK to download, but this is the first time I have
>>> been unable to download a web page using the very nice RCurl package.
>>> While I can download the web page using RDCOMClient, I would like to
>>> understand why it doesn't work as above, please?
>>>> library(RDCOMClient)
>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>> ie <- COMCreate("InternetExplorer.Application")
>>>> txt <- list()
>>>> ie$Navigate(my.url)
>>> NULL
>>>> while(ie[["Busy"]]) Sys.sleep(1)
>>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
>>>> txt
>>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
>>> 07program.html?_r=2`
>>> [1] "Skip to article Try Electronic Edition Log ...
>>> Many thanks for your time,
>>> C.C.
>>> Windows Vista, running with administrator privileges.
>>>> sessionInfo()
>>> R version 2.8.1 (2008-12-22)
>>> i386-pc-mingw32
>>> locale:
>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;
>>> LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> other attached packages:
>>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.8.1
> 



