[R] RCurl unable to download a particular web page -- what is so special about this web page?

Tue Jan 27 18:53:17 CET 2009

clair.crossupton at googlemail.com wrote:
> Cheers Duncan, that worked great
> 
>> getURL("http://uk.youtube.com", httpheader = c("User-Agent" = "R (2.8.1)"))
> [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"
> \"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">\n\n\
> [etc]
> 
> May I ask if there was a specific manual you read to learn these
> things please? I do not think i could have worked that one out on my
> own.

Unfortunately, other than reading the HTTP specification,
I don't think there is a comprehensive manual for saying
what should work and what might not.  Much of this is
subject to different levels of strictness and various
policy choices.

This particular problem, a request sent with no User-Agent,
is a somewhat common issue. So experience is a big component,
but the libcurl documentation and the mailing
lists are good resources.

It is because of these variations (different protocols,
cookies, etc.) that RCurl is necessary when
url() and download.file() don't allow enough customization.
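
As a rough sketch of that kind of customization, the two fixes
from this thread (sending a User-Agent and following redirects)
can be bundled into one helper.  The names default_agent() and
fetch_page() are made up for illustration; httpheader and
followlocation are the getURL() arguments used elsewhere in this
thread.  Nothing is downloaded until fetch_page() is called.

```r
# Build a User-Agent string of the form used in this thread, e.g. "R (2.8.1)".
default_agent <- function() {
  paste("R (", as.character(getRversion()), ")", sep = "")
}

# Hypothetical helper: fetch a page with the two options that mattered here.
fetch_page <- function(url, agent = default_agent()) {
  if (!requireNamespace("RCurl", quietly = TRUE))
    stop("the RCurl package is required")
  RCurl::getURL(url,
                httpheader = c("User-Agent" = agent),  # some servers reject requests without one
                followlocation = TRUE)                 # follow 301/302 redirects
}

# Usage (needs network access):
#   txt <- fetch_page("http://uk.youtube.com")
```

Whether a given server wants more than this (cookies, a Referer,
etc.) still has to be found out case by case.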

One useful "trick" is to find a call (be it in R or a
command-line utility such as wget or curl) that does work
for a particular URL.  Then use verbose/debug options,
or a packet sniffer such as tcpdump or wireshark, to observe
the communication that succeeds, and do the same
for the call that fails.  Comparing the two
is a general way to home in on the elements the
request needs.
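
A minimal sketch of that comparison, in base R only.  It assumes
you have captured the outgoing request headers of both calls as
strings, e.g. from d$value()["headerOut"] with debugGatherer().
The function names are hypothetical.  The failing string is the
headerOut shown in this thread; the working string is an assumed
example of the same request with a User-Agent added.

```r
# Split a raw request-header string into a named character vector of
# header fields; the request line ("GET / HTTP/1.1") is dropped.
parse_request_headers <- function(raw) {
  lines <- strsplit(raw, "\r\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]            # discard empty trailing pieces
  fields <- lines[-1]                      # everything after the request line
  keys <- sub(":.*$", "", fields)
  vals <- sub("^[^:]*:[ ]*", "", fields)
  structure(vals, names = keys)
}

# Header fields present in the working request but absent from the failing one.
missing_headers <- function(working, failing) {
  ok <- parse_request_headers(working)
  bad <- parse_request_headers(failing)
  ok[setdiff(names(ok), names(bad))]
}

# headerOut from the failing call in this thread
failing <- "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
# assumed headerOut of a call that sets a User-Agent
working <- "GET / HTTP/1.1\r\nUser-Agent: R (2.8.1)\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"

missing_headers(working, failing)
```

Here the only field the failing request lacks is User-Agent.  This
compares header names only; differing values for the same header
would need a separate check.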

  D.

> 
> Thank you again for your time,
> C.C
> 
> On 27 Jan, 16:46, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>> Some Web servers are strict. In this case, it won't accept
>> a request without being told who is asking, i.e. the User-Agent.
>>
>> If you use
>>
>>   getURL("http://www.youtube.com",
>>            httpheader = c("User-Agent" = "R (2.9.0)"))
>>
>> you should get the contents of the page as expected.
>>
>> (Or with URL uk.youtube.com, etc.)
>>
>>   D.
>>
>>
>>
>> clair.crossup... at googlemail.com wrote:
>>> Thank you. The output i get from that example is below:
>>>> d = debugGatherer()
>>>> getURL("http://uk.youtube.com",
>>> +          debugfunction = d$update, verbose = TRUE )
>>> [1] ""
>>>> d$value()
>>> text
>>> "About to connect() to uk.youtube.com port 80 (#0)\n  Trying
>>> 208.117.236.72... connected\nConnected to uk.youtube.com
>>> (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
>>> left intact\n"
>>> headerIn
>>> "HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-
>>> Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r
>>> \nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009
>>> 15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-
>>> Content-Type-Options: nosniff\r\nCache-Control: no-cache\r
>>> \nCneonction: close\r\n\r\n"
>>> headerOut
>>> "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
>>> dataIn
>>> "0\r\n\r\n"
>>> dataOut
>>> ""
>>> So the critical information from this is the '400 Bad Request'. A
>>> Google search defines this for me as:
>>>     The request could not be understood by the server due to malformed
>>>     syntax. The client SHOULD NOT repeat the request without
>>> modifications.
>>> looking through sort(both listCurlOptions() and
>>> http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm) doesn't really
>>> help me this time (unless i missed something). Any advice?
>>> Thank you for your time,
>>> C.C
>>> P.S. I can get the d/l to work if i use:
>>>> toString(readLines("http://www.uk.youtube.com"))
>>> [1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
>>> \t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
>>> 0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
>>> testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
>>> \tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
>>> new Array(16), \t\t\t\tbannersizes[0] = [etc]
>>> On 27 Jan, 13:52, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>>>> clair.crossup... at googlemail.com wrote:
>>>>> Thank you Duncan.
>>>>> I remember seeing in your documentation that you have used this
>>>>> 'verbose=TRUE' argument in functions before when trying to see what is
>>>>> going on. This is good. However, I have not been able to get it to
>>>>> work for me. Does the output appear in R or do you use some other
>>>>> external window (i.e. MS DOS window?)?
>>>> The libcurl code typically defaults to print on the console.
>>>> So on the Windows GUI, this will not show up. Using
>>>> a shell (MS DOS window or Unix-like shell) should
>>>> cause the output to be displayed.
>>>> A more general way however is to use the debugfunction
>>>> option.
>>>> d = debugGatherer()
>>>> getURL("http://uk.youtube.com",
>>>>          debugfunction = d$update, verbose = TRUE)
>>>> When this completes, use
>>>>   d$value()
>>>> and you have the entire contents that would be displayed on the console.
>>>>   D.
>>>>>> library(RCurl)
>>>>>> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...
>>>>>> getURL(my.url, verbose = TRUE)
>>>>> [1] ""
>>>>> I am having a problem with a new webpage (http://uk.youtube.com/) but
>>>>> if i can get this verbose to work, then i think i will be able to
>>>>> google the right action to take based on the information it gives.
>>>>> Many thanks for your time,
>>>>> C.C.
>>>>> On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>>>>>> clair.crossup... at googlemail.com wrote:
>>>>>>> Dear R-help,
>>>>>>> There seems to be a web page I am unable to download using RCurl. I
>>>>>>> don't understand why it won't download:
>>>>>>>> library(RCurl)
>>>>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>>>>>> getURL(my.url)
>>>>>>> [1] ""
>>>>>>   I like the irony that RCurl seems to have difficulties downloading an
>>>>>> article about R.  Good thing it is just a matter of additional arguments
>>>>>> to getURL() or it would be bad news.
>>>>>> The followlocation parameter defaults to FALSE, so
>>>>>>    getURL(my.url, followlocation = TRUE)
>>>>>> gets what you want.
>>>>>> The way I found this  is
>>>>>>   getURL(my.url, verbose = TRUE)
>>>>>> and take a look at the information being sent from R
>>>>>> and received by R from the server.
>>>>>> This gives
>>>>>> * About to connect() to www.nytimes.com port 80 (#0)
>>>>>> *   Trying 199.239.136.200... * connected
>>>>>> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
>>>>>>  > GET /2009/01/07/technology/business-computing/07program.html?_r=2
>>>>>> HTTP/1.1
>>>>>> Host: www.nytimes.com
>>>>>> Accept: */*
>>>>>> < HTTP/1.1 301 Moved Permanently
>>>>>> < Server: Sun-ONE-Web-Server/6.1
>>>>>> < Date: Mon, 26 Jan 2009 16:10:51 GMT
>>>>>> < Content-length: 0
>>>>>> < Content-type: text/html
>>>>>> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
>>>>>> <
>>>>>> And the 301 is the critical thing here.
>>>>>>   D.
>>>>>>> Other web pages are ok to download but this is the first time I have
>>>>>>> been unable to download a web page using the very nice RCurl package.
>>>>>>> While i can download the webpage using the RDCOMClient, i would like
>>>>>>> to understand why it doesn't work as above please?
>>>>>>>> library(RDCOMClient)
>>>>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>>>>>> ie <- COMCreate("InternetExplorer.Application")
>>>>>>>> txt <- list()
>>>>>>>> ie$Navigate(my.url)
>>>>>>> NULL
>>>>>>>> while(ie[["Busy"]]) Sys.sleep(1)
>>>>>>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
>>>>>>>> txt
>>>>>>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
>>>>>>> 07program.html?_r=2`
>>>>>>> [1] "Skip to article Try Electronic Edition Log ...
>>>>>>> Many thanks for your time,
>>>>>>> C.C
>>>>>>> Windows Vista, running with administrator privileges.
>>>>>>>> sessionInfo()
>>>>>>> R version 2.8.1 (2008-12-22)
>>>>>>> i386-pc-mingw32
>>>>>>> locale:
>>>>>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.
>>>>>>> 1252;LC_MONETARY=English_United Kingdom.
>>>>>>> 1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods
>>>>>>> base
>>>>>>> other attached packages:
>>>>>>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] tools_2.8.1
>>>>>>> ______________________________________________
>>>>>>> R-h... at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.