[R] RCurl unable to download a particular web page -- what is so special about this web page?

Jeffrey Horner jeff.horner at vanderbilt.edu
Mon Jan 26 17:39:51 CET 2009


Duncan Temple Lang wrote:
>
>
> clair.crossupton at googlemail.com wrote:
>> Dear R-help,
>>
>> There seems to be a web page I am unable to download using RCurl. I
>> don't understand why it won't download:
>>
>>> library(RCurl)
>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
>>> getURL(my.url)
>> [1] ""
>>
>>
>
>  I like the irony that RCurl seems to have difficulties downloading an 
> article about R.  Good thing it is just a matter of additional arguments
> to getURL() or it would be bad news.
Don't forget the irony that https is supported in url() and 
download.file() on Windows but not UNIX...

http://tolstoy.newcastle.edu.au/R/e2/devel/07/01/1634.html

Jeff
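
(For anyone who runs into that limitation, a minimal sketch of fetching an
https URL with RCurl instead of url()/download.file() could look like the
lines below. The URL and output file name are placeholders, and the
cacert.pem lookup assumes the CA bundle shipped with recent RCurl versions.)

  library(RCurl)

  ## hypothetical https URL and output file, for illustration only
  u <- "https://stat.ethz.ch/mailman/listinfo/r-help"

  ## point libcurl at a CA bundle so certificate verification can succeed;
  ## recent RCurl versions ship one under CurlSSL/
  pem <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
  txt <- getURL(u, cainfo = pem)

  ## write the body to disk, roughly what download.file() would have done
  writeLines(txt, "r-help.html")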
>
>
> The followlocation parameter defaults to FALSE, so
>
>   getURL(my.url, followlocation = TRUE)
>
> gets what you want.
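
Put together, a minimal version of the fix might look like this (maxredirs
is an optional libcurl option that caps how many redirects are followed;
the nchar() call is just a quick sanity check):

  library(RCurl)

  my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

  ## follow the 301 redirect instead of returning an empty body
  page <- getURL(my.url, followlocation = TRUE, maxredirs = 10L)

  nchar(page)   # should now be well above 0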
>
> The way I found this is
>
>  getURL(my.url, verbose = TRUE)
>
> and then taking a look at the information R sends to the server
> and what it receives back.
>
> This gives
>
> * About to connect() to www.nytimes.com port 80 (#0)
> *   Trying 199.239.136.200... * connected
> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
> Host: www.nytimes.com
> Accept: */*
>
> < HTTP/1.1 301 Moved Permanently
> < Server: Sun-ONE-Web-Server/6.1
> < Date: Mon, 26 Jan 2009 16:10:51 GMT
> < Content-length: 0
> < Content-type: text/html
> < Location: 
> http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html&OQ=_rQ3D3&OP=42fceb38Q2FQ5DuaRQ5D3-z8Q26--Q24JQ5DJCCQ7BQ5DCMQ5DC1Q5DQ24azf@-F-Q2ANQ5DRY8h@a88Q3Dz-dbYQ24h@Q2AQ5DC1bQ26-Q2AQ26Q5BdDfQ24dF 
>
> <
>
> And the 301 is the critical thing here.
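
If you'd rather capture that status and Location in R than read them off
the console, one sketch using RCurl's basicHeaderGatherer() (assuming it is
available in the installed version) is:

  library(RCurl)

  h <- basicHeaderGatherer()

  ## fetch once just to record the response headers
  getURL(my.url, headerfunction = h$update)

  ## the parsed headers should show the 301 and the glogin redirect target
  h$value()[c("status", "Location")]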
>
>  D.
>
>
>> Other web pages download fine, but this is the first time I have been
>> unable to download a web page using the very nice RCurl package. While
>> I can download the web page using RDCOMClient, I would like to
>> understand why it doesn't work as above, please.
>>
>>
>>
>>
>>> library(RDCOMClient)
>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
>>> ie <- COMCreate("InternetExplorer.Application")
>>> txt <- list()
>>> ie$Navigate(my.url)
>> NULL
>>> while(ie[["Busy"]]) Sys.sleep(1)
>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
>>> txt
>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
>> 07program.html?_r=2`
>> [1] "Skip to article Try Electronic Edition Log ...
>>
>>
>> Many thanks for your time,
>> C.C
>>
>> Windows Vista, running with administrator privileges.
>>> sessionInfo()
>> R version 2.8.1 (2008-12-22)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;
>> LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;
>> LC_TIME=English_United Kingdom.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.8.1
>>