[R] RCurl unable to download a particular web page -- what is so special about this web page?

clair.crossupton at googlemail.com
Tue Jan 27 13:25:25 CET 2009


Thank you Duncan.

I remember seeing in your documentation that you have used the
'verbose = TRUE' argument in functions before to see what is going
on. That looks useful, but I have not been able to get it to work
for me. Does the output appear in R, or in some other external
window (e.g. an MS-DOS window)?

> library(RCurl)
> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
> getURL(my.url, verbose = TRUE)
[1] ""
>
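
One possible explanation, not confirmed in this thread: libcurl writes its
verbose trace to stderr, which a GUI console on Windows may not display. A
minimal sketch of a workaround, assuming debugGatherer() is available in the
installed RCurl version, is to collect the trace in an R object instead. The
helper name fetch_with_trace is made up for illustration.

```r
# Sketch: capture libcurl's verbose trace in R rather than relying on stderr.
# fetch_with_trace() is a hypothetical helper, not part of RCurl.
fetch_with_trace <- function(url, ...) {
  d <- RCurl::debugGatherer()                 # accumulates the debug messages
  body <- RCurl::getURL(url,
                        debugfunction = d$update,  # route the trace to the gatherer
                        verbose = TRUE,
                        ...)
  list(body = body, trace = d$value())        # trace is a named character vector
}

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

# Only touch the network when run by hand at the prompt.
if (interactive()) {
  res <- fetch_with_trace(my.url)
  cat(res$trace[["headerOut"]])   # request headers sent by R
  cat(res$trace[["headerIn"]])    # response headers received from the server
}
```

If this works as assumed, the "headerIn" element should hold the server's
status line and headers, which is where a redirect would show up.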


I am having a problem with a new web page (http://uk.youtube.com/), but
if I can get verbose to work, I think I will be able to google the
right action to take based on the information it gives.

Many thanks for your time,
C.C.


On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
> clair.crossup... at googlemail.com wrote:
> > Dear R-help,
>
> > There seems to be a web page I am unable to download using RCurl. I
> > don't understand why it won't download:
>
> >> library(RCurl)
> >> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >> getURL(my.url)
> > [1] ""
>
>   I like the irony that RCurl seems to have difficulties downloading an
> article about R.  Good thing it is just a matter of additional arguments
> to getURL() or it would be bad news.
>
> The followlocation parameter defaults to FALSE, so
>
>    getURL(my.url, followlocation = TRUE)
>
> gets what you want.
>
> The way I found this is to run
>
>   getURL(my.url, verbose = TRUE)
>
> and take a look at the information being sent from R
> and received by R from the server.
>
> This gives
>
> * About to connect() to www.nytimes.com port 80 (#0)
> *   Trying 199.239.136.200... * connected
> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
>  > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
> Host: www.nytimes.com
> Accept: */*
> Accept: */*
>
> < HTTP/1.1 301 Moved Permanently
> < Server: Sun-ONE-Web-Server/6.1
> < Date: Mon, 26 Jan 2009 16:10:51 GMT
> < Content-length: 0
> < Content-type: text/html
> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
> <
>
> And the 301 is the critical thing here.
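
To make the role of the 301 concrete, here is a small sketch in base R that
reads the status code and the Location header out of the response headers
quoted above. The helper name parse_redirect is hypothetical, and the Location
value is left truncated exactly as it appears in the trace.

```r
# Response headers copied from the trace above; the Location value is
# truncated in the original message and is deliberately left that way.
header_lines <- c(
  "HTTP/1.1 301 Moved Permanently",
  "Server: Sun-ONE-Web-Server/6.1",
  "Date: Mon, 26 Jan 2009 16:10:51 GMT",
  "Content-length: 0",
  "Content-type: text/html",
  "Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t..."
)

# Hypothetical helper: extract the status code and redirect target, if any.
parse_redirect <- function(lines) {
  # The status line looks like "HTTP/1.1 301 Moved Permanently".
  status <- as.integer(sub("^HTTP/[0-9.]+ +([0-9]{3}).*$", "\\1", lines[1]))
  # Header names are case-insensitive, so match "Location" loosely.
  loc <- grep("^location:", lines, ignore.case = TRUE, value = TRUE)
  location <- if (length(loc)) sub("^[^:]+:[ \t]*", "", loc[1]) else NA_character_
  list(status = status,
       is_redirect = status %in% c(301L, 302L, 303L, 307L, 308L),
       location = location)
}

info <- parse_redirect(header_lines)
```

An empty body (Content-length: 0) together with a redirect status and a
Location header is the pattern that followlocation = TRUE is meant to handle.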
>
>   D.
>
>
>
> > Other web pages download fine; this is the first time I have been
> > unable to download a web page using the very nice RCurl package.
> > While I can download the web page using RDCOMClient, I would like
> > to understand why it doesn't work as above, please.
>
> >> library(RDCOMClient)
> >> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >> ie <- COMCreate("InternetExplorer.Application")
> >> txt <- list()
> >> ie$Navigate(my.url)
> > NULL
> >> while(ie[["Busy"]]) Sys.sleep(1)
> >> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> >> txt
> > $`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
> > [1] "Skip to article Try Electronic Edition Log ...
>
> > Many thanks for your time,
> > C.C
>
> > Windows Vista, running with administrator privileges.
> >> sessionInfo()
> > R version 2.8.1 (2008-12-22)
> > i386-pc-mingw32
>
> > locale:
> > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
>
> > other attached packages:
> > [1] RDCOMClient_0.92-0 RCurl_0.94-0
>
> > loaded via a namespace (and not attached):
> > [1] tools_2.8.1
>
> > ______________________________________________
> > R-h... at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.