[R] reading tables from url

Duncan Temple Lang duncan at wald.ucdavis.edu
Wed Nov 14 20:22:27 CET 2007


Hi Chris.

  Indeed, I cannot connect to that URL either.  So I did a bit of
digging and experimentation to find out whether one needed to
pass additional hidden options from the form or whether the problem was
more to do with how we connect.

It turns out that the script behind NCBI's lenvs.cgi is being fussy
and wants you to tell it the user agent that is performing the
request.  (Why leuks.cgi and lenvs.cgi behave differently is not clear
after a very brief look, but it is probably not worth pursuing.)

AFAIR, there is no way to tell R to include a User-Agent field in the
header of the request using url(), etc., although it did come up at one
point.

So here is an alternative: use the RCurl package, which gives you
a great deal of control over how the request is composed and how
the response is read back.

 library(RCurl)
 getURL("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
         useragent = "curl", verbose = TRUE)

(verbose = TRUE is there only to print the header of the request being
made, so that you can see the User-Agent field.)
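
To get from the text returned by getURL() to a data frame, as in your
read.delim() calls, something along these lines might work.  It is only
a sketch: parse_dump and read_ncbi_dump are made-up names, and it
assumes the server still answers at that URL and that the first line of
the dump is to be skipped, as in your example.

```r
# Hypothetical helper: parse the tab-delimited text of a dump.
# skip = 1 drops the first line, as in the read.delim() calls in the
# original post; further arguments (e.g. nrows) go to read.delim().
parse_dump <- function(txt, skip = 1, ...) {
  read.delim(text = txt, skip = skip, ...)
}

# Hypothetical helper: fetch the dump with RCurl, sending a User-Agent
# field, and hand the text to parse_dump().  Requires the RCurl package.
read_ncbi_dump <- function(url, agent = "curl", ...) {
  txt <- RCurl::getURL(url, useragent = agent)
  parse_dump(txt, ...)
}
```

A call would then look like

 x2 <- read_ncbi_dump("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
                      nrows = 5)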

One could do the same with sockets directly, or use the
httpRequest package for simple HTTP requests.
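
For the socket route, a rough sketch using only base R might look like
the following.  The function names and the "R" agent string are
arbitrary choices for illustration; it assumes plain HTTP on port 80
and a server that closes the connection after an HTTP/1.0 reply, and it
does not check the status line.

```r
# Build a minimal HTTP/1.0 GET request that includes a User-Agent field.
build_request <- function(host, path, agent = "R") {
  paste0("GET ", path, " HTTP/1.0\r\n",
         "Host: ", host, "\r\n",
         "User-Agent: ", agent, "\r\n",
         "\r\n")
}

# Drop the response header: it ends at the first empty line.
strip_headers <- function(lines) {
  blank <- which(lines == "")[1]
  if (is.na(blank)) return(character(0))
  lines[-seq_len(blank)]
}

# Send the request over a plain socket and return the body as lines.
fetch_with_agent <- function(host, path, agent = "R", port = 80) {
  con <- socketConnection(host, port = port, open = "r+", blocking = TRUE)
  on.exit(close(con))
  cat(build_request(host, path, agent), file = con)
  strip_headers(readLines(con))
}
```

The lines returned by fetch_with_agent("www.ncbi.nlm.nih.gov",
"/genomes/lenvs.cgi?dump=selected") could then be passed to
read.delim() via its text argument.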

 D.

stubben wrote:
> I'm trying to read some web tables directly into R.  These are both  
> genome sequencing projects (eukaryotes and metagenomes) from NCBI and  
> look very similar;  however, only the first one works.
> 
> http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
> http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
> 
> I added ?dump=selected to the end of the url string to get a
> tab-delimited file (which is what happens if you click the Save button
> on either page).
> 
>  > options(internet.info=0)
> 
> ## this one works
> 
>  > x1<-url("http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?dump=selected")
>  > read.delim(x1, skip=1, nrows=5)[,1:3]
> 
>    X...Columns.                     ProjectID Organism.Name
> 1        20303 Acanthamoeba castellanii Neff      Protists
> 2        13657      Acyrthosiphon pisum LSR1       Animals
> 3        12434       Aedes aegypti Liverpool       Animals
> 4        12635 Ajellomyces capsulatus G186AR         Fungi
> 5        12653  Ajellomyces capsulatus G217B         Fungi
> 
> Warning messages:
> 1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
> 2: -> GET /genomes/leuks.cgi?dump=selected HTTP/1.0
> 
> Host: www.ncbi.nlm.nih.gov
> 
> Pragma: no-cache
> 
> in: open.connection(file, "r")
> 3: <- HTTP/1.1 200 OK in: open.connection(file, "r")
> 4: <- Date: Wed, 14 Nov 2007 18:03:29 GMT in: open.connection(file, "r")
> 5: <- Server: Apache in: open.connection(file, "r")
> 6: <- Content-Disposition: attachment; filename="untitle.txt" in: open.connection(file, "r")
> 7: <- Content-Type: application/force-download in: open.connection(file, "r")
> 8: <- Vary: Accept-Encoding in: open.connection(file, "r")
> 9: <- Connection: close in: open.connection(file, "r")
> 10: Code 200, content-type 'application/force-download' in: open.connection(file, "r")
> 
> 
> ## this one fails to open a connection
> 
>  > x2<-url("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected")
>  > read.delim(x2, skip=1, nrows=5)[,1:3]
> 
> Error in open.connection(file, "r") : unable to open connection
> In addition: Warning messages:
> 1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
> 2: -> GET /genomes/lenvs.cgi?dump=selected HTTP/1.0
> 
> Host: www.ncbi.nlm.nih.gov
> 
> Pragma: no-cache
> 
> in: open.connection(file, "r")
> 3: <- HTTP/1.1 500 Internal Server Error in: open.connection(file, "r")
> 4: <- Date: Wed, 14 Nov 2007 18:04:26 GMT in: open.connection(file, "r")
> 5: <- Server: Apache in: open.connection(file, "r")
> 6: <- Content-Type: text/html; charset=ISO-8859-1 in: open.connection(file, "r")
> 7: <- Vary: Accept-Encoding in: open.connection(file, "r")
> 8: <- Connection: close in: open.connection(file, "r")
> 9: Code 500, content-type 'text/html; charset=ISO-8859-1' in: open.connection(file, "r")
> 10: cannot open: HTTP status was '500 Internal Server Error' in: open.connection(file, "r")
> 
> Also, I can't even read lines from the main page.
> 
>  > readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10)
> Error in file(con, "r") : unable to open connection
> ...
> ## now I'm just guessing...
>  > readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10, encoding="ISO-8859-1")
> Error in file(con, "r") : unable to open connection
> ...
> 
> 
> download.file works fine, but I would like to avoid this if possible.
> 
>  > capabilities()[5]
> http/ftp
>      TRUE
> 
>  > download.file("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected", "lenvs.tab")
>  > read.delim("lenvs.tab", skip=1, nrows=5)[,1:3]
>    X...Columns. Parent.ProjectID                                         ProjectID
> 1        19733            13694       Global Ocean Sampling Expedition Metagenome
> 2        20823            13696  5-Way (CG) Acid Mine Drainage Biofilm Metagenome
> 3            -            13699                Waseca County Farm Soil Metagenome
> 4            -            13702 Methane-Oxidizing Archaea from Deep-Sea Sediments
> 5            -            13729                     Pacific Beach Sand Metagenome
> 
> 
> 
> Thanks for your help.  Hopefully this is something simple that I  
> missed in the documentation/help.
> 
> Chris
> 
> 
> 
> --
> -------------------
> Chris Stubben
> 
> Los Alamos National Lab
> BioScience Division
> MS M888
> Los Alamos, NM 87545
> 