[R] reading tables from url

stubben stubben at lanl.gov
Wed Nov 14 19:49:47 CET 2007


I'm trying to read some web tables directly into R.  These are both  
genome sequencing projects (eukaryotes and metagenomes) from NCBI and  
look very similar;  however, only the first one works.

http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

I added ?dump=selected to the end of the url string to get a
tab-delimited file (which is what happens if you click the Save button
on either page).

 > options(internet.info=0)

## this one works

 > x1<-url("http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?dump=selected")
 > read.delim(x1, skip=1, nrows=5)[,1:3]

   X...Columns.                     ProjectID Organism.Name
1        20303 Acanthamoeba castellanii Neff      Protists
2        13657      Acyrthosiphon pisum LSR1       Animals
3        12434       Aedes aegypti Liverpool       Animals
4        12635 Ajellomyces capsulatus G186AR         Fungi
5        12653  Ajellomyces capsulatus G217B         Fungi

Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/leuks.cgi?dump=selected HTTP/1.0

Host: www.ncbi.nlm.nih.gov

Pragma: no-cache

in: open.connection(file, "r")
3: <- HTTP/1.1 200 OK in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:03:29 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Disposition: attachment; filename="untitle.txt" in: open.connection(file, "r")
7: <- Content-Type: application/force-download in: open.connection(file, "r")
8: <- Vary: Accept-Encoding in: open.connection(file, "r")
9: <- Connection: close in: open.connection(file, "r")
10: Code 200, content-type 'application/force-download' in: open.connection(file, "r")


## this one fails to open a connection

 > x2<-url("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected")
 > read.delim(x2, skip=1, nrows=5)[,1:3]

Error in open.connection(file, "r") : unable to open connection
In addition: Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/lenvs.cgi?dump=selected HTTP/1.0

Host: www.ncbi.nlm.nih.gov

Pragma: no-cache

in: open.connection(file, "r")
3: <- HTTP/1.1 500 Internal Server Error in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:04:26 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Type: text/html; charset=ISO-8859-1 in: open.connection(file, "r")
7: <- Vary: Accept-Encoding in: open.connection(file, "r")
8: <- Connection: close in: open.connection(file, "r")
9: Code 500, content-type 'text/html; charset=ISO-8859-1' in: open.connection(file, "r")
10: cannot open: HTTP status was '500 Internal Server Error' in: open.connection(file, "r")

Also, I can't even read lines from the main page.

 > readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10)
Error in file(con, "r") : unable to open connection
...
## now I'm just guessing...
 > readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10,  
encoding="ISO-8859-1")
Error in file(con, "r") : unable to open connection
...
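
To make failures like these easier to inspect, the read can be wrapped
in tryCatch so that the error message comes back as a value instead of
stopping the session.  This is only a minimal sketch: try_read_delim is
a hypothetical helper name (not part of base R), and it assumes the
source is tab-delimited.

```r
# Hypothetical helper (not from base R): try to read a tab-delimited
# table from a file path or URL. Returns the data frame on success, or
# the error message as a length-one character vector on failure.
try_read_delim <- function(u, ...) {
  tryCatch(
    read.delim(u, ...),
    error = function(e) conditionMessage(e)
  )
}

# Example call (commented out because it needs network access):
# res <- try_read_delim(
#   "http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
#   skip = 1, nrows = 5)
# if (is.character(res)) message("read failed: ", res)
```

Only errors are caught here; any warnings are still printed as usual.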


download.file works fine, but I would like to read the table directly
if possible.

 > capabilities()[5]
http/ftp
     TRUE

 > download.file("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected", "lenvs.tab")
 > read.delim("lenvs.tab", skip=1, nrows=5)[,1:3]
   X...Columns. Parent.ProjectID                                         ProjectID
1        19733            13694       Global Ocean Sampling Expedition Metagenome
2        20823            13696  5-Way (CG) Acid Mine Drainage Biofilm Metagenome
3            -            13699                Waseca County Farm Soil Metagenome
4            -            13702 Methane-Oxidizing Archaea from Deep-Sea Sediments
5            -            13729                     Pacific Beach Sand Metagenome
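
If download.file turns out to be the only route, the intermediate file
can at least be hidden in a small wrapper.  A minimal sketch, assuming
the download itself succeeds; read_delim_via_tempfile is a hypothetical
name of my own, not an existing function:

```r
# Hypothetical helper: download a tab-delimited file to a temporary
# location, read it with read.delim, and delete the temporary file
# when the function exits (whether or not the read succeeds).
read_delim_via_tempfile <- function(u, ...) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  download.file(u, tmp, quiet = TRUE)
  read.delim(tmp, ...)
}

# Example call (commented out because it needs network access):
# x2 <- read_delim_via_tempfile(
#   "http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
#   skip = 1, nrows = 5)
```

This still writes to disk, so it works around the problem without
explaining why url() fails on the second page.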



Thanks for your help.  Hopefully this is something simple that I  
missed in the documentation/help.

Chris



--
-------------------
Chris Stubben

Los Alamos National Lab
BioScience Division
MS M888
Los Alamos, NM 87545


