[R] troubles reading a text file

David Winsemius dwinsemius at comcast.net
Sun Dec 16 05:45:34 CET 2012


On Dec 15, 2012, at 2:23 PM, <Igor.Drobyshev2 at uqat.ca> wrote:

> Dear R experts,
> 
> For quite some time I have been trying to solve the mystery of reading a seemingly trouble-free text file. The data are a temperature reconstruction arranged as a huge grid, preceded by seven "header lines" (which you can see better if the file is opened in Firefox or Chrome).
> 
> This is the data (gridded temperature reconstruction)
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt
> 
> And this is original data description:
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt
> Basically, it says "space-delimited ASCII format" there ...
> 
> I tried this:
> Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")
> 
> But ..
> 
> 
>> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
> Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
>  empty beginning of file
> 

After inspecting a small fragment (8 MB, downloaded with an ftp client) in both Firefox and TextEdit.app, and seeing that both reported it to be UTF-16 encoded, I saved it from TextEdit as UTF-8 and could then view it with R's readLines. These are the first 7 lines and the beginning of the eighth:

> readLines("~/Downloads/temp-mon2.txt", n=10)
 [1] "NAME \"Monthly European Temperatures 1766-2000 [T=2m, Celsius]\""
 [2] "LONGITUDES\t180\t50.00W\t40.00E\t"
 [3] "LATITUDES\t100\t80.00N\t30.00N\t"
 [4] "NODATA_STRING\tNA"
 [5] "NUMBER_OF_ROWS\t2820"
 [6] "NUMBER_OF_COLUMNS\t18001\t"
 [7] ""
 [8] "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W\t79.75N/47.75W\t79.75N/47.25W\t79.75N/46.75W\t79.75N/46.25W\t79.75N/45.75W\t79.75N/45.25W\t79.75N/44.75W\t79.75N/44.25W\t79.7

As you can readily see, it is a tab-separated file. I was able to get partial success (reading the first three data lines, anyway) with:

> inp <- read.table("~/Downloads/temp-mon.txt",  nrow=3, skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")
> inp[1 , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
> inp[ , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03         -33.29         -33.41         -33.76
3 176602         -34.31         -34.40         -34.60         -34.79         -35.01         -35.13         -35.46         -35.57         -35.91
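
If you would rather not rely on a browser or text editor to identify the encoding, the byte-order mark at the start of the file can be checked from within R. What follows is only a minimal sketch, not something I ran against the NOAA file: guess_bom_encoding is a hypothetical helper, the demo file is a tiny stand-in written to a temporary path, and it assumes the data rows have no trailing tab (otherwise sep="\t" yields an extra empty column and fill=TRUE may be needed, as above). check.names=FALSE keeps the original "79.75N/49.75W" style column names.

```r
# Hypothetical helper (not from the thread): map a byte-order mark (BOM)
# to an encoding name that can be passed to read.table(fileEncoding = ).
guess_bom_encoding <- function(path) {
  bom <- readBin(path, what = "raw", n = 3L)
  if (length(bom) >= 2L && identical(bom[1:2], as.raw(c(0xFF, 0xFE)))) {
    "UTF-16LE"   # R strips the BOM itself for this encoding name
  } else if (length(bom) >= 2L && identical(bom[1:2], as.raw(c(0xFE, 0xFF)))) {
    "UTF-16"     # assumption: the platform's iconv honours the BOM here
  } else if (length(bom) >= 3L && identical(bom[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) {
    "UTF-8-BOM"
  } else {
    ""           # no BOM found: let R use the native encoding
  }
}

# Build a tiny UTF-16LE stand-in for the real file (made-up header lines).
demo_lines <- c("NAME \"demo grid\"", "NODATA_STRING\tNA", "",
                "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
                "176512\t-32.61\t-32.92",
                "176601\t-31.89\tNA")
demo_path <- tempfile(fileext = ".txt")
payload <- iconv(paste0(paste(demo_lines, collapse = "\n"), "\n"),
                 from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), payload), demo_path)

# Detect the encoding, then read the tab-separated grid below the headers.
enc <- guess_bom_encoding(demo_path)
grid <- read.table(demo_path, skip = 3, header = TRUE, sep = "\t",
                   na.strings = "NA", fileEncoding = enc,
                   check.names = FALSE)
```

For the real file the same call would presumably use skip = 7, since the column header sits on the eighth line.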

> 
> Trying read.csv gives this:
> 
> 
> Error: cannot allocate vector of size 370.5 Mb

That, on the other hand, suggests you have inadequate machine resources for this job. Perhaps you should be thinking of using tools other than R for this project ... or buying more RAM. You should probably have 32 GB for a job this size.
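
If more memory is not an option, one way to reduce the load is to read only the columns you need. By the header lines, the complete grid is 2820 rows by 18001 columns, which is roughly 400 MB once stored as doubles, and read.table typically needs a good deal more than the final size while parsing. The sketch below is an assumption-laden illustration rather than a tested recipe for this file: read_grid_subset is a hypothetical helper, it keeps the date column plus the first n_keep grid columns by setting colClasses to "NULL" for the rest, and the demo runs on a small made-up file in the native encoding.

```r
# Size of the complete grid if every cell is stored as a double (in MiB).
full_grid_mib <- 2820 * 18001 * 8 / 2^20

# Hypothetical helper (not from the thread): read the date column and the
# first n_keep grid columns, skipping all others via colClasses = "NULL".
read_grid_subset <- function(path, n_cols, n_keep, skip = 7,
                             fileEncoding = "UTF-16") {
  stopifnot(n_keep >= 1, n_keep <= n_cols - 1)
  classes <- c("integer",
               rep("numeric", n_keep),
               rep("NULL", n_cols - 1L - n_keep))
  read.table(path, skip = skip, header = TRUE, sep = "\t",
             na.strings = "NA", colClasses = classes,
             fileEncoding = fileEncoding, check.names = FALSE)
}

# Small stand-in file: two header lines, then a 5-column tab-separated grid.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("NAME \"demo\"", "",
             "YYYYMM\tA\tB\tC\tD",
             "176512\t-32.61\t-32.92\t-33.34\t-33.65",
             "176601\t-31.89\t-31.96\t-32.26\t-32.48"),
           demo_path)

# Keep YYYYMM plus the first two grid columns; fileEncoding = "" because
# the stand-in file is written in the native encoding.
subset_grid <- read_grid_subset(demo_path, n_cols = 5L, n_keep = 2L,
                                skip = 2, fileEncoding = "")
```

On the real file the call would presumably be read_grid_subset(path, n_cols = 18001L, n_keep = ...) with the defaults for skip and fileEncoding.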
> 
> I attempted to handle this by opening and resaving the file in other software, but even though I can still see the first lines of the file in the import dialog, the full reading of the file always ends with an error, possibly because of the huge number of columns ...
> 
> I believe the problem is with some special encoding but I cannot figure out how to go around it.


Partially correct, but perhaps your problems are multifactorial.

I was able to get this to "work" directly from that website:

> inp <- read.table(file=url("ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt", encoding="UTF-16"), nrow=3 , skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")

> str(inp[ , 1:10])
'data.frame':	3 obs. of  10 variables:
 $ YYYYMM        : int  176512 176601 176602
 $ X79.75N.49.75W: num  -32.6 -31.9 -34.3
 $ X79.75N.49.25W: num  -32.9 -32 -34.4
 $ X79.75N.48.75W: num  -33.3 -32.3 -34.6
 $ X79.75N.48.25W: num  -33.6 -32.5 -34.8
 $ X79.75N.47.75W: num  -34.1 -32.7 -35
 $ X79.75N.47.25W: num  -34.2 -33 -35.1
 $ X79.75N.46.75W: num  -34.6 -33.3 -35.5
 $ X79.75N.46.25W: num  -35 -33.4 -35.6
 $ X79.75N.45.75W: num  -35.4 -33.8 -35.9

-- 

David Winsemius
Alameda, CA, USA



