[R] troubles reading a text file

Jeffrey Dick j3ffdick at gmail.com
Sun Dec 16 05:30:34 CET 2012


Hi Igor,

It appears that the encoding is UTF-16.

> readLines("temp-mon.txt")
 [1] "þÿ" ""      ""      ""      ""      ""      ""      ""      ""      ""      ""      ""      ""
[14] ""      ""      ""      ""      ""      ""      ""

A search for "þÿ" leads to the Wikipedia page
http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16
section.
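
The same check can be done on the raw bytes instead of the printed
characters: "þÿ" is how the bytes FE FF look when shown as Latin-1, and
FE FF is the big-endian UTF-16 byte order mark. Below is a minimal
sketch of that idea. The helper guess_bom() and the small synthetic
file are made up for illustration (they are not part of base R or of
the original data), and the sketch ignores UTF-32.

```r
# Write a tiny big-endian UTF-16 file with a BOM to experiment on
# (a synthetic stand-in, not the real temp-mon.txt).
path <- tempfile(fileext = ".txt")
body <- iconv("YYYYMM 79.75N/49.75W\n", "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xfe, 0xff)), body), path)

# guess_bom() is a hypothetical helper, not part of base R.
# It reads the first three bytes and compares them with known BOMs.
guess_bom <- function(file) {
  b <- as.integer(readBin(file, what = "raw", n = 3))
  starts_with <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == sig)
  }
  if (starts_with(c(0xef, 0xbb, 0xbf))) {
    "UTF-8"
  } else if (starts_with(c(0xfe, 0xff))) {
    "UTF-16BE"
  } else if (starts_with(c(0xff, 0xfe))) {
    "UTF-16LE"
  } else {
    NA_character_  # no BOM recognised
  }
}

bom <- guess_bom(path)
print(bom)
```

A file without a BOM gives NA here, in which case the encoding has to
be worked out some other way.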

> options(encoding="UTF-16")
> system.time(Temperature<-read.table("temp-mon.txt",skip = 7, header = TRUE, na.strings="NA",sep=""))
   user  system elapsed
 28.556   0.112  28.712
> ncol(Temperature)
[1] 18001
> Temperature[, 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76

Note that I downloaded only the first 1 MB of the file, so it has just
two data lines after the header, yet read.table() took 28 seconds to
read it. I'm not sure how long read.table() would take on the whole
~600 MB file.
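
One variation that may be worth trying is to pass the encoding to
read.table() itself through its fileEncoding argument, rather than
changing the global option, and to declare the column types with
colClasses so that R does not have to guess them for each of the
18001 columns. The sketch below uses a small synthetic file with the
same layout (seven header lines, one line of column names, then
data); whether it is noticeably faster on the real file is untested.

```r
# Synthetic stand-in for temp-mon.txt: 7 header lines, a line of column
# names, then two data rows, written as big-endian UTF-16 with a BOM.
lines <- c(rep("header line", 7),
           "YYYYMM 79.75N/49.75W 79.75N/49.25W",
           "176512 -32.61 -32.92",
           "176601 -31.89 -31.96")
txt <- paste0(paste(lines, collapse = "\n"), "\n")
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xfe, 0xff)),
           iconv(txt, "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]]),
         path)

# fileEncoding applies to this call only; colClasses = "numeric" is
# recycled across all columns, so no type guessing is needed.
Temperature <- read.table(path, fileEncoding = "UTF-16", skip = 7,
                          header = TRUE, colClasses = "numeric")
print(Temperature)
```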

scan() might be faster, and it does not require setting
options(encoding="UTF-16"):

> system.time(Temperature <- scan("temp-mon.txt", fileEncoding="UTF-16", skip=8))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
> Temperature <- matrix(Temperature, ncol=18001, byrow=TRUE)
> Temperature.colnames <- scan("temp-mon.txt", character(), fileEncoding="UTF-16", skip=7, nmax=18001)
Read 18001 items
> colnames(Temperature) <- Temperature.colnames
> Temperature[, 1:10]
     YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W 79.75N/47.75W 79.75N/47.25W
[1,] 176512        -32.61        -32.92        -33.34        -33.65        -34.09        -34.21
[2,] 176601        -31.89        -31.96        -32.26        -32.48        -32.71        -33.03
     79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,]        -34.65        -34.98        -35.43
[2,]        -33.29        -33.41        -33.76

(Note the different colnames, similar to using check.names=FALSE in
read.table; also, the result is a matrix, not a data frame as returned
by read.table.)
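
The scan() steps can also be wrapped in a small function that takes the
number of columns from the column-name line instead of hard-coding
18001. This is only a sketch: read_grid() and its arguments are
hypothetical names, the file is a small synthetic one with the same
layout, and the function assumes every data row has as many values as
there are column names.

```r
# Synthetic stand-in for temp-mon.txt: 7 header lines, a line of column
# names, then two data rows, written as big-endian UTF-16 with a BOM.
lines <- c(rep("header line", 7),
           "YYYYMM 79.75N/49.75W 79.75N/49.25W",
           "176512 -32.61 -32.92",
           "176601 -31.89 -31.96")
txt <- paste0(paste(lines, collapse = "\n"), "\n")
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xfe, 0xff)),
           iconv(txt, "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]]),
         path)

# read_grid() is a hypothetical helper, not part of base R.
read_grid <- function(file, n_header = 7, encoding = "UTF-16") {
  # Read the single line of column names that follows the header lines.
  col_names <- scan(file, what = character(), fileEncoding = encoding,
                    skip = n_header, nlines = 1, quiet = TRUE)
  # Read everything after that line as one long numeric vector.
  values <- scan(file, fileEncoding = encoding,
                 skip = n_header + 1, quiet = TRUE)
  # Fold the vector into rows, one row per line of the file.
  m <- matrix(values, ncol = length(col_names), byrow = TRUE)
  colnames(m) <- col_names
  m
}

Temperature <- read_grid(path)
print(Temperature)
```

The result is still a numeric matrix with the original column names.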

HTH,
Jeff

On Sun, Dec 16, 2012 at 6:23 AM,  <Igor.Drobyshev2 at uqat.ca> wrote:
> Dear R experts,
>
> For quite some time I have been trying to solve the mystery of reading a seemingly trouble-free text file. The data are a temperature reconstruction arranged as a huge grid, preceded by seven "header lines" (which you can see better if the file is opened in Firefox or Chrome).
>
> This is the data (gridded temperature reconstruction)
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt
>
> And this is original data description:
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt
> Basically, it says "space-delimited ASCII format" there ...
>
> I tried this:
> Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")
>
> But ..
>
>> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
> Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
>   empty beginning of file
>
> Trying read.csv gives this:
>
> Error: cannot allocate vector of size 370.5 Mb
>
> I attempted to handle this by opening and resaving the file in other software, but even though I can still see the first lines of the file in the import dialog, the full read of the file always ends with an error, possibly because of the huge number of columns.
>
> I believe the problem is with some special encoding, but I cannot figure out how to get around it.
>
> Could some of you give me any hint on that?
>
> many thanks in advance
>
> Igor
>
> Igor Drobyshev
> Dendrochronological laboratory at Station de Recheche FERLD, director
> Chaire industrielle CRSNG-UQAT-UQAM en aménagement forestier durable
> Université du Québec en Abitibi-Témiscamingue
> 445 boul . de l'Université
> Rouyn-Noranda, QC
> Canada J9X5E4
> http://www.dendro.uqat.ca/