[R] scan() vs readChar() speed

Duncan Murdoch murdoch.duncan at gmail.com
Mon Apr 2 01:04:38 CEST 2012


On 12-04-01 2:58 AM, baptiste auguie wrote:
> Dear list,
>
> I am trying to find a fast solution to read moderately large (1 -- 10
> million entries) text files containing only tab-delimited numeric
> values. My test file is the following,
>
> nr <- 1000
> nc <- 5000
>
> m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
> write.table(m, file = "a.txt", append = FALSE,
>             row.names = FALSE, col.names = FALSE)
>
>
> scan() is faster than read.table(), as expected, but still quite slow
> compared to Matlab for example. Based on archived discussions on this
> list and Stack Overflow, I tried readChar(); it's really fast.
> However, it returns a long character string, where I really want
> numeric values. I can use as.numeric(strsplit()), but to my complete
> surprise it is faster to run scan() on this text string. Consider the
> following comparison (I call the command-line tool wc to get the size
> up front, so that memory can be allocated in one go),

Tell it the types of the columns, and it will go a bit faster.
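
A minimal sketch of what that could look like, assuming "it" covers both read.table() (through colClasses) and scan() (through what). Note that scan() already defaults to what = double(), so the larger gain is likely on the read.table() side. The small matrix and the temporary file here are stand-ins, not the original 1000 x 5000 test file.

```r
# Small stand-in for the test file from the original post (sizes are arbitrary).
nr <- 100
nc <- 50
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
f <- tempfile(fileext = ".txt")
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

# read.table() with declared column types, so it does not have to guess them.
d <- read.table(f, colClasses = "numeric", nrows = nr, comment.char = "")

# scan() with the type stated explicitly; double() is already its default.
v <- scan(f, what = double(), quiet = TRUE)

# scan() returns the values row by row, so rebuild the matrix with byrow = TRUE.
m2 <- matrix(v, nrow = nr, byrow = TRUE)
stopifnot(all.equal(unname(as.matrix(d)), m2))
```

Whether this helps noticeably on a given machine is something to measure with system.time() rather than assume.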

Duncan Murdoch

>
> load_file1 <- function(f) {
>   ## ask wc for the number of words
>   n <- scan(textConnection(system(paste("wc -w", f), intern = TRUE)),
>             what = list(integer(), character()), quiet = TRUE)[[1]]
>   ## read the file directly, capping the read at n values
>   all <- scan(f, nmax = n, quiet = TRUE)
>   invisible(all)
> }
>
> load_file2 <- function(f) {
>   ## ask wc for the number of characters
>   n <- scan(textConnection(system(paste("wc -m", f), intern = TRUE)),
>             what = list(integer(), character()), quiet = TRUE)[[1]]
>   ## read the whole file into one string, then scan that string
>   tc <- textConnection(readChar(f, n))
>   all <- scan(tc, quiet = TRUE, multi.line = FALSE)
>   close(tc)
>   invisible(all)
> }
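
The external wc call could probably be avoided. The sketch below takes the size from file.info() instead, assuming a single-byte encoding (such as the C locale reported further down) so that bytes and characters coincide. load_file3 is a hypothetical name, not something from the original post, and the small temporary file is only a stand-in for a.txt.

```r
# Hypothetical variant of load_file2: size the read with file.info()
# instead of shelling out to wc.
load_file3 <- function(f) {
  n <- file.info(f)$size                  # file size in bytes
  txt <- readChar(f, n, useBytes = TRUE)  # whole file as one string
  tc <- textConnection(txt)
  on.exit(close(tc))                      # close the connection on exit
  scan(tc, quiet = TRUE)
}

# Small stand-in for the test file from the original post.
f <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(20 * 10), 3), nrow = 20)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

x <- load_file3(f)
stopifnot(all.equal(x, scan(f, quiet = TRUE)))
```

This only removes the dependency on wc; it makes no claim about being faster than load_file2, which would need to be timed.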
>
>
> system.time(a <- load_file1("a.txt"))
> ##    user  system elapsed
> ##   7.805   0.138   8.026
> system.time(b <- load_file2("a.txt"))
> ##    user  system elapsed
> ##   2.182   0.301   2.538
> all.equal(a, b)
> ## [1] TRUE
>
>
> Could someone explain to me why it is faster to scan a textConnection
> than the original file? Have I missed a better solution?
>
> Thanks,
>
> baptiste
>
> sessionInfo()
> R version 2.15.0 RC (2012-03-29 r58868)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
