[R] read.table performance
ggrothendieck at gmail.com
Tue Dec 6 20:06:58 CET 2011
On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
> ** Disclaimer: I'm looking for general suggestions **
> I'm sorry, but can't send out the file I'm using, so there is no
> reproducible example.
> I'm using read.table and it's taking over 30 seconds to read a tiny file.
> The strange thing is that it takes roughly the same amount of time if the
> file is 100 times larger.
> After re-reviewing the R Data Import/Export manual, I think the best
> approach would be to use Python, or perhaps the readLines function, but I
> was hoping to understand why the simple read.table approach wasn't working
> as expected.
> Some relevant facts:
> 1. There are about 3700 columns. Maybe this is the problem? Still the
> file size is not very large.
> 2. The file encoding is ANSI, but I'm not specifying that in the
> function call. Setting fileEncoding="ANSI" produces an "unsupported conversion" error.
> 3. readLines imports the lines quickly
> 4. scan imports the file quickly also
> Obviously, scan and readLines would require more coding to identify
> columns, etc.
> my code:
> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
> It's taking 33.4 seconds, and the file size is only 315 KB!
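Points 3 and 4 in the quoted message note that scan and readLines are fast but leave you to reassemble the columns yourself. A minimal sketch of that scan-based route, using a small temporary file in place of the poster's C:/test.txt and assuming the file has a header row:

```r
# Stand-in for the poster's C:/test.txt (not available here).
tf <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), tf)

# Read the header and the body as flat character vectors with scan(),
# then reshape the body into a data frame row by row.
hdr  <- scan(tf, what = "", sep = "\t", nlines = 1, quiet = TRUE)
vals <- scan(tf, what = "", sep = "\t", skip = 1, quiet = TRUE)
dat  <- as.data.frame(matrix(vals, ncol = length(hdr), byrow = TRUE),
                      stringsAsFactors = FALSE)
names(dat) <- hdr
```

This reads everything as character; converting selected columns afterwards (e.g. with as.numeric) is left to the caller, which is the extra coding the poster alludes to.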
You could also try read.csv.sql in the sqldf package and see whether
that is any faster. Be sure you are using RSQLite 0.11.0 (and not an
earlier version), since earlier versions were compiled with a maximum
of 999 columns.
DF <- read.csv.sql("C:\\test.txt", header = TRUE, sep = "\t")
You may or may not need the eol= argument to specify the line
endings. See ?read.csv.sql
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
email: ggrothendieck at gmail.com