[R] Speeding reading of large file

Thu Nov 29 03:04:44 CET 2012

How long was the file that you tested?  Here is a test with a file
that is 110400 lines long with 4416 replicated headers that will have
to be removed.  Using 'text=' or textConnection is very slow for these
operations.

Writing to a temporary file can be faster for especially large files.
Notice that this is the fastest method for this file.

Here are three approaches and their times:

############################

> system.time({
+     # approach #1 - read in file and then delete rows with NAs
+     x <- read.table('/temp/text.txt', as.is = TRUE, header = TRUE)
+     # convert to numeric
+     x[] <- lapply(x, as.numeric)
+     x <- x[!is.na(x[,1]), ]
+ })
   user  system elapsed
   0.70    0.00    0.72
Warning messages:
1: In lapply(x, as.numeric) : NAs introduced by coercion
2: In lapply(x, as.numeric) : NAs introduced by coercion
3: In lapply(x, as.numeric) : NAs introduced by coercion
4: In lapply(x, as.numeric) : NAs introduced by coercion
5: In lapply(x, as.numeric) : NAs introduced by coercion
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: num  1 1 1 1 1 1 1 1 1 1 ...
 $ b: num  2 2 2 2 2 2 2 2 2 2 ...
 $ c: num  3 3 3 3 3 3 3 3 3 3 ...
 $ d: num  4 4 4 4 4 4 4 4 4 4 ...
 $ e: num  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920
>
> system.time({
+     # approach #2 -- read the lines, delete header, rewrite to temp file
+     # and then read in with read.table
+     x <- readLines('/temp/text.txt')
+     firstLine <- x[1L]  # save header since deleted by 'grepl'
+     x <- c(firstLine, x[grepl("^[0-9]", x)])  # accept only lines
that start with numeric
+     temp <- tempfile()
+     writeLines(x, temp)
+     x <- read.table(temp, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed
   0.55    0.02    0.56
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920

> system.time({
+     # approach #3 -- read the lines, delete header, then use 'text'
on read.table
+     x <- readLines('/temp/text.txt')
+     firstLine <- x[1L]
+     x <- c(firstLine, x[grepl("^[0-9]", x)])
+     x <- read.table(text = x, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed
  29.01    0.01   29.62
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920

On Wed, Nov 28, 2012 at 7:01 PM, Nordlund, Dan (DSHS/RDA)
<NordlDJ at dshs.wa.gov> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Fisher Dennis
>> Sent: Wednesday, November 28, 2012 11:42 AM
>> To: dcarlson at tamu.edu
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Speeding reading of large file
>>
>> An interesting approach -- I lose the column names (which I need) but I
>> could get them with something cute such as:
>>       1.  read the first few lines only with readLines(FILENAME, n=10)
>>       2.  use your approach to read.table -- this will grab the column
>> names
>>       3.  replace the headers in the full version with the correct
>> column names
>>
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>>
>> On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
>>
>> > Using your first approach, this should be faster
>> >
>> > raw <- readLines(con=filename)
>> > dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
>> >
>> > ----------------------------------------------
>> > David L Carlson
>> > Associate Professor of Anthropology
>> > Texas A&M University
>> > College Station, TX 77843-4352
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> >> project.org] On Behalf Of Fisher Dennis
>> >> Sent: Wednesday, November 28, 2012 11:43 AM
>> >> To: r-help at r-project.org
>> >> Subject: [R] Speeding reading of large file
>> >>
>> >> R 2.15.1
>> >> OS X and Windows
>> >>
>> >> Colleagues,
>> >>
>> >> I have a file that looks that this:
>> >> TABLE NO.  1
>> >> PTID        TIME        AMT         FORM        PERIOD      IPRED
>> >> CWRES       EVID        CP          PRED        RES         WRES
>> >>  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>
>> >> The file is reasonably large (> 10^6 lines) and the two line header
>> is
>> >> repeated periodically in the file.
>> >> I need to read this file in as a data frame.  Note that the number
>> of
>> >> columns, the column headers, and the number of replicates of the
>> >> headers are not known in advance.
>> >>
>> >> I have tried two approaches to this:
>> >>    First Approach:
>> >>            1.  readLines(FILENAME) to read in the file
>> >>            2.  use grep to find the repeat headers; strip out the
>> >> repeat headers
>> >>            3.  write() the object to tempfile, read in that temporary
>> >> file using read.table(tempfile, header=TRUE, skip=1) [an alternative
>> is
>> >> to use textConnection but that does not appear to speed things]
>> >>
>> >>    Second Approach:
>> >>            1.  TEMP        <- read.table(FILENAME, header=TRUE, skip=1,
>> >> fill=TRUE, as.is=TRUE)
>> >>            2.  get rid of the errant entries with:
>> >>                    TEMP[!is.na(as.numeric(TEMP[,1])),]
>> >>            3.  reading of the character entries forced all columns to
>> >> character mode.  Therefore, I convert each column to numeric:
>> >>                    for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
>> >> as.numeric(TEMP[,COL])
>> >> The second approach is ~ 20% faster than the first.  With the second
>> >> approach, the conversion to numeric occupies 50% of the elapsed
>> time.
>> >>
>> >> Is there some approach that would be much faster?  For example,
>> would a
>> >> vectorized approach to conversion to numeric improve throughput?
>> Or,
>> >> is there some means to ensure that all data are read as numeric (I
>> >> tried to use colClasses but that triggered an error when the text
>> >> string was encountered).
>> >>
>> >> ############################
>> >> A dput version of the data is:
>> >> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> >> RES         WRES",
>> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00"
>> >> )
>> >>
>> >> This can be assembled into a large dataset and written to a file
>> named
>> >> FILENAME with the following code:
>> >> cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> >> RES         WRES",
>> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00"
>> >> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>> >>
>> >>
>> >> Dennis
>> >>
>> >>
>
> Dennis,
>
> I used your code to create the test file, and then used two different method to read the file
>
> # method 1
> system.time({
> fisher <- read.table('c:/tmp/fisher.txt', header=TRUE,skip=1,fill=TRUE, as.is=TRUE)
> fisher <- data.frame(apply(fisher,2,as.numeric))
> fisher <- fisher[!is.na(fisher$PTID),]
> })
>    user  system elapsed
>    0.14    0.00    0.14
> There were 12 warnings (use warnings() to see them)
>
> # method 2
> system.time({
> raw <- readLines(con='c:/tmp/fisher.txt')
> fisher2 <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE, fill=TRUE)
> names <- read.table('c:/tmp/fisher.txt',header=TRUE,skip=1,nrows=1)
> colnames(fisher2) <- colnames(names)
> })
>    user  system elapsed
>    1.31    0.00    1.31
>
> Method 1 was substantially faster than method 2.  One thing I don't like about method 1 is the warnings (about NA's being created by as.numeric).  However they are essentially harmless.
>
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.