[R] Speeding reading of large file

jim holtman jholtman at gmail.com
Thu Nov 29 03:04:44 CET 2012


How long was the file that you tested?  Here is a test with a file
that is 110400 lines long with 4416 replicated headers that will have
to be removed.  Using 'text=' or textConnection is very slow for these
operations.

Writing to a temporary file can be faster for especially large files.
Notice that this is the fastest method for this file.

Here are three approaches and their times:

############################

> system.time({
+     # approach #1 - read in file and then delete rows with NAs
+     x <- read.table('/temp/text.txt', as.is = TRUE, header = TRUE)
+     # convert to numeric
+     x[] <- lapply(x, as.numeric)
+     x <- x[!is.na(x[,1]), ]
+ })
   user  system elapsed
   0.70    0.00    0.72
Warning messages:
1: In lapply(x, as.numeric) : NAs introduced by coercion
2: In lapply(x, as.numeric) : NAs introduced by coercion
3: In lapply(x, as.numeric) : NAs introduced by coercion
4: In lapply(x, as.numeric) : NAs introduced by coercion
5: In lapply(x, as.numeric) : NAs introduced by coercion
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: num  1 1 1 1 1 1 1 1 1 1 ...
 $ b: num  2 2 2 2 2 2 2 2 2 2 ...
 $ c: num  3 3 3 3 3 3 3 3 3 3 ...
 $ d: num  4 4 4 4 4 4 4 4 4 4 ...
 $ e: num  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920
>
> system.time({
+     # approach #2 -- read the lines, delete header, rewrite to temp file
+     # and then read in with read.table
+     x <- readLines('/temp/text.txt')
+     firstLine <- x[1L]  # save header since deleted by 'grepl'
+     x <- c(firstLine, x[grepl("^[0-9]", x)])  # accept only lines
that start with numeric
+     temp <- tempfile()
+     writeLines(x, temp)
+     x <- read.table(temp, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed
   0.55    0.02    0.56
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920

> system.time({
+     # approach #3 -- read the lines, delete header, then use 'text'
on read.table
+     x <- readLines('/temp/text.txt')
+     firstLine <- x[1L]
+     x <- c(firstLine, x[grepl("^[0-9]", x)])
+     x <- read.table(text = x, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed
  29.01    0.01   29.62
> str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
> colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920


On Wed, Nov 28, 2012 at 7:01 PM, Nordlund, Dan (DSHS/RDA)
<NordlDJ at dshs.wa.gov> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Fisher Dennis
>> Sent: Wednesday, November 28, 2012 11:42 AM
>> To: dcarlson at tamu.edu
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Speeding reading of large file
>>
>> An interesting approach -- I lose the column names (which I need) but I
>> could get them with something cute such as:
>>       1.  read the first few lines only with readLines(FILENAME, n=10)
>>       2.  use your approach to read.table -- this will grab the column
>> names
>>       3.  replace the headers in the full version with the correct
>> column names
>>
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>>
>> On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
>>
>> > Using your first approach, this should be faster
>> >
>> > raw <- readLines(con=filename)
>> > dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
>> >
>> > ----------------------------------------------
>> > David L Carlson
>> > Associate Professor of Anthropology
>> > Texas A&M University
>> > College Station, TX 77843-4352
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> >> project.org] On Behalf Of Fisher Dennis
>> >> Sent: Wednesday, November 28, 2012 11:43 AM
>> >> To: r-help at r-project.org
>> >> Subject: [R] Speeding reading of large file
>> >>
>> >> R 2.15.1
>> >> OS X and Windows
>> >>
>> >> Colleagues,
>> >>
>> >> I have a file that looks that this:
>> >> TABLE NO.  1
>> >> PTID        TIME        AMT         FORM        PERIOD      IPRED
>> >> CWRES       EVID        CP          PRED        RES         WRES
>> >>  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00
>> >>
>> >> The file is reasonably large (> 10^6 lines) and the two line header
>> is
>> >> repeated periodically in the file.
>> >> I need to read this file in as a data frame.  Note that the number
>> of
>> >> columns, the column headers, and the number of replicates of the
>> >> headers are not known in advance.
>> >>
>> >> I have tried two approaches to this:
>> >>    First Approach:
>> >>            1.  readLines(FILENAME) to read in the file
>> >>            2.  use grep to find the repeat headers; strip out the
>> >> repeat headers
>> >>            3.  write() the object to tempfile, read in that temporary
>> >> file using read.table(tempfile, header=TRUE, skip=1) [an alternative
>> is
>> >> to use textConnection but that does not appear to speed things]
>> >>
>> >>    Second Approach:
>> >>            1.  TEMP        <- read.table(FILENAME, header=TRUE, skip=1,
>> >> fill=TRUE, as.is=TRUE)
>> >>            2.  get rid of the errant entries with:
>> >>                    TEMP[!is.na(as.numeric(TEMP[,1])),]
>> >>            3.  reading of the character entries forced all columns to
>> >> character mode.  Therefore, I convert each column to numeric:
>> >>                    for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
>> >> as.numeric(TEMP[,COL])
>> >> The second approach is ~ 20% faster than the first.  With the second
>> >> approach, the conversion to numeric occupies 50% of the elapsed
>> time.
>> >>
>> >> Is there some approach that would be much faster?  For example,
>> would a
>> >> vectorized approach to conversion to numeric improve throughput?
>> Or,
>> >> is there some means to ensure that all data are read as numeric (I
>> >> tried to use colClasses but that triggered an error when the text
>> >> string was encountered).
>> >>
>> >> ############################
>> >> A dput version of the data is:
>> >> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> >> RES         WRES",
>> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00"
>> >> )
>> >>
>> >> This can be assembled into a large dataset and written to a file
>> named
>> >> FILENAME with the following code:
>> >> cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> >> RES         WRES",
>> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
>> 0.0000E+00
>> >> 0.0000E+00",
>> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
>> 0.0000E+00
>> >> 0.0000E+00"
>> >> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>> >>
>> >>
>> >> Dennis
>> >>
>> >>
>
> Dennis,
>
> I used your code to create the test file, and then used two different method to read the file
>
> # method 1
> system.time({
> fisher <- read.table('c:/tmp/fisher.txt', header=TRUE,skip=1,fill=TRUE, as.is=TRUE)
> fisher <- data.frame(apply(fisher,2,as.numeric))
> fisher <- fisher[!is.na(fisher$PTID),]
> })
>    user  system elapsed
>    0.14    0.00    0.14
> There were 12 warnings (use warnings() to see them)
>
> # method 2
> system.time({
> raw <- readLines(con='c:/tmp/fisher.txt')
> fisher2 <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE, fill=TRUE)
> names <- read.table('c:/tmp/fisher.txt',header=TRUE,skip=1,nrows=1)
> colnames(fisher2) <- colnames(names)
> })
>    user  system elapsed
>    1.31    0.00    1.31
>
> Method 1 was substantially faster than method 2.  One thing I don't like about method 1 is the warnings (about NA's being created by as.numeric).  However they are essentially harmless.
>
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.




More information about the R-help mailing list