[R] speeding read.table

Thu Oct 18 20:27:21 CEST 2012

Hello,

With the chunked function in this message the time is down by a factor 
of 4. It still takes about 2 minutes for a 380 MB file (3.6M lines), so 
maybe system commands (awk?) can do the job better.
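
On the system-command idea, here is a rough sketch that I have not timed on the big file. It assumes a Unix-like system with grep on the PATH (awk could play the same role). The name read_clean and the three-column sample are made up for illustration; the real file has 12 columns.

```r
# Hypothetical helper: let grep drop the header lines before R parses anything.
# Assumes a Unix-like system with grep available (e.g. OS X or Linux).
read_clean <- function(infile) {
    cmd <- paste("grep -v -E 'TABLE|COL'", shQuote(infile))
    # pipe() runs the command and hands its output to R as a connection;
    # read.table skips the remaining blank lines by default.
    read.table(pipe(cmd), colClasses = "numeric")
}

# Small made-up sample in the layout described in this thread (two tables).
sample_lines <- c(
    "TABLE NO.  1",
    " COL1        COL2        COL3",
    "  1.0010E+05  0.0000E+00 -1.0000E+00",
    "  1.0010E+05  1.0001E+01  2.2737E-14",
    "",
    "TABLE NO.  1",
    " COL1        COL2        COL3",
    "  1.0010E+05  2.4000E+01 -5.7541E-15"
)
infile <- tempfile()
writeLines(sample_lines, infile)

dat <- read_clean(infile)
```

If this works on your machine, the cleaned copy never has to be written to disk, since read.table consumes the filtered stream directly.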

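# Chunked cleaning: read 'lines' lines at a time, drop the header lines,
# and append the numeric rows to 'outfile'.
fun <- function(infile, outfile, lines = 10000L){
     # Drop the "TABLE NO." and column-header lines. A logical index is
     # used because x[-integer(0)] returns an empty vector, which would
     # discard a chunk that happens to contain no header lines.
     remove <- function(x) x[ !grepl("TABLE|COL", x) ]
     fin <- file(infile, open = "rt")
     on.exit(close(fin))
     while(TRUE){
         # readLines returns character(0) at end of file, not an error
         x <- readLines(fin, n = lines)
         if(length(x) == 0) return(invisible(NULL))
         y <- remove(x[ x != "" ])
         # a chunk may hold only blank/header lines: skip it, don't stop
         if(length(y) == 0) next
         lst <- lapply(strsplit(y, " "), function(.y)
             as.numeric(.y[ .y != "" ]))
         mat <- do.call(rbind, lst)
         write.table(mat, outfile, append = TRUE, row.names = FALSE,
             col.names = FALSE)
     }
}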

fun("test", "clean")
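
Reading the cleaned file back should then be cheap. A minimal sketch, assuming the output holds only numeric columns; the temporary file and its three columns are a stand-in I made up for "clean", which would have 12 columns. ?read.table suggests giving colClasses and setting comment.char = "" for speed, and scan() is an alternative when a numeric matrix is enough.

```r
# Stand-in for the cleaned file: numbers only, no header lines.
clean <- tempfile()
writeLines(c("100100 0 -1",
             "100100 10.001 2.2737e-14",
             "100100 24 -5.7541e-15"), clean)

ncols <- 3L   # 12 in the real file; 3 here to keep the sample short

# read.table with the column types declared up front and comment
# scanning switched off.
dat <- read.table(clean, colClasses = "numeric", comment.char = "")

# scan() reads one long numeric vector, reshaped row by row into a matrix.
mat <- matrix(scan(clean, what = numeric(), quiet = TRUE),
              ncol = ncols, byrow = TRUE)
```

Both calls give the same numbers; which one is faster on the 380 MB file would have to be timed with system.time.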

Hope this helps,

Rui Barradas
On 18-10-2012 18:14, Rui Barradas wrote:
> Hello,
>
> The problem doesn't seem to be memory swaps. I've tried with a 380 MB 
> file (3.6M lines) and it took around 8.5 minutes. I'll think of 
> something else and write back.
>
> Rui Barradas
> On 18-10-2012 16:42, Fisher Dennis wrote:
>> Rui
>>
>> I tried something similar to this.  To my surprise, it was quite slow 
>> (it is still running after many minutes).  I suspect that 
>> textConnection is slow compared to actually reading from the drive.  
>> Another possibility is that the object is so large that it is being 
>> swapped in and out of virtual memory; however, this machine has 
>> 12 GB RAM, so this seems unlikely.
>>
>> Dennis
>>
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>>
>> On Oct 18, 2012, at 8:35 AM, Rui Barradas wrote:
>>
>>> Hello,
>>>
>>> Try the following, reading your file into 'x' using readLines.
>>>
>>>
>>>
>>> tc <- textConnection("
>>> TABLE NO.  1
>>> COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>>   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>>   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>>   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>>>
>>> TABLE NO.  1
>>> COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>>   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>>   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>>   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>>> ")
>>>
>>> x <- readLines(tc)
>>> close(tc)
>>>
>>> #------------------------ starts here
>>> x <- x[ x != "" ]
>>>
>>> i1 <- grep("TABLE", x)
>>> i2 <- grep("COL", x)
>>> y <- x[-c(i1, i2)]
>>>
>>> tc <- textConnection(y)
>>> dat <- read.table(tc)
>>> close(tc)
>>>
>>> cnames <- unlist(strsplit(x[2], " "))
>>> names(dat) <- cnames[cnames != ""]
>>>
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>> On 18-10-2012 14:57, Fisher Dennis wrote:
>>>> R 2.15.1
>>>> OS X
>>>>
>>>> Colleagues,
>>>>
>>>> I am reading a 1 GB file into R using read.table.  The file 
>>>> consists of 100 tables, each of which is headed by two lines of 
>>>> characters.
>>>> The first of these lines is:
>>>>     TABLE NO.  1
>>>> The second is a list of column headers.
>>>>
>>>> For example:
>>>> TABLE NO.  1
>>>>   COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>>>    1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>>>    1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>>>    1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>>>>
>>>> Later something similar appears:
>>>> TABLE NO.  1
>>>>   COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>>>    1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>>>    1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>>>    1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>>>>
>>>> I will use the term "problematic lines" to refer to the repeated 
>>>> occurrences of the two non-data lines
>>>>
>>>> read.table is not successful in reading the table because of these 
>>>> problematic lines (I get around the first "TABLE NO." line using 
>>>> the skip option)
>>>>
>>>> My work-around has been to:
>>>>     1.  read the table with readLines
>>>>     2.  remove the problematic lines
>>>>     3.  write the file to disk
>>>>     4.  read the file with read.table.
>>>> However, this process is slow.
>>>>
>>>> I thought about using "comment.char" as a means of avoiding reading 
>>>> the problematic lines.  However, comment.char takes a single 
>>>> character, not a pattern such as "[A-Z]".
>>>>
>>>> Are there any clever workarounds for this?
>>>>
>>>> Dennis
>>>>
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>