[R] read big text file into R

Yupu Liang liang at cbio.mskcc.org
Thu Aug 23 20:29:24 CEST 2007


Dear Rs:

Hi, I am trying to read a big text file (nrows=243440, ncols=144). It
seems the run time of all the read functions
(scan, read.table, read.delim) is not linear in the number of rows I
want to read in: things became really slow once I tried to read in
100000 lines compared with 10000 lines.

If I am reading the profiling results right, most of the time is spent
inside scan, so I guess calling scan directly wouldn't help either.

My questions are:
1) Is this a memory issue?
2) How can I get around this? I can't just sit around for 15 mins.
Would writing a C function help?

Thanks!

Here is the profiling I did:

 > Rprof()
 > dd = read.delim(file,skip=9,sep="\t",as.is= T,nrows=10000)
 > Rprof(NULL)
 > summaryRprof()
$by.self
                self.time self.pct total.time total.pct
"scan"              3.56     85.2       3.56      85.2
"type.convert"      0.48     11.5       0.48      11.5
"read.table"        0.08      1.9       4.18     100.0
"make.names"        0.02      0.5       0.02       0.5
"options"           0.02      0.5       0.02       0.5
"readLines"         0.02      0.5       0.02       0.5
"read.delim"        0.00      0.0       4.18     100.0
"file"              0.00      0.0       0.02       0.5
"getOption"         0.00      0.0       0.02       0.5

$by.total
                total.time total.pct self.time self.pct
"read.table"         4.18     100.0      0.08      1.9
"read.delim"         4.18     100.0      0.00      0.0
"scan"               3.56      85.2      3.56     85.2
"type.convert"       0.48      11.5      0.48     11.5
"make.names"         0.02       0.5      0.02      0.5
"options"            0.02       0.5      0.02      0.5
"readLines"          0.02       0.5      0.02      0.5
"file"               0.02       0.5      0.00      0.0
"getOption"          0.02       0.5      0.00      0.0

$sampling.time
[1] 4.18

 > ?Rprof()
 > Rprof()
 > dd = read.delim(file,skip=9,sep="\t",as.is= T,nrows=100000)
 > Rprof(NULL)
 > summaryRprof()
$by.self
                  self.time self.pct total.time total.pct
"scan"              143.12     92.7     143.12      92.7
"type.convert"        9.52      6.2       9.52       6.2
"read.table"          1.60      1.0     154.28      99.9
"paste"               0.02      0.0       0.08       0.1
"textConnection"      0.02      0.0       0.04       0.0
".deparseOpts"        0.02      0.0       0.02       0.0
"file"                0.02      0.0       0.02       0.0
"make.names"          0.02      0.0       0.02       0.0
"print.default"       0.02      0.0       0.02       0.0
"read.delim"          0.00      0.0     154.28      99.9
"doTryCatch"          0.00      0.0       0.08       0.1
"gsub"                0.00      0.0       0.08       0.1
"try"                 0.00      0.0       0.08       0.1
"tryCatch"            0.00      0.0       0.08       0.1
"tryCatchList"        0.00      0.0       0.08       0.1
"tryCatchOne"         0.00      0.0       0.08       0.1
"capture.output"      0.00      0.0       0.06       0.0
"deparse"             0.00      0.0       0.02       0.0
"eval.with.vis"       0.00      0.0       0.02       0.0
"evalVis"             0.00      0.0       0.02       0.0
"print"               0.00      0.0       0.02       0.0

$by.total
                  total.time total.pct self.time self.pct
"read.table"         154.28      99.9      1.60      1.0
"read.delim"         154.28      99.9      0.00      0.0
"scan"               143.12      92.7    143.12     92.7
"type.convert"         9.52       6.2      9.52      6.2
"paste"                0.08       0.1      0.02      0.0
"doTryCatch"           0.08       0.1      0.00      0.0
"gsub"                 0.08       0.1      0.00      0.0
"try"                  0.08       0.1      0.00      0.0
"tryCatch"             0.08       0.1      0.00      0.0
"tryCatchList"         0.08       0.1      0.00      0.0
"tryCatchOne"          0.08       0.1      0.00      0.0
"capture.output"       0.06       0.0      0.00      0.0
"textConnection"       0.04       0.0      0.02      0.0
".deparseOpts"         0.02       0.0      0.02      0.0
"file"                 0.02       0.0      0.02      0.0
"make.names"           0.02       0.0      0.02      0.0
"print.default"        0.02       0.0      0.02      0.0
"deparse"              0.02       0.0      0.00      0.0
"eval.with.vis"        0.02       0.0      0.00      0.0
"evalVis"              0.02       0.0      0.00      0.0
"print"                0.02       0.0      0.00      0.0

$sampling.time
[1] 154.36
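
To put a number on "not linear", here is a small sketch that uses only the two $sampling.time values reported above. The variable names are mine, and the exponent is fitted through just two points, so it is a rough indication rather than a measurement.

```r
# Figures copied from the two summaryRprof() outputs above.
rows    <- c(10000, 100000)
seconds <- c(4.18, 154.36)   # $sampling.time for each run

row_ratio  <- rows[2] / rows[1]         # 10 times the rows
time_ratio <- seconds[2] / seconds[1]   # about 36.9 times the time

# Seconds per 1000 rows: this would be constant if the read were linear.
per_1000 <- seconds / (rows / 1000)

# Exponent k in time ~ rows^k implied by these two points only.
k <- log(time_ratio) / log(row_ratio)

print(round(c(row_ratio = row_ratio, time_ratio = time_ratio, k = k), 2))
print(round(per_1000, 3))
```

Ten times the rows cost roughly 37 times the time, so the per-row cost went up instead of staying flat.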

I am using R 2.5.1 for Mac on a dual 2 GHz PowerPC G5 with 1 GB of memory.
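
For anyone who wants to check the scaling without my file, below is a minimal sketch on a small synthetic file. The file contents, the sizes, and the all-numeric column types are assumptions made up for illustration; my real file has 9 lines to skip and column types that may differ. The second read uses colClasses, a documented read.table argument, to declare the column types up front. That avoids the type guessing done by type.convert, which is 6 to 12 percent of the time in the profiles above. Whether it also changes the time spent in scan on the real file is something I have not measured.

```r
# Hypothetical stand-in for the real file: tab-delimited, header row,
# all-numeric columns. The real file's column types are not known here.
set.seed(1)
n_rows <- 2000
n_cols <- 20
m <- matrix(round(runif(n_rows * n_cols), 3), nrow = n_rows)
colnames(m) <- paste("V", seq_len(n_cols), sep = "")
tmp <- tempfile(fileext = ".txt")
write.table(m, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Time the same call at increasing nrows to see how the cost scales.
sizes <- c(500, 1000, 2000)
elapsed <- sapply(sizes, function(n) {
  system.time(read.delim(tmp, as.is = TRUE, nrows = n))[["elapsed"]]
})
print(data.frame(nrows = sizes, elapsed = elapsed))

# Same read with the column types declared, so read.table does not
# have to guess them after reading everything in as character.
dd <- read.delim(tmp, as.is = TRUE, nrows = n_rows,
                 colClasses = rep("numeric", n_cols))
print(dim(dd))

unlink(tmp)
```

On a file this small the timings will be close to zero; the sizes vector would need to be raised toward the real row counts to see any curve.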


