[R] read.delim very slow in reading files with lots of columns

Benilton Carvalho bcarvalh at jhsph.edu
Fri Sep 25 21:26:38 CEST 2009


or that! :-D thanks jim.
b
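
For the archive, a rough sketch of how the scan() result could be turned back into a matrix. It assumes a purely numeric, tab-delimited file with one header line, as described in this thread. The helper name read_numeric_matrix and the tiny demo file are made up for illustration; nothing here was timed on the real 600 x 700000 data.

```r
# Hypothetical helper (not part of base R): read a purely numeric,
# delimited text file with scan() and reshape the result into a matrix.
read_numeric_matrix <- function(path, sep = "\t", header = TRUE) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the first line separately: it gives the number of columns and,
  # if header = TRUE, the column names.
  first <- strsplit(readLines(con, n = 1), sep, fixed = TRUE)[[1]]

  # scan() continues from the current position of the open connection.
  values <- scan(con, what = 0, sep = sep, quiet = TRUE)
  if (!header) values <- c(as.numeric(first), values)

  # scan() returns values in file order (row by row), hence byrow = TRUE.
  m <- matrix(values, ncol = length(first), byrow = TRUE)
  if (header) colnames(m) <- first
  m
}

# Small made-up demo file: one header line plus two numeric rows.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), demo_file)
m <- read_numeric_matrix(demo_file)
print(m)
```

Reading the header with readLines() first keeps scan(what = 0) from tripping over non-numeric column names.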

On Sep 25, 2009, at 3:57 PM, jim holtman wrote:

> Here is how much time it took to read a file with 10 lines and 700,000
> columns per line separated with comma:
>
>> system.time(input <- scan("/tempxx.txt", what=0, sep=','))
> Read 7000000 items
>   user  system elapsed
>  15.62    0.22   15.84
>> object.size(input)
> 56000024 bytes
>>
>
> 'scan' should be sufficient and it will not take another 10 minutes  
> in awk.
>
> On Fri, Sep 25, 2009 at 1:17 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>> On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:
>>
>>> Thanks, Ben.
>>>
>>> The matrix is a pure numeric matrix (6x700000, 31mb).
>>> I tried colClasses='numeric' as well as nrows=7 (one of these is the
>>> header line) on the matrix.
>>> I also tested it without setting the two options in read.delim().
>>
>>
>> A couple of things come to mind.
>>
>> First, I have not read the internals of scan, but suspect that parsing
>> a really long line may be slowing things down.
>>
>> Since you are attempting to read in a numeric matrix, you can simply do
>> a global replacement of your delimiter with a newline and use scan on
>> the result. On unix-like systems, something like
>>
>>       tmp <- scan( pipe( 'tr "\t" "\n"  < test_data.txt' ) )
>>
>> ought to help.
>>
>> Second, the memory occupied by each line - once it has been processed -
>> is spread over the full 32MB (or 3.2 GB for the 600 by 700000 version)
>> region of memory. I am guessing that this is causing your cache to work
>> hard to put it in place.
>>
>> If you really want the result to be a 600 by 700000 matrix, you might
>> try to read it in smaller blocks using scan( pipe( "cut ... " ) ) to
>> feed selected blocks of columns of your text file to R.
>>
>> HTH,
>>
>> Chuck
>>
>>
>>>
>>> Here is the time spent on reading the matrix for each test.
>>>
>>>> system.time( tmp <- read.delim("test_data.txt"))
>>>
>>>   user    system   elapsed
>>> 50985.421    27.665 51013.384
>>>
>>>> system.time(tmp <-
>>>> read.delim("test_data.txt",colClasses="numeric",nrows=7,comment.char=""))
>>>
>>>   user    system   elapsed
>>> 51301.563    60.491 51362.208
>>>
>>> It seems setting the options does not speed up the reading at all.
>>> Is it because of the header line? I will test it.
>>> Did I misunderstand something?
>>>
>>> One additional and interesting observation:
>>> The one with the options does save a lot of memory. It took ~150mb,
>>> while the other took ~4GB for reading the matrix.
>>>
>>> I will try the scan() and see if it helps.
>>>
>>> Thanks!
>>> Mike
>>>
>>>
>>> -----Original Message-----
>>> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
>>> Sent: Wednesday, September 23, 2009 4:56 PM
>>> To: Ping-Hsun Hsieh
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] read.delim very slow in reading files with lots of
>>> columns
>>>
>>> use the 'colClasses' argument and you can also set 'nrows'.
>>>
>>> b
>>>
>>> On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I am trying to read a tab-delimited file into R (Ver. 2.8). The
>>>> machine I am using is 64bit Linux with 16 GB.
>>>>
>>>> The file is basically a matrix(~600x700000) and as large as 3GB.
>>>>
>>>>
>>>>
>>>> read.delim() ran extremely slowly (hours) even with a subset of
>>>> the file (31 MB with 6x700000).
>>>>
>>>> I monitored the memory usage, and found it constantly took less
>>>> than 1% of the 16GB memory.
>>>>
>>>> Does read.delim() have difficulty reading files with lots of columns?
>>>>
>>>> Any suggestions?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:cberry at tajo.ucsd.edu               UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>



