[R] How to separate huge dataset into chunks

Thomas Lumley tlumley at u.washington.edu
Thu Mar 26 08:34:47 CET 2009


On Wed, 25 Mar 2009, Guillaume Filteau wrote:

> Hello Thomas,
>
> Thanks for your help!
>
> Sadly, your code does not work for the last chunk, because it has fewer rows
> than nrows.
>

You just need to move the test to the bottom of the loop:

       repeat{
          chunk <- read.table(conn, nrows=10000, col.names=nms)
          ## do something to the chunk
          if(nrow(chunk) < 10000) break
       }
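
If the file happens to contain an exact multiple of 10000 lines, the final
read.table() call will itself stop with "no lines available in input", so it is
worth guarding that last read. Here is a minimal end-to-end sketch along those
lines, assuming the "mybigfile" connection and column names from the quoted
code further down; the output file names ("chunk_1.txt", "chunk_2.txt", ...)
are purely illustrative, since your original goal was to split the data into
smaller files:

       conn <- file("mybigfile", open="r")
       chunk <- read.table(conn, header=TRUE, nrows=10000)
       nms <- names(chunk)
       i <- 1
       repeat{
          ## write the current chunk to its own numbered file
          write.table(chunk, paste("chunk_", i, ".txt", sep=""),
                      row.names=FALSE, sep="\t")
          ## a short chunk means we have just read the tail of the file
          if(nrow(chunk) < 10000) break
          i <- i + 1
          ## when the file length is an exact multiple of 10000, this read
          ## signals "no lines available in input"; treat that as an empty chunk
          chunk <- tryCatch(read.table(conn, nrows=10000, col.names=nms),
                            error=function(e) chunk[0, ])
          if(nrow(chunk) == 0) break
       }
       close(conn)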


>
>
> Quoting Thomas Lumley <tlumley at u.washington.edu>:
>
>> On Tue, 24 Mar 2009, Guillaume Filteau wrote:
>> 
>>> Hello all,
>>> 
>>> I’m trying to take a huge dataset (1.5 GB) and separate it into smaller 
>>> chunks with R.
>>> 
>>> So far I had nothing but problems.
>>> 
> >>> I cannot load the whole dataset in R due to memory problems. So instead I 
> >>> try to load 100,000 lines at a time (with read.table).
>>> 
> >>> However, R kept crashing (with no error message) at about line 6,800,000. 
> >>> This is extremely frustrating.
>>> 
>>> To try to fix this, I used connections with read.table. However, I now get 
>>> a cryptic error telling me “no lines available in input”.
>>> 
>>> Is there any way to make this work?
>>> 
>> 
>> There might be an error in line 42 of your script. Or somewhere else. The 
>> error message is cryptically saying that there were no lines of text 
>> available in the input connection, so presumably the connection wasn't 
>> pointed at your file correctly.
>> 
>> It's hard to guess without seeing what you are doing, but
>>    conn <- file("mybigfile", open="r")
>>    chunk <- read.table(conn, header=TRUE, nrows=10000)
>>    nms <- names(chunk)
>>    ## nrow(), not length(): a data frame's length() is its number of columns
>>    while(nrow(chunk) == 10000){
>>       chunk <- read.table(conn, nrows=10000, col.names=nms)
>>       ## do something to the chunk
>>    }
>>    close(conn)
>> 
>> should work. This sort of thing certainly does work routinely.
>> 
>> It's probably not worth reading 100,000 lines at a time unless your computer 
>> has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce 
>> much extra overhead and may well increase the speed by reducing memory use.
>> 
>>     -thomas
>> 
>> Thomas Lumley			Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu	University of Washington, Seattle
>> 
>> 
>> 
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
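
Two small follow-ups to the quoted message. For the "no lines available in
input" error, a quick file.exists("mybigfile") (with your real path) before
opening the connection will at least confirm that the connection is pointed
at an existing file.

On chunk size and memory: telling read.table the column types up front with
its colClasses argument means it does not have to work them out for every
chunk, which usually saves both memory and time. The types below are only
placeholders for whatever your columns actually are:

       chunk <- read.table(conn, nrows=10000, col.names=nms,
                           colClasses=c("integer", "numeric", "character"))

This slots straight into any of the loops above.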

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



