[R] How long does skipping in read.table take

Gabor Grothendieck ggrothendieck at gmail.com
Sat Oct 23 04:36:31 CEST 2010


On Fri, Oct 22, 2010 at 6:41 PM, Mike Marchywka <marchywka at hotmail.com> wrote:
>> From: ggrothendieck at gmail.com
>> Date: Fri, 22 Oct 2010 18:28:14 -0400
>> To: dimitri.liakhovitski at gmail.com
>> CC: r-help at r-project.org
>> Subject: Re: [R] How long does skipping in read.table take
>>
>> On Fri, Oct 22, 2010 at 5:17 PM, Dimitri Liakhovitski wrote:
>> > I know I could figure it out empirically - but maybe based on your
>> > experience you can tell me if it's doable in a reasonable amount of
>> > time:
>> > I have a table (in .txt) with 17,000,000 rows (and 30 columns).
>> > I can't read it all in (there are many strings). So I thought I could
>> > read it in in parts (e.g., 1 million rows at a time) using nrows= and skip=.
>> > I was able to read in the first 1,000,000 rows with no problem in 45 sec.
>> > But then I tried to skip 16,999,999 rows and read in from there, and
>> > R crashed. Should I try again - or is it too many rows to skip for R?
>> >
>>
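
One note on the skip= approach: read.table cannot jump straight to a
line, so skip=16999999 still reads (and throws away) every one of those
lines before returning anything, and that cost is paid again for every
chunk.  If you want to stay with chunked reading, reading from an open
connection avoids the re-scanning, because a connection keeps its
position between reads.  A rough sketch (the file name, separator and
chunk size below are only placeholders):

  con <- file("big.txt", open = "r")
  nms <- scan(con, what = "", nlines = 1, sep = "\t", quiet = TRUE)  # header row
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = 1e6, sep = "\t",
                 col.names = nms, stringsAsFactors = FALSE),
      error = function(e) NULL)    # read.table errors once no lines are left
    if (is.null(chunk)) break
    ## ... process 'chunk' here ...
    if (nrow(chunk) < 1e6) break   # last, partial chunk
  }
  close(con)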

What we are doing is not related to that.  It's simply that the default
backend to sqldf, sqlite, can be faster than R and can handle larger
datasets too (since sqlite does not have to hold everything in memory
the way R does), so pushing as much work as possible onto sqlite and
then grabbing only what you need into R at the end can get around
bottlenecks in R.  Since it's just a matter of writing one line of
code, read.csv.sql instead of read.csv, it's relatively simple to try out.
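
For example, something along these lines (the file name, separator and
where clause are just placeholders for your actual data):

  library(sqldf)

  # sqlite reads the file itself; only the rows matching the sql
  # statement come back into R.  The table is referred to as 'file'
  # in the sql argument.
  DF <- read.csv.sql("big.txt",
                     sql = "select * from file where SomeColumn = 'A'",
                     header = TRUE, sep = "\t")

See ?read.csv.sql for the other arguments.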

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


