[R] How to deal with more than 6GB dataset using R?

jim holtman jholtman at gmail.com
Tue Jul 27 23:50:50 CEST 2010


It all depends on what you are doing with the data.  First, in your
scan example I would not read in one line at a time, but probably
several thousand, and then process the data; most of your time is
probably spent in reading.  I assume that you are not reading it all
in at once (but then maybe you are, since you have a 64-bit version).
It is also good to understand what read.fwf is doing.  It reads
in the file, parses it by columns, writes it with a separator to a
temporary file, and then reads that file in with read.table to get
the final result -- that is one of the reasons it is taking so long.

You might also consider putting the data into a database and then
reading the required instances out of there.  But it is hard to give
specific advice since we don't know what you want to do with the data.
In any case, at least read a good portion (several MB at a time)
to get the economy of scale, and not one line at a time.
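
For example, here is a minimal sketch of the database approach using
the RSQLite package (the file name, table name, and column layout are
made up for illustration; adjust them to your data):

library(RSQLite)   # also loads DBI

# open (or create) an on-disk SQLite database
db <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

# load the big file once, in chunks, so it never has to fit in memory
input <- file("bigfile.csv", "r")
repeat {
    lines <- readLines(input, n = 10000)
    if (length(lines) == 0) break
    tc <- textConnection(lines)
    chunk <- read.table(tc, sep = ",", header = FALSE,
                        col.names = c("id", "value"),
                        colClasses = c("character", "numeric"))
    close(tc)
    dbWriteTable(db, "bigdata", chunk, append = TRUE, row.names = FALSE)
}
close(input)

# later, pull out only the instances you need for the analysis
needed <- dbGetQuery(db, "SELECT * FROM bigdata WHERE value > 0")
dbDisconnect(db)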

Here is an example of reading in a csv file with 666,000 lines at 1
line per 'scan', then at 10, 1000, and 10000 lines at a time.  Notice
that at nlines=1 it takes 30 CPU seconds to process the data; at
nlines=1000 it takes 2.8 (about 10X faster).  So time the various
options to see what happens.

> input <- file(file, 'r')
> n <- 1  # lines to read
> system.time({
+ repeat{
+     lines <- scan(input, what=list('',''), sep=',', nlines=n,quiet=TRUE)
+     if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
  29.52    0.08   29.90
> close(input)
> input <- file(file, 'r')
> n <- 10  # lines to read
> system.time({
+ repeat{
+     lines <- scan(input, what=list('',''), sep=',', nlines=n,quiet=TRUE)
+     if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   5.93    0.00    5.99
> close(input)
> input <- file(file, 'r')
> n <- 1000  # lines to read
> system.time({
+ repeat{
+     lines <- scan(input, what=list('',''), sep=',', nlines=n,quiet=TRUE)
+     if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   2.79    0.08    2.90
> close(input)
> n <- 10000  # lines to read
> system.time({
+ repeat{
+     lines <- scan(input, what=list('',''), sep=',', nlines=n,quiet=TRUE)
+     if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   2.76    0.00    2.76
> close(input)
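
The loops above only time the reads and throw the data away.  A sketch
of the same chunked pattern that actually does something with each
block (the column names and the summary step are made up for
illustration) would look like:

input <- file(file, 'r')
n <- 10000                      # lines per chunk
total <- 0                      # running summary across chunks
repeat {
    chunk <- scan(input, what = list(id = '', value = ''),
                  sep = ',', nlines = n, quiet = TRUE)
    if (length(chunk$id) == 0) break
    # process the chunk here, e.g. accumulate a sum of the numeric column
    total <- total + sum(as.numeric(chunk$value), na.rm = TRUE)
}
close(input)
total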


On Tue, Jul 27, 2010 at 4:25 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> I've found that opening a connection, and scanning (in a loop)
> line-by-line, is far faster than either read.table or read.fwf. E.g.,
> here's a file (temp2) that has 1500 rows and 550K columns:
>
> showConnections(all=TRUE)
> con <- file("temp2",open='r')
> system.time({
> for (i in 0:(num.samp-1)){
>  new.gen[i+1,] <- scan(con,what='integer',nlines=1)}
> })
> close(con)
> #THIS TAKES 4.6 MINUTES
>
>
>
>
> system.time({
> new.gen2 <- read.fwf(con,widths=rep(1,num.cols),buffersize=100,header=FALSE,colClasses=rep('integer',num.cols))
> })
> #THIS TAKES OVER 20 MINUTES (I GOT BORED OF WAITING AND KILLED IT)
>
>
> This seems surprising to me. Can anyone see some other way to speed
> this type of thing up?
>
> Matt
>
>
> On Sat, Jul 24, 2010 at 1:55 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> You may want to look at the biglm package as another way to fit regression models on very large data sets.
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> Statistical Data Center
>> Intermountain Healthcare
>> greg.snow at imail.org
>> 801.408.8111
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>> project.org] On Behalf Of babyfoxlove1 at sina.com
>>> Sent: Friday, July 23, 2010 10:10 AM
>>> To: r-help at r-project.org
>>> Subject: [R] How to deal with more than 6GB dataset using R?
>>>
>>>  Hi there,
>>>
>>> Sorry to bother those who are not interested in this problem.
>>>
>>> I'm dealing with a large data set, a file of more than 6 GB, and
>>> running regression tests on those data. I was wondering whether there
>>> are any more efficient ways to read in the data than just using
>>> read.table(). BTW, I'm using a 64-bit desktop and a 64-bit version of
>>> R, and the desktop has enough memory for me to use.
>>> Thanks.
>>>
>>>
>>> --Gin
>>>
>>>
>>
>>
>
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


