[R] How to deal with more than 6GB dataset using R?

Matthew Keller mckellercran at gmail.com
Tue Jul 27 22:25:24 CEST 2010


I've found that opening a connection and scanning line-by-line (in a
loop) is far faster than either read.table or read.fwf. E.g.,
here's a file (temp2) that has 1500 rows and 550K columns:

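showConnections(all=TRUE)
# new.gen is assumed to be a preallocated num.samp x num.cols integer matrix
con <- file("temp2",open='r')
system.time({
for (i in 0:(num.samp-1)){
  # what=integer() reads integers; what='integer' would read character strings
  new.gen[i+1,] <- scan(con,what=integer(),nlines=1)}
})
close(con)
#THIS TAKES 4.6 MINUTES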
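
For anyone who wants to try the loop without the real data, here is a minimal self-contained sketch. The temporary file, its tiny dimensions, and the space-separated layout are assumptions made up for illustration; they are not the actual temp2 file, and the timing above does not apply to it.

```r
# Hypothetical small stand-in for temp2: 5 rows x 8 columns, space-separated
num.samp <- 5L
num.cols <- 8L
tmp <- tempfile()
orig <- matrix(sample(0:2, num.samp * num.cols, replace = TRUE),
               nrow = num.samp, ncol = num.cols)
write.table(orig, tmp, row.names = FALSE, col.names = FALSE)

# Preallocate an integer matrix so each row assignment fills existing storage
new.gen <- matrix(NA_integer_, nrow = num.samp, ncol = num.cols)

con <- file(tmp, open = "r")
for (i in seq_len(num.samp)) {
  # An open connection keeps its position, so each call reads the next line;
  # what = integer() makes scan() return integers rather than strings
  new.gen[i, ] <- scan(con, what = integer(), nlines = 1, quiet = TRUE)
}
close(con)
unlink(tmp)
```

The matrix is allocated once up front, so the loop only fills rows and never grows the object.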




system.time({
# con was closed above, so read.fwf is given the file name instead
new.gen2 <- read.fwf("temp2",widths=rep(1,num.cols),buffersize=100,header=FALSE,colClasses=rep('integer',num.cols))
})
#THIS TAKES OVER 20 MINUTES (I GOT BORED OF WAITING AND KILLED IT)


This seems surprising to me. Can anyone see some other way to speed
this type of thing up?
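
One possibility, sketched here without any timing on the real file: if every column really is a single digit with no separators (which is what widths=rep(1,num.cols) suggests, but is an assumption), each line can be read with readLines() and converted with utf8ToInt(). The sample file, chunk.size, and the variable names below are hypothetical.

```r
# Hypothetical stand-in file: 4 rows of 6 single-digit codes, no separators
num.samp <- 4L
num.cols <- 6L
tmp <- tempfile()
writeLines(c("012012", "210210", "000111", "222000"), tmp)

new.gen <- matrix(NA_integer_, nrow = num.samp, ncol = num.cols)
chunk.size <- 2L            # lines per readLines() call; illustrative value
zero <- utf8ToInt("0")      # code point of the character "0"

con <- file(tmp, open = "r")
r.idx <- 0L
repeat {
  lines <- readLines(con, n = chunk.size)
  if (length(lines) == 0L) break   # end of file reached
  for (ln in lines) {
    r.idx <- r.idx + 1L
    # utf8ToInt() gives one code point per character; subtracting the
    # code point of "0" turns the digits "0".."9" into integers 0..9
    new.gen[r.idx, ] <- utf8ToInt(ln) - zero
  }
}
close(con)
unlink(tmp)
```

Whether this beats the scan loop on a 1500 x 550K file would need to be measured with system.time(); the sketch only shows the mechanics.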

Matt


On Sat, Jul 24, 2010 at 1:55 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> You may want to look at the biglm package as another way to fit regression models on very large data sets.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of babyfoxlove1 at sina.com
>> Sent: Friday, July 23, 2010 10:10 AM
>> To: r-help at r-project.org
>> Subject: [R] How to deal with more than 6GB dataset using R?
>>
>>  Hi there,
>>
>> Sorry to bother those who are not interested in this problem.
>>
>> I'm dealing with a large data set, a file of more than 6 GB, and running
>> regression tests on the data. I was wondering whether there are any
>> efficient ways to read the data, other than just using read.table().
>> BTW, I'm using a 64-bit desktop and a 64-bit version of R, and the
>> desktop has enough memory for my needs.
>> Thanks.
>>
>>
>> --Gin
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com


