[R] Reading large files

Saptarshi Guha saptarshi.guha at gmail.com
Sun Feb 7 08:24:41 CET 2010


Hello,
Do you need /all/ the data in memory at one time? Is your goal to
divide the data (e.g., according to some factor /or/ some function of
the columns of the data set) and then analyze the divisions? And then,
possibly, combine the results?
If so, you might consider using Rhipe. We have run analyses (e.g., getting
regression parameters, applying algorithms) across subsets of data where
the subsets are created according to some condition.
Using this approach (and a cluster of 8 machines, 72 cores) we
successfully analyzed data sets ranging from 14 GB to ~140 GB.
This all assumes that your divisions are suitably small. I notice
you mention that each region is 10-20 GB and you want to compute on
/all/ of it, i.e., you need all of it in memory. If so, Rhipe cannot help you.
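
To make the divide / analyze / combine idea concrete, here is a minimal
sketch in plain base R on a single machine. It does not use Rhipe and is
not how Rhipe is called; the toy data frame and the column names region,
x and y are made up for illustration only.

```r
# Toy data standing in for a large data set (hypothetical columns).
set.seed(1)
dat <- data.frame(
  region = rep(c("east", "west", "north"), each = 50),
  x      = rnorm(150)
)
dat$y <- 2 * dat$x + rnorm(150, sd = 0.1)

# Divide: one data frame per value of the grouping column.
pieces <- split(dat, dat$region)

# Analyze: fit a regression on each division independently.
fits <- lapply(pieces, function(d) coef(lm(y ~ x, data = d)))

# Combine: stack the per-division coefficients into one matrix,
# one row per region.
result <- do.call(rbind, fits)
print(result)
```

Each call inside lapply() sees only one division, so the same pattern
can in principle be spread over several machines as long as every
division fits in the memory of the worker that handles it.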


Regards
Saptarshi



On Thu, Feb 4, 2010 at 8:27 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
> Folks:
> I am trying to read in a large file. Definition of large is:
> Number of lines: 333,250
> Size: 850 MB
>
> The machine is a dual-core Intel with 4 GB RAM and nothing else running on it. I read the previous threads on read.fwf and did not see any conclusive statements on how to read such files quickly. An example record and the R code are given below. I was hoping to purchase a better machine and do analysis with larger datasets, but these preliminary results do not look good.
>
> Does anyone have any experience with large files (> 1GB) and using them with Revolution-R?
>
>
> Thanks.
>
> Satish
>
> Example Code
> # Widths and names of the 18 key fields at the start of each record
> key_vec <- c(1,3,3,4,2,8,8,2,2,3,2,2,1,3,3,3,3,9)
> key_names <- c("allgeo","area1","zone","dist","ccust1","whse","bindc","ccust2","account","area2","ccust3","customer","allprod","cat","bu","class","size","bdc")
> key_info <- data.frame(key_vec,key_names)
> # sas_time is not defined in this snippet; its week column is assumed to hold the 209 remaining column names
> col_names <- c(key_names,sas_time$week)
> # Each of the 209 numeric fields is 12 characters wide
> num_buckets <- rep(12,209)
> width_vec <- c(key_vec,num_buckets)
> col_classes<-c(rep("factor",18),rep("numeric",209))
> # Test run on the first 100 records only:
> #threewkoutstat <- read.fwf(file="3wkoutstatfcst_file02.dat",widths=width_vec,header=FALSE,colClasses=col_classes,n=100)
> threewkoutstat <- read.fwf(file="3wkoutstatfcst_file02.dat",widths=width_vec,header=FALSE,colClasses=col_classes)
> names(threewkoutstat) <- col_names
>
> Example record (only one record pasted below)
> A004001003799000049250000492599990049999A001002002015002015009        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00   !
>      0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.60        0.60        0.60        0.70        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00       !
>  0.00        0.00        0.00        0.00        0.00        0.00
>   0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00
>

