[R] Large database help

Rogerio Porto rdporto1 at terra.com.br
Tue May 16 03:33:45 CEST 2006


Hello all.

I have a large .txt file whose variables are stored in fixed-width columns,
i.e., variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
It is a 60 GB file with 90 variables and 60 million observations.
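To make the layout concrete, here is a minimal, untested sketch of what I imagine
such a read would look like (I am not sure read.fwf is even the right tool). Only
the first two widths (7 and 16 characters) come from the layout above; the file
name and the row limit are placeholders:

# Sketch only: "bigfile.txt" is a hypothetical name; widths 7 and 16
# correspond to V1 (cols 1-7) and V2 (cols 8-23); the rest of each
# record should be ignored by read.fwf.
two_vars <- read.fwf("bigfile.txt",
                     widths    = c(7, 16),
                     col.names = c("V1", "V2"),
                     n         = 1000)   # only the first 1000 records, as a test
str(two_vars)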

I'm working on a Pentium 4 with 1 GB of RAM, running Windows XP Pro.
I tried the following code just to see whether I could work with 2 variables,
but it does not seem to be possible:
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1  (2005-12-20 r36812)
ISBN 3-900051-07-0
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 169011  4.6     350000  9.4   350000  9.4
Vcells  62418  0.5     786432  6.0   289957  2.3
> memory.limit(size=4090)
NULL
> memory.limit()
[1] 4288675840
> system.time(a<-matrix(runif(1e6),nrow=1))
[1] 0.28 0.02 2.42   NA   NA
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  171344  4.6     350000  9.4   350000  9.4
Vcells 1063212  8.2    3454398 26.4  4063230 31.0
> rm(a)
> ls()
character(0)
> system.time(a<-matrix(runif(60e6),nrow=1))
Error: cannot allocate vector of size 468750 Kb
Timing stopped at: 7.32 1.95 83.55 NA NA 
> memory.limit(size=5000)
Error in memory.size(size) : .....4GB
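As a sanity check on these numbers (my own rough arithmetic, not output from the
session above):

60e6 * 8 / 1024          # about 468750 Kb for the 60e6-element double vector
60e6 * 90 * 8 / 1024^3   # roughly 40 Gb for the full 60-million x 90 numeric matrix

so the allocation failure above seems to be exactly what one would expect on a
machine with 1 GB of RAM and a 32-bit address space.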

So my questions are:
1) (newbie) How can I read fixed-width text files like this? Is read.fwf,
    as in the sketch above, the right tool?
2) Is there a way to analyze such a large database (statistics like
    correlations, clustering, etc.) without increasing RAM or moving to a
    64-bit machine, still using R and without working on a sample? How?
    (See the sketch below for the kind of thing I have in mind.)
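To clarify question 2, the best idea I have so far is something like the rough,
untested sketch below: read the file in blocks through a connection and
accumulate the sufficient statistics for, say, a correlation between V1 and V2.
The file name and block size are placeholders; the column positions follow the
description above. I do not know whether this is a sensible way to do it in R:

con <- file("bigfile.txt", open = "r")       # hypothetical file name
n <- 0; sx <- 0; sy <- 0; sxx <- 0; syy <- 0; sxy <- 0
repeat {
    block <- readLines(con, n = 100000)      # one block of records
    if (length(block) == 0) break
    x <- as.numeric(substr(block, 1, 7))     # V1, columns 1-7
    y <- as.numeric(substr(block, 8, 23))    # V2, columns 8-23
    n   <- n + length(x)
    sx  <- sx + sum(x);    sy  <- sy + sum(y)
    sxx <- sxx + sum(x^2); syy <- syy + sum(y^2)
    sxy <- sxy + sum(x * y)
}
close(con)
# Pearson correlation from the accumulated sums
r <- (sxy - sx * sy / n) /
     sqrt((sxx - sx^2 / n) * (syy - sy^2 / n))
r

(I realize this one-pass formula can be numerically unstable with 60 million
observations, and that clustering would need something smarter than running
sums; that is partly why I am asking.)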

Thanks in advance.

Rogerio.
