[R] select data from large CSV file

Stephen C. Upton supton at referentia.com
Thu Jul 5 17:40:20 CEST 2007


Hi Lars,

I haven't tried this, but I believe there were a couple of messages on 
the list recently about reading large files. They basically used scan() on 
an open connection and read the file in blocks.

see ?scan, ?connections
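
Something along these lines might work. It is only a sketch and I have not run it against your file. It assumes 12 comma-separated columns, no header line, and the location code in column 2. The function names read_location and unique_locations, the V1..V12 column names and the block size are made up for illustration. The idea is that a connection opened once keeps its read position, so each scan() call continues where the previous one stopped instead of skipping from the top of the file again.

```r
# Sketch only: process a large CSV in blocks through one open connection.
# Assumptions (not from the original post): 12 columns, no header,
# location code in column 2, all fields read as character.

read_block <- function(con, block_size, ncols) {
  # Reads up to block_size lines from the current position of con.
  block <- scan(con, what = rep(list(""), ncols), sep = ",",
                nlines = block_size, strip.white = TRUE, quiet = TRUE)
  names(block) <- paste0("V", seq_len(ncols))
  block
}

# Collect the distinct location codes found in column 2.
unique_locations <- function(path, block_size = 10000L, ncols = 12L) {
  con <- file(path, open = "r")   # opened once; position is remembered
  on.exit(close(con))
  codes <- character(0)
  repeat {
    block <- read_block(con, block_size, ncols)
    if (length(block$V2) == 0L) break   # end of file reached
    codes <- union(codes, block$V2)
  }
  codes
}

# Keep only the rows whose column 2 matches one location code.
read_location <- function(path, code, block_size = 10000L, ncols = 12L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    block <- read_block(con, block_size, ncols)
    if (length(block$V2) == 0L) break   # end of file reached
    rows <- as.data.frame(block, stringsAsFactors = FALSE)
    kept[[length(kept) + 1L]] <- rows[rows$V2 == as.character(code), ]
  }
  do.call(rbind, kept)   # NULL if the file was empty
}
```

With these, unique_locations("et.csv") would give the codes and read_location("et.csv", 25) the rows for location 25, each in a single pass over the file. Only one block is held in memory at a time, plus whatever rows are kept.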

HTH
steve

Lars Modig wrote:
> Hello
>
>
> I’ve got a large CSV file (>500M) with statistical data. It’s divided into
> 12 columns and I don’t know how many lines.
> The first column is the date and the second is a unique code for the
> location; the rest is (let’s say) different weather data. See the example
> below.
> 070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
> 070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
> 070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
> 070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y
>
> First I tried with data <- read.csv, and of course the memory got full.
> Then I found in the archive that you could use scan. So then I wrote the
> following lines below to search for location and store one location with
> all different data in one variable.
>
> # collect the different pnc's
>  b=2                                        #compare from second number
>  alike=TRUE                                 #Dim alike like a boolean
>  stored = 910286609                         #first number is known
>   for(i in 1: 100){                         #start counting and scaning
>      data_final <- matrix(unlist(scan(
>          "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
>          sep=",", what=list("","","","","","","","","","","",""),
>          skip=i, n=12)), ncol=12, byrow=TRUE)
>
>
>       a=1                                     #compare from the 1:th stored
>       while( a < b){                          #---
>                                               #
>         if(as.numeric(data_final[2]) != stored[a]) #compare
>           { a=a+1                                  #
>           alike=FALSE  }                           #
>         else{                                      #
>            alike=TRUE                              #
>            break }                                 #
>       }                                            # ---
>
>       if (alike==FALSE){                           #
>          stored[b]=as.numeric(data_final[2])       # Store new data
>          b=b+1                                     #
>       }
>   }
>
> #------------------------------------------------------------
> # save 1 pnc at the time
> d=1
> saved_data = 1:1200 ; dim(saved_data) <- c(12,100)
> save_data_nr = 1                               #Stored number
>   for(i in 1: 100){                            #start counting and scaning
>      data_final <- matrix(unlist(scan(
>          "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
>          sep=",", what=list("","","","","","","","","","","",""),
>          skip=i, n=12)), ncol=12, byrow=TRUE)
>
>
>       if(as.numeric(data_final[2]) == stored[save_data_nr]) #compare
>         { saved_data[,d] <-  matrix(unlist(data_final),ncol=12,
> byrow=TRUE)  #Store new data
>          d=d+1 }                                         #
>                                                          #
>                                                          #
>  }
> As you can see I’m not so familiar with R, and therefore I have probably
> done this the wrong way.
>
> As I understand it, each time this runs scan opens the file, counts
> down to the line that should be read, reads it, and then closes the file
> again. So once I get to around line number 10000 it starts to take a
> long time. I let the computer run overnight, but it was still
> far from finished when I stopped the loop.
>
> So how should I do this? Maybe I also need to sort on the date. The file
> is hopefully already in date order, so you should be able to cut it
> every time you hit a new month, but that will also take time if I do it
> like this.
>
> Thank you for your help in advance.
>
> Lars


