[R] select data from large CSV file

Lars Modig eb99lamo at kth.se
Thu Jul 5 13:54:10 CEST 2007


Hello


I’ve got a large CSV file (>500 MB) with statistical data. It’s divided into
12 columns, and I don’t know how many lines.
The first column is the date and the second is a unique code for the
location; the rest is, let’s say, different weather data. See the example
below.
070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y


First I tried data <- read.csv(...), and of course the memory got full.
Then I found in the archives that you could use scan. So I wrote the
lines below to search for a location and store one location, with all its
different data, in one variable.

# collect the different pnc's
b <- 2                  # compare from the second number on
alike <- TRUE           # boolean flag: code already stored?
stored <- 910286609     # first number is known
for (i in 1:100) {      # start counting and scanning
  data_final <- matrix(unlist(scan("C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
                                   sep = ",",
                                   what = list("","","","","","","","","","","",""),
                                   skip = i, n = 12)),
                       ncol = 12, byrow = TRUE)

  a <- 1                                    # compare from the first stored code
  while (a < b) {
    if (as.numeric(data_final[2]) != stored[a]) {  # note: as.numeric() must wrap
      a <- a + 1                                   # the value only, not the
      alike <- FALSE                               # whole comparison
    } else {
      alike <- TRUE
      break
    }
  }

  if (!alike) {
    stored[b] <- as.numeric(data_final[2])  # store the new code
    b <- b + 1
  }
}
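(A sketch of a shortcut I have seen for the first step: read.csv skips any column whose colClasses entry is "NULL", so only the location-code column is ever held in memory, and unique() collects the distinct codes in one pass. The tiny file written here just stands in for the real 500 MB file, and the column layout is assumed from the example rows above.)

```r
# Tiny stand-in for the real file (layout assumed from the example rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c("070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y",
             "070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N",
             "070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y"), tmp)

# "NULL" in colClasses makes read.csv drop that column while reading,
# so only column 2 (the location code) is loaded.
cls <- rep("NULL", 12)
cls[2] <- "numeric"
codes <- read.csv(tmp, header = FALSE, colClasses = cls)[[1]]
stored <- unique(codes)   # all distinct location codes, one pass
print(stored)
```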

#------------------------------------------------------------
# save 1 pnc at a time
d <- 1
saved_data <- matrix(NA_character_, nrow = 12, ncol = 100)  # scan returns characters
save_data_nr <- 1                         # index of the stored code to match
for (i in 1:100) {                        # start counting and scanning
  data_final <- matrix(unlist(scan("C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
                                   sep = ",",
                                   what = list("","","","","","","","","","","",""),
                                   skip = i, n = 12)),
                       ncol = 12, byrow = TRUE)

  if (as.numeric(data_final[2]) == stored[save_data_nr]) {  # same fix: as.numeric()
    saved_data[, d] <- unlist(data_final)                   # wraps the value only
    d <- d + 1                                              # store the matching row
  }
}
As you can see I’m not so familiar with R, so I have probably done this
the wrong way.

As I understand it, when running this, scan opens the file, counts down to
the line that should be read, reads it, and then closes the file again. So
by the time I reach line 10000 or so it starts to take a long time. I let
the computer run overnight, but it was still far from finished when I
stopped the loop.
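(That re-skipping is exactly the slow part: skip = i scans past i lines on every iteration, so the total work grows with the square of the file length. A sketch of a way around it: open a connection once, and scan() then resumes from its current position, so each row costs the same. The tiny temporary file here stands in for the real one.)

```r
# Tiny stand-in for the real 500 MB file (layout assumed from the post)
tmp <- tempfile(fileext = ".csv")
writeLines(c("070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y",
             "070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N",
             "070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y"), tmp)

# Open the connection ONCE: each scan() call continues from where the
# previous one stopped, instead of re-skipping from the top of the file.
con <- file(tmp, open = "r")
stored <- numeric(0)                      # unique location codes seen so far
repeat {
  row <- scan(con, sep = ",", what = character(), n = 12, quiet = TRUE)
  if (length(row) < 12) break             # end of file
  code <- as.numeric(row[2])
  if (!(code %in% stored)) stored <- c(stored, code)
}
close(con)
print(stored)
```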

So how should I do this? Maybe I also need to sort on the date. The dates
are hopefully already in order, so you should be able to cut the file
every time you hit a new month, but that will also take time if I do it
like this.
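(And for pulling out all rows of one location without loading the whole file: reading in chunks of a few thousand lines is much faster than one line at a time. A sketch, with the function name and chunk size made up; read.csv on an already-open connection keeps reading from where the previous chunk ended, and raises an error at end of file, which is caught here.)

```r
# Hypothetical helper: collect every row whose second column equals `code`,
# reading `chunk` lines at a time from an open connection.
get_location <- function(path, code, chunk = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- list()
  repeat {
    block <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk, strip.white = TRUE),
      error = function(e) NULL)           # "no lines available" at end of file
    if (is.null(block) || nrow(block) == 0) break
    out[[length(out) + 1]] <- block[block[[2]] == code, ]  # keep matching rows
  }
  do.call(rbind, out)
}
```

Used as get_location("et.csv", 25), it would return one data frame with all rows for location code 25; cutting by month would work the same way with a test on the date column instead.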

Thank you for your help in advance.

Lars



More information about the R-help mailing list