[R] How to make this for() loop memory efficient?

Wed Jan 11 00:31:10 CET 2012

On Wed, 11 Jan 2012, iliketurtles wrote:
> ##I have 2 columns of data. The first column is unique "event IDs" that
> ##represent a phone call made to a customer.
> ###So, if you see 3 entries together in the first column like follows:
> 
> matrix(c("call1a","call1a","call1a") )
> 
> ##then this means that this particular phone call (the first call that's
> ##logged in the data set) was transferred
> ##between 3 different "modules" before the call was terminated.
> 
> ##The second column is a numerical description of the module the call
> ##started with and then got transferred to prior to call termination. Now,
> ##I'll construct a representative array of the type of data I'm dealing
> ##with (the real data set goes on for X00,000s of rows):
> ##(Ignore how I construct the following array, it's completely unrelated to
> ##how the actual data set was constructed).
> 
> 
> a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
> development.a<-seq(1,40,3)
> development.a2<-seq(1,40,5)
> a[development.a]<-a[development.a+1]
> a[development.a2]<-a[development.a2+1]
> a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a"
> a[9]<-"call9a"
> b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
> 930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
> 930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
> 960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
> 970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
> data<-as.data.frame(cbind(a,b))
> colnames(data)<-c("phone calls","modules")
> dim(data)
> print(data[1:10,]) #sample of 10 rows
> 
> # Note that in the real data set, data[,2] ranges from 810,000 to 999,999.
> # I've been tasked with the following:
> # "For each phone call that BEGINS with the module which is denoted by 81
> # (i.e. of the form 81X,XXX), what is the expected number of modules in these
> # calls?"
> # Then it's the same question for each module beginning with 82, 83, 84.....
> # all the way until 99.
> # I've created code that I think works for this, but I can't actually run it
> # on the whole data set. I left it for 30 minutes and it only had about 5%
> # of the task completed (I clicked "STOP" then checked my output to see if I
> # did it properly, and it seems correct).
> # I know the apply() family specializes in vector operations, but I can't
> # figure out how to complete the above question in any way other than loops.
> 
> L<-data
> 
> A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
> A<-data.frame(A)
> 
>  for(i in 1:(nrow(L)-1))
>  {
>   if(L[(i+1),1]!=L[i,1])
>   {
> 
>    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]<- {
>     #aggregate number of modules in the calls that begin with XX (not yet averaged).
>     A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+
>      length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
>    }
> 
>    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]<- {
>     A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1 }
>   }
> 
>  }
> 
> #If I can get this code to be more memory efficient such that I can do it
> #on a 400,000 row data set, I can do, for example,
> 
> A[17,1]/A[17,2]
> 
> #and I'll arrive at the mean number of modules per call where the call
> #starts with a module that starts with 97.
> 
> A[17,1]
> #is 10, which means that, out of every single call that started with a
> #module of 97X,XXX,
> #they went through 10 modules in total.
> 
> A[17,2]
> #is 6, which means that there were 6 calls in total that began with a
> #97X,XXX module.
> 
> #Hence,
> 
> 
> A[17,1]/A[17,2]
> 
> #is the average number of modules that were executed in all the calls that
> #began with a 97X,XXX module.
> 
> 
> -----
> ----
> 
> Isaac
> Research Assistant
> Quantitative Finance Faculty, UTS

I don't see any need for you to use data frames.

If you make A and data (not a good use of a variable name) just matrices, you get the same 
answers at about 10 times the speed (using your example).
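
To illustrate, here is a minimal sketch of the same loop on plain matrices. It is an illustration only, not the code that was timed: the toy vectors ids and modules are hypothetical stand-ins for the two columns, and == replaces the grep() call so that call IDs are matched exactly rather than as patterns. Like the original loop, it only tallies a call when the ID changes, so the first call in the data is not counted.

```r
# Hypothetical toy data (assumed for illustration, not the poster's data set).
ids     <- c("c1", "c1", "c2", "c3", "c3", "c3", "c4")
modules <- c(920010, 960010, 970050, 970010, 930010, 960500, 820009)

# cbind() of a character and a numeric vector gives a character matrix,
# so no as.data.frame() step is needed.
L <- cbind(ids, modules)

# Accumulator as a numeric matrix with rows named "81" to "99":
# column 1 = total modules, column 2 = number of calls.
A <- matrix(0, nrow = 19, ncol = 2,
            dimnames = list(as.character(81:99), c("modules", "calls")))

for (i in 1:(nrow(L) - 1)) {
  if (L[i + 1, 1] != L[i, 1]) {                 # row i+1 starts a new call
    key <- substr(L[i + 1, 2], 1, 2)            # first two digits of its module
    # Exact match on the call ID (swapped in for the original grep()).
    A[key, 1] <- A[key, 1] + sum(L[, 1] == L[i + 1, 1])
    A[key, 2] <- A[key, 2] + 1
  }
}
```

With this toy input, A["97", 1] / A["97", 2] is 4/2 = 2, the mean number of modules for the two calls that start with a 97 module.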

Hope this helps,
Ray Brownrigg