[R] slow computation of functions over large datasets

ONKELINX, Thierry Thierry.ONKELINX at inbo.be
Wed Aug 3 15:59:08 CEST 2011


Dear Caroline,

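Here is a faster and more elegant solution using ddply() from the plyr package. On 10000 rows it takes about 1.7 seconds, versus about 12 seconds for the loop.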

> n <- 10000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE), itemPrice = rpois(n, 10))
> library(plyr)
> system.time({
+ 	ddply(exampledata, .(orderID), function(x){
+ 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+ 	})
+ })
   user  system elapsed 
   1.67    0.00    1.69 
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
> system.time(for (i in 2:length(exampledata[,1]))
+ {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
   user  system elapsed 
  11.94    0.02   11.97
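
For comparison, a base R sketch of the same per-order running total, without plyr. This is not part of the timings above; it assumes the columns are named orderID and itemPrice as in the question, and the helper names runID and orderAmountRun are made up for illustration. ave() keeps the rows in their original order, whereas ddply() returns them grouped by orderID.

```r
# Small dataset from the original question
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# Running total of itemPrice within each orderID, in the original row order
exampledata$orderAmount <- ave(exampledata$itemPrice, exampledata$orderID,
                               FUN = cumsum)

# Variant that restarts the total whenever orderID differs from the previous
# row, which is what the row-by-row loop does if equal IDs are not contiguous.
# runID (hypothetical helper) numbers the consecutive runs of equal orderID.
n <- nrow(exampledata)
runID <- cumsum(c(TRUE, exampledata$orderID[-1] != exampledata$orderID[-n]))
exampledata$orderAmountRun <- ave(exampledata$itemPrice, runID, FUN = cumsum)

print(exampledata)
```

When the rows of each order are contiguous, as in the question, both columns agree. With the randomly sampled orderID used in the benchmark above they would differ, because the loop only looks at the previous row.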

Best regards,

Thierry
> -----Original message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On behalf of Caroline Faisst
> Sent: Wednesday 3 August 2011 15:26
> To: r-help at r-project.org
> Subject: [R] slow computation of functions over large datasets
> 
> Hello there,
> 
> 
> I'm computing the total value of an order from the price of the order items using
> a "for" loop and the "ifelse" function. I do this on a large data frame (close to
> 1 million rows). The computation is painfully slow: in 1 minute only about
> 90 rows are calculated.
> 
> 
> The computation time taken for a given number of rows increases with the size
> of the dataset, see the example with my function below:
> 
> 
> # small dataset: function performs well
> 
> exampledata<-
> data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
> 
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
> 
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"]<-
> ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
> exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
> exampledata[i,"itemPrice"])})
> 
> 
> # large dataset: the very same computational task takes much longer
> 
> exampledata2<-
> data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
> itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
> 
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
> 
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"]<-
> ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
> exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
> exampledata2[i,"itemPrice"])})
> 
> 
> 
> Does someone know a way to increase the speed?
> 
> 
> Thank you very much!
> 
> Caroline
> 


