[R] slow computation of functions over large datasets

Wed Aug 3 19:12:30 CEST 2011

On Aug 3, 2011, at 12:20 PM, jim holtman wrote:

> This takes about 2 secs for 1M rows:
>
>> n <- 1000000
>> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
> +                           itemPrice = rpois(n, 10))
>> require(data.table)
>> # convert to data.table
>> ed.dt <- data.table(exampledata)
>> system.time(result <- ed.dt[
> +                         , list(total = sum(itemPrice))
> +                         , by = orderID
> +                         ]
> +            )
>   user  system elapsed
>   1.30    0.05    1.34

Interesting, and impressive. I noted that the OP wanted the running total
within each order, which is what cumsum would provide, and for some reason
creating that longer result is even faster on my machine than the shorter
result using sum.
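
A minimal sketch of the cumsum idea on the OP's small example, using base R's
ave() rather than data.table so that it runs without extra packages. It
assumes the rows of each order sit next to each other, as they do in the
posted data; under that assumption the grouped running total matches what the
original loop computes. The column name orderAmount is taken from the OP's
code.

```r
# Toy data from the original question
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# Running total of itemPrice within each orderID.
# ave() applies FUN to each group and returns the values in the original row order.
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID,
                               FUN = cumsum)

print(exampledata)
```

With the data.table from Jim's example, the equivalent grouped call would
presumably be ed.dt[, list(orderAmount = cumsum(itemPrice)), by = orderID].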

-- 
David.
>>
>> str(result)
> Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
> $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
> $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
>> head(result)
>     orderID total
> [1,]       1    49
> [2,]       2    37
> [3,]       3    72
> [4,]       4    92
> [5,]       5    50
> [6,]       6    76
>>
>
>
> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
> <caroline.faisst at gmail.com> wrote:
>> Hello there,
>>
>>
>> I’m computing the total value of an order from the price of the order items
>> using a “for” loop and the “ifelse” function. I do this on a large dataframe
>> (close to 1m lines). The computation of this function is painfully slow: in
>> 1min only about 90 rows are calculated.
>>
>>
>> The computation time taken for a given number of rows increases with the
>> size of the dataset, see the example with my function below:
>>
>>
>> # small dataset: function performs well
>>
>> exampledata <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4),
>>                           itemPrice=c(10,17,9,12,25,10,1,9,7))
>>
>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>
>> system.time(for (i in 2:length(exampledata[,1]))
>> {exampledata[i,"orderAmount"] <- ifelse(
>>    exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
>>    exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
>>    exampledata[i,"itemPrice"])})
>>
>>
>> # large dataset: the very same computational task takes much longer
>>
>> exampledata2 <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
>>                            itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>>
>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>
>> system.time(for (i in 2:9)
>> {exampledata2[i,"orderAmount"] <- ifelse(
>>    exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
>>    exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
>>    exampledata2[i,"itemPrice"])})
>>
>>
>>
>> Does someone know a way to increase the speed?
>>
>>
>> Thank you very much!
>>
>> Caroline
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>
> -- 
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?

David Winsemius, MD
West Hartford, CT