[R] slow computation of functions over large datasets

David Winsemius dwinsemius at comcast.net
Wed Aug 3 20:09:42 CEST 2011


On Aug 3, 2011, at 2:01 PM, Ken wrote:

> Hello,
> Perhaps transpose the table with attach(as.data.frame(t(data))) and use the
> colSums() function with the order ID as header.
>             -Ken Hutchison

  Got any code? The OP offered a reproducible example, after all.
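
For reference, here is a minimal base R sketch of per-order totals that needs no transpose. It assumes the OP's small example and column names (orderID, itemPrice); the name `totals` is only illustrative, and this is one possible approach, not necessarily what Ken had in mind.

```r
# Small example from the original post
exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))

# rowsum() adds up a numeric vector within the groups of a second vector;
# the result is a one-column matrix with one row per orderID
totals <- rowsum(exampledata$itemPrice, group = exampledata$orderID)
print(totals)
```

The row names of `totals` carry the order IDs, so no reshaping of the data frame is needed.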

-- 
David.
>
> On Aug 3, 2554 BE, at 1:12 PM, David Winsemius  
> <dwinsemius at comcast.net> wrote:
>
>>
>> On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
>>
>>> This takes about 2 secs for 1M rows:
>>>
>>>> n <- 1000000
>>>> exampledata <- data.frame(orderID = sample(floor(n / 5), n,  
>>>> replace = TRUE), itemPrice = rpois(n, 10))
>>>> require(data.table)
>>>> # convert to data.table
>>>> ed.dt <- data.table(exampledata)
>>>> system.time(result <- ed.dt[
>>> +                         , list(total = sum(itemPrice))
>>> +                         , by = orderID
>>> +                         ]
>>> +            )
>>> user  system elapsed
>>> 1.30    0.05    1.34
>>
>> Interesting. Impressive. I also noted that the OP wanted what cumsum
>> would provide (a running total within each order), and for some reason
>> creating that longer result is even faster on my machine than the
>> shorter result using sum.
>>
>> -- 
>> David.
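
A rough sketch of the cumsum variant mentioned above, written in base R so it runs without extra packages. Note that ave() with FUN = cumsum is a swapped-in technique here, not the data.table call that was timed; the commented line at the end only suggests what a data.table version might look like, assuming ed.dt as built in the quoted code.

```r
# Small example from the original post
exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))

# Running total of itemPrice within each orderID; unlike sum(), this
# returns one value per input row
exampledata$orderAmount <- ave(exampledata$itemPrice, exampledata$orderID,
                               FUN = cumsum)
print(exampledata$orderAmount)
# expected values: 10 27 36 12 37 10 11 20 7

# Possible data.table form (an assumption, not timed here):
# ed.dt[, list(orderAmount = cumsum(itemPrice)), by = orderID]
```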
>>>>
>>>> str(result)
>>> Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
>>> $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
>>> $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
>>>> head(result)
>>>   orderID total
>>> [1,]       1    49
>>> [2,]       2    37
>>> [3,]       3    72
>>> [4,]       4    92
>>> [5,]       5    50
>>> [6,]       6    76
>>>>
>>>
>>>
>>> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
>>> <caroline.faisst at gmail.com> wrote:
>>>> Hello there,
>>>>
>>>>
>>>> I’m computing the total value of an order from the price of the order
>>>> items using a “for” loop and the “ifelse” function. I do this on a large
>>>> data frame (close to 1 million rows). The computation of this function is
>>>> painfully slow: in 1 minute only about 90 rows are calculated.
>>>>
>>>>
>>>> The computation time taken for a given number of rows increases with the
>>>> size of the dataset; see the example with my function below:
>>>>
>>>>
>>>> # small dataset: function performs well
>>>>
>>>> exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
>>>>                           itemPrice = c(10,17,9,12,25,10,1,9,7))
>>>>
>>>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>>>
>>>> system.time(for (i in 2:length(exampledata[,1]))
>>>> {exampledata[i,"orderAmount"] <-
>>>>   ifelse(exampledata[i,"orderID"] == exampledata[i-1,"orderID"],
>>>>          exampledata[i-1,"orderAmount"] + exampledata[i,"itemPrice"],
>>>>          exampledata[i,"itemPrice"])})
>>>>
>>>>
>>>> # large dataset: the very same computational task takes much longer
>>>>
>>>> exampledata2 <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4,5:2000000),
>>>>                            itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
>>>>
>>>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>>>
>>>> system.time(for (i in 2:9)
>>>> {exampledata2[i,"orderAmount"] <-
>>>>   ifelse(exampledata2[i,"orderID"] == exampledata2[i-1,"orderID"],
>>>>          exampledata2[i-1,"orderAmount"] + exampledata2[i,"itemPrice"],
>>>>          exampledata2[i,"itemPrice"])})
>>>>
>>>>
>>>>
>>>> Does someone know a way to increase the speed?
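
One likely cause of the slowdown is the row-by-row assignment into the data frame, which gets more expensive as the data frame grows. A minimal sketch of the same logic as the loop (a running total that restarts whenever orderID differs from the previous row), done on plain vectors with a single assignment at the end. The helper names (price, id, new_run, running, offset) are illustrative and not from the original post.

```r
# Small example from the original post
exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))

price <- exampledata$itemPrice
id    <- exampledata$orderID

# TRUE where a row starts a new run of identical orderIDs
new_run <- c(TRUE, id[-1] != id[-length(id)])

# Running total over all rows, and the total accumulated before each run
running <- cumsum(price)
offset  <- (running - price)[new_run]

# Subtract each run's starting offset to restart the total per order
exampledata$orderAmount <- running - offset[cumsum(new_run)]
print(exampledata$orderAmount)
# expected values: 10 27 36 12 37 10 11 20 7
```

Like the loop, this compares each row only with the one before it, so it assumes rows of the same order are adjacent.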
>>>>
>>>>
>>>> Thank you very much!
>>>>
>>>> Caroline
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>>
>>>
>>> -- 
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>
>> David Winsemius, MD
>> West Hartford, CT

David Winsemius, MD
West Hartford, CT


