[R] slow computation of functions over large datasets

Ken vicvoncastle at gmail.com
Wed Aug 3 21:05:59 CEST 2011


Sorry about the lack of code, but using David's example, would:
tapply(itemPrice, INDEX=orderID, FUN=sum)
work?
  -Ken Hutchison
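
A minimal sketch of that check, run on the small example from the original post. tapply() needs both columns in scope, so the call is wrapped in with(); that wrapper is an assumption about how the data are held, not something stated in the thread.

```r
# Small example data from the original post
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# One total per order; the result is a named 1-d array keyed by orderID
totals <- with(exampledata, tapply(itemPrice, INDEX = orderID, FUN = sum))
print(totals)
```

This returns one total per order, not the row-by-row running total that the loop in the original post builds.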

On Aug 3, 2011, at 2:09 PM, David Winsemius <dwinsemius at comcast.net> wrote:

> 
> On Aug 3, 2011, at 2:01 PM, Ken wrote:
> 
>> Hello,
>> Perhaps transpose the table with attach(as.data.frame(t(data))) and use the colSums() function with the order ID as header.
>>            -Ken Hutchison
> 
> Got any code? The OP offered a reproducible example, after all.
> 
> -- 
> David.
>> 
>> On Aug 3, 2011, at 1:12 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> 
>>> 
>>> On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
>>> 
>>>> This takes about 2 secs for 1M rows:
>>>> 
>>>>> n <- 1000000
>>>>> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE), itemPrice = rpois(n, 10))
>>>>> require(data.table)
>>>>> # convert to data.table
>>>>> ed.dt <- data.table(exampledata)
>>>>> system.time(result <- ed.dt[
>>>> +                         , list(total = sum(itemPrice))
>>>> +                         , by = orderID
>>>> +                         ]
>>>> +            )
>>>> user  system elapsed
>>>> 1.30    0.05    1.34
>>> 
>>> Interesting. Impressive. And I noted that the OP wanted what cumsum would provide and for some reason creating that longer result is even faster on my machine than the shorter result using sum.
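
The code behind the cumsum remark is not shown in the thread. As a rough base R sketch (an assumption, not the code that was timed), ave() with FUN = cumsum gives a running total within each order, one value per row:

```r
# Small example data from the original post
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# Running total of itemPrice within each orderID, aligned with the input rows
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID,
                               FUN = cumsum)
print(exampledata)
```

ave() groups by orderID wherever the rows sit, while the original loop only compares each row with the one before it. The two agree when rows of the same order are adjacent, as they are in this example.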
>>> 
>>> -- 
>>> David.
>>>>> 
>>>>> str(result)
>>>> Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
>>>> $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
>>>> $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
>>>>> head(result)
>>>>  orderID total
>>>> [1,]       1    49
>>>> [2,]       2    37
>>>> [3,]       3    72
>>>> [4,]       4    92
>>>> [5,]       5    50
>>>> [6,]       6    76
>>>>> 
>>>> 
>>>> 
>>>> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
>>>> <caroline.faisst at gmail.com> wrote:
>>>>> Hello there,
>>>>> 
>>>>> 
>>>>> I’m computing the total value of an order from the price of the order items
>>>>> using a “for” loop and the “ifelse” function. I do this on a large dataframe
>>>>> (close to 1m lines). The computation of this function is painfully slow: in
>>>>> 1min only about 90 rows are calculated.
>>>>> 
>>>>> 
>>>>> The computation time taken for a given number of rows increases with the
>>>>> size of the dataset, see the example with my function below:
>>>>> 
>>>>> 
>>>>> # small dataset: function performs well
>>>>> 
>>>>> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>>>>> 
>>>>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>>>> 
>>>>> system.time(for (i in 2:nrow(exampledata)) {
>>>>>   exampledata[i, "orderAmount"] <- ifelse(
>>>>>     exampledata[i, "orderID"] == exampledata[i - 1, "orderID"],
>>>>>     exampledata[i - 1, "orderAmount"] + exampledata[i, "itemPrice"],
>>>>>     exampledata[i, "itemPrice"])
>>>>> })
>>>>> 
>>>>> 
>>>>> # large dataset: the very same computational task takes much longer
>>>>> 
>>>>> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>>>>> 
>>>>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>>>> 
>>>>> system.time(for (i in 2:9) {
>>>>>   exampledata2[i, "orderAmount"] <- ifelse(
>>>>>     exampledata2[i, "orderID"] == exampledata2[i - 1, "orderID"],
>>>>>     exampledata2[i - 1, "orderAmount"] + exampledata2[i, "itemPrice"],
>>>>>     exampledata2[i, "itemPrice"])
>>>>> })
>>>>> 
>>>>> 
>>>>> 
>>>>> Does someone know a way to increase the speed?
>>>>> 
>>>>> 
>>>>> Thank you very much!
>>>>> 
>>>>> Caroline
>>>>> 
>>>>> 
>>>>> 
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jim Holtman
>>>> Data Munger Guru
>>>> 
>>>> What is the problem that you are trying to solve?
>>>> 
>>> 
>>> David Winsemius, MD
>>> West Hartford, CT
>>> 
> 
> David Winsemius, MD
> West Hartford, CT
> 


