[R] Sum of column from another df based of row values of df1

Mon Sep 10 00:04:16 CEST 2012

Just to add, I did not know that the speed of data access is so much
different in matrix and dataframes. This is one for the future.

Thanks again Jim :)

-Shivam

On Mon, Sep 10, 2012 at 3:29 AM, Shivam <shivamsingh at gmail.com> wrote:
> Thanks a lot Jim, it works a treat. Just had to change the date format
> in the mCALL as well. But you saved me 80 hours of fretting and
> frustration. Really thankful for it.
>
> Regards,
> Shivam
>
> On Mon, Sep 10, 2012 at 1:33 AM, jim holtman <jholtman at gmail.com> wrote:
>> How about an improvement to 16 seconds.  The first thing to do is to
>> convert you data to a matrix because accessing data in a dataframe is
>> very expensive.  If you run Rprof on your code you will see that all
>> the time is spent in retrieving the information.  Converting to a
>> matrix and using matrix accessing is considerably faster.  I did
>> convert the POSIXct to Date.  You were also paying a lot in the
>> constant conversion of POSIXct to Date for your comparisons.  I just
>> replicated your CALL to 1 million rows for testing.
>>
>>
>>> CALL <-
>> + structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
>> + "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
>> + 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
>> + 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
>> + "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
>> + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
>> + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
>> + 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
>> + 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
>> + "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
>> + "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
>> + "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>>>
>>> VOL <-
>> + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
>> + 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
>> + ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
>> + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
>> + 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
>> + 6L), class = "data.frame")
>>>
>>> # convert to matrices for faster testing
>>> mCALL <- cbind(CALL$DATE, CALL$EXPDATE)
>>> mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL)  # convert POSIXct to Date
>>>
>>> # create 1M rows in mCALL
>>> mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),])
>>>
>>> result <- numeric(nrow(mCALL))
>>> system.time({
>> +     for (i in 1:nrow(mCALL)){
>> +         result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L])
>> +                             & (mVOL[, 1L] <= mCALL[i, 2L]), 2L])
>> +     }
>> + })
>>    user  system elapsed
>>   15.94    0.00   16.07
>>>
>>>
>>>
>>>
>>
>>
>> On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsingh at gmail.com> wrote:
>>> Dear All,
>>>
>>> I need to sum a column from another dataframe based on the row values
>>> of one dataframe. I am stuck in a loop trying to accomplish it and at
>>> current speed it will take more than 80 hours to complete. Needless to
>>> say I am looking for a more elegant/quicker solution. Really need some
>>> help here. Here is the issue:
>>>
>>> I have a dataframe CALL (the dput of head is given below) which has
>>> close to a million rows. There are 2 date columns which are of
>>> importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
>>> head given), which has 2 columns, DATE and VOL. It has the volatility
>>> corresponding to each day and it has a total of 124 records
>>> (corresponding to 6 months). I want to add another column in the CALL
>>> dataframe which would contain the sum of all the volatilities from the
>>> VOL df for the period specified by the interval of DATE and EXPDATE in
>>> each row of CALL df.
>>>
>>> For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
>>> '27-01-2011'. So I want the SUM column (A new column in CALL df) to
>>> contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01
>>> from the VOL dataframe.
>>>
>>> I have to repeat this process for all the rows in the dataframe. Here
>>> is the for-loop version of the solution:
>>>
>>> for (k in 1:nrow(CALL)){
>>> CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE
>>> <= CALL$EXPDATE[k]))
>>> }
>>>
>>> The loop will run for close to a million times, it has been running
>>> for more than 10 hours and its just 12% complete. It would take more
>>> than 80 hours to complete, not the mention the toll it would take on
>>> my laptop. So is there a better way that I can accomplish this task?
>>> Any input would be greatly appreciated. Below are the dput of the two
>>> dataframes.
>>>
>>> One point of note is that there are only 124 DISTINCT values of DATE
>>> and 6 DISTINCT values of EXPDATE, in case it can be used in some way.
>>>
>>>> dput(CALL)
>>> structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
>>> "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
>>> 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
>>> 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
>>> "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
>>> 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
>>> 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
>>> 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
>>> 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
>>> "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
>>> "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
>>> "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>>>
>>>
>>>> dput(VOL)
>>> structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
>>> 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
>>> ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
>>> 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
>>> 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
>>> 6L), class = "data.frame")
>>>
>>> Please do let me know if any more information from my side would help
>>> or if I need to explain the issue more clearly.
>>>
>>> Any minor improvement will be great help.
>>>
>>> Thanks in advance.
>>>
>>> -Shivam
>>>
>>>
>>> --
>>> *Victoria Concordia Crescit*
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>
>
>
> --
> *Victoria Concordia Crescit*

-- 
*Victoria Concordia Crescit*