[R] Sum of column from another df based of row values of df1

Shivam shivamsingh at gmail.com
Sun Sep 9 23:59:54 CEST 2012


Thanks a lot Jim, it works a treat. Just had to change the date format
in the mCALL as well. But you saved me 80 hours of fretting and
frustration. Really thankful for it.

Regards,
Shivam

On Mon, Sep 10, 2012 at 1:33 AM, jim holtman <jholtman at gmail.com> wrote:
> How about an improvement to 16 seconds.  The first thing to do is to
> convert you data to a matrix because accessing data in a dataframe is
> very expensive.  If you run Rprof on your code you will see that all
> the time is spent in retrieving the information.  Converting to a
> matrix and using matrix accessing is considerably faster.  I did
> convert the POSIXct to Date.  You were also paying a lot in the
> constant conversion of POSIXct to Date for your comparisons.  I just
> replicated your CALL to 1 million rows for testing.
>
>
>> CALL <-
> + structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
> + "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
> + 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
> + 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
> + "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
> + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
> + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
> + 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
> + 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
> + "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
> + "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
> + "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>>
>> VOL <-
> + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
> + 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
> + ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
> + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
> + 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
> + 6L), class = "data.frame")
>>
>> # convert to matrices for faster testing
>> mCALL <- cbind(CALL$DATE, CALL$EXPDATE)
>> mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL)  # convert POSIXct to Date
>>
>> # create 1M rows in mCALL
>> mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),])
>>
>> result <- numeric(nrow(mCALL))
>> system.time({
> +     for (i in 1:nrow(mCALL)){
> +         result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L])
> +                             & (mVOL[, 1L] <= mCALL[i, 2L]), 2L])
> +     }
> + })
>    user  system elapsed
>   15.94    0.00   16.07
>>
>>
>>
>>
>
>
> On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsingh at gmail.com> wrote:
>> Dear All,
>>
>> I need to sum a column from another dataframe based on the row values
>> of one dataframe. I am stuck in a loop trying to accomplish it and at
>> current speed it will take more than 80 hours to complete. Needless to
>> say I am looking for a more elegant/quicker solution. Really need some
>> help here. Here is the issue:
>>
>> I have a dataframe CALL (the dput of head is given below) which has
>> close to a million rows. There are 2 date columns which are of
>> importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
>> head given), which has 2 columns, DATE and VOL. It has the volatility
>> corresponding to each day and it has a total of 124 records
>> (corresponding to 6 months). I want to add another column in the CALL
>> dataframe which would contain the sum of all the volatilities from the
>> VOL df for the period specified by the interval of DATE and EXPDATE in
>> each row of CALL df.
>>
>> For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
>> '27-01-2011'. So I want the SUM column (A new column in CALL df) to
>> contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01
>> from the VOL dataframe.
>>
>> I have to repeat this process for all the rows in the dataframe. Here
>> is the for-loop version of the solution:
>>
>> for (k in 1:nrow(CALL)){
>> CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE
>> <= CALL$EXPDATE[k]))
>> }
>>
>> The loop will run for close to a million times, it has been running
>> for more than 10 hours and its just 12% complete. It would take more
>> than 80 hours to complete, not the mention the toll it would take on
>> my laptop. So is there a better way that I can accomplish this task?
>> Any input would be greatly appreciated. Below are the dput of the two
>> dataframes.
>>
>> One point of note is that there are only 124 DISTINCT values of DATE
>> and 6 DISTINCT values of EXPDATE, in case it can be used in some way.
>>
>>> dput(CALL)
>> structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
>> "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
>> 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
>> 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
>> "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
>> 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
>> 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
>> 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
>> 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
>> "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
>> "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
>> "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>>
>>
>>> dput(VOL)
>> structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
>> 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
>> ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
>> 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
>> 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
>> 6L), class = "data.frame")
>>
>> Please do let me know if any more information from my side would help
>> or if I need to explain the issue more clearly.
>>
>> Any minor improvement will be great help.
>>
>> Thanks in advance.
>>
>> -Shivam
>>
>>
>> --
>> *Victoria Concordia Crescit*
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.



-- 
*Victoria Concordia Crescit*




More information about the R-help mailing list