[R] Thoughts for faster indexing

Tue Nov 26 21:23:51 CET 2013

Hi,

On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman <noahsilverman at ucla.edu> wrote:
> All interesting suggestions.
>
> I guess a better example of the code would have been a good idea.  So,
> I'll put a relevant snippet here.
>
> Rows are cases.  There are multiple cases for each ID, marked with a
> date.  I'm trying to calculate a time recency weighted score for a
> covariate, added as a new column in the data.frame.
>
> So, for each row, I need to see which ID it belongs to, then get all the
> scores prior to this row's date, then compute the recency weighted summary.
>
> Right now, I do this in an obvious, but very very slow way.
>
> Here is my slow code:
> ======================
> for(i in 1:nrow(d)){
>     for(j in which( d$id == d$id[i] & d$date[j] < d$date[i]) ){
>         days_since = as.numeric( d$date[i] - d$date[j] )
>         w <- exp( -days_since/decay )
>         temp <- temp + w * as.numeric(d[j,'score'])
>         wTemp <- wTemp + w
>     }
>
>     temp <- temp / wTemp
>     d$newScore[i,] <- temp
> }
> ======================
>
> One immediate thought was to turn the "date" into an integer.  That
> should save a few cycles of date math.
>
> I need to do this process for a bunch of scores.  A grid search over
> different time decay levels might be nice.  So any speedup to this
> routine will save me a ton of time.
>
> Ideas?

A few quick ones.

You had said you tried data.table and found it to be slow still -- my
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.

Let's assume that your date is converted to some integer -- I will
leave that exercise to you :-) -- but it seems like you just want to
calculate the number of (whole) days since an event that you have a
record for, so this should be (in principle) easy to do (if you really
need the full power of "date math", data.table supports that as well).
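
As a minimal sketch of that conversion, assuming your dates are stored
as ISO-formatted strings or as `Date` objects (the toy data.frame below
is made up purely for illustration):

```r
## Toy data, made up for illustration only
d <- data.frame(id    = c(1, 1, 2),
                date  = c("2013-01-01", "2013-01-11", "2013-02-01"),
                score = c(1, 2, 3),
                stringsAsFactors = FALSE)

## as.Date() parses the strings; as.integer() then gives whole days
## since 1970-01-01, so differences are plain integer day counts
d$date <- as.integer(as.Date(d$date))

## whole days between the first two records
d$date[2] - d$date[1]
```

After this, subtracting two entries of `d$date` is ordinary integer
arithmetic and no `as.numeric()` call on a difftime is needed.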

Also, you never "reset" your `temp` (or `wTemp`) variable, so it looks
like you are carrying its value over from one row, and from one `id`
group, to the next (and, while I have no knowledge of your problem, I
would imagine this is not what you want to do).
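
For reference, here is how I read the intent of your loop, with the
accumulators reset for every row. This is only a sketch under my own
assumptions: the toy data and `decay <- 30` are made up, dates are
already integers, and rows with no earlier record for their `id` are
left as NA.

```r
decay <- 30  # made-up decay constant, in days

## Toy data, made up for illustration; `date` is already in integer days
d <- data.frame(id    = c(1, 1, 1, 2, 2),
                date  = c(0, 10, 40, 5, 6),
                score = c(1, 2, 3, 10, 20))

d$newScore <- NA_real_
for (i in seq_len(nrow(d))) {
  ## reset the accumulators for every row
  temp  <- 0
  wTemp <- 0
  ## all earlier records belonging to the same id
  for (j in which(d$id == d$id[i] & d$date < d$date[i])) {
    w     <- exp(-(d$date[i] - d$date[j]) / decay)
    temp  <- temp + w * d$score[j]
    wTemp <- wTemp + w
  }
  ## leave NA when there is no earlier record to average over
  if (wTemp > 0) d$newScore[i] <- temp / wTemp
}
```

This is still slow, but it gives you a known-good result to compare any
faster version against.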

Anyway some rough ideas to get you started:

R> d <- as.data.table(d)
R> setkeyv(d, c('id', 'date'))

Now records within each `id` are ordered by date, from first to last.

The specifics of your decay score escape me a bit, e.g. what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can just
calculate "days since" by subtracting the date recorded in the record
before it from the current date. (Note that `.I` is a special
data.table variable for the row number of a given record in the
original data.table):

d[, newScore := {
  ## TODO: handle the edge case for the first record w/in each `id`
  ## group, where `.I - 1` points at the last row of the previous group
  days_since <- date - d$date[.I - 1]
  w <- exp(-days_since / decay)
  ## ...
  ## Some other stuff you are doing here with `temp` which I can't
  ## understand ... then multiply the 'score' column for the given
  ## row by your correctly calculated weight `w` for that row
  ## (whatever it might be).
  w * score
}, by='id']
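
If, as in your original loop, every earlier record for the same `id`
should contribute (and not just the one immediately before), a grouped
helper function may be closer to what you want. The following is a
rough sketch, not a tuned implementation: `recency_score()` is a
hypothetical helper name, the toy data and `decay <- 30` are made up,
and returning NA for the first record of each `id` is my assumption.

```r
library(data.table)

decay <- 30  # made-up decay constant, in days

## Toy data, made up for illustration; `date` is already in integer days
d <- data.table(id    = c(1L, 1L, 1L, 2L, 2L),
                date  = c(0L, 10L, 40L, 5L, 6L),
                score = c(1, 2, 3, 10, 20))
setkeyv(d, c("id", "date"))

## Hypothetical helper: for each record, the recency-weighted mean of
## the scores of all strictly earlier records in the same group
recency_score <- function(date, score, decay) {
  vapply(seq_along(date), function(i) {
    prior <- date < date[i]
    if (!any(prior)) return(NA_real_)  # no earlier record: assumed NA
    w <- exp(-(date[i] - date[prior]) / decay)
    sum(w * score[prior]) / sum(w)
  }, numeric(1))
}

## One call per id; the work is quadratic only within each group
d[, newScore := recency_score(date, score, decay), by = id]
```

Since `decay` is just an argument, a grid search over decay values is a
matter of calling the same line with different constants.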

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech