[R] Improving data processing efficiency
Charles C. Berry
cberry at tajo.ucsd.edu
Sat Jun 7 02:23:32 CEST 2008
On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote:
>> p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
>> That should at least help you see where the slow bits are.
> so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the
> biggest timesuckers...
> i suppose i'll try using matrices and see how that stacks up (since all my
> cols are numeric, should be a problem-free approach).
> but i'm really wondering if there isn't some neat vectorized approach i could
> use to avoid at least one of the nested loops...
As for a vectorized solution, I'll bet you could do ALL the lookups of
non-issuers for all issuers with a single call to findInterval() (modulo
some cleanup afterwards), but the trickery needed to do that would make
your code a bit opaque.
And in the end I doubt it would beat mapply() (read on...) by enough to
make it worthwhile.
What you are doing is conditional on industry group and quarter.
So with a grouping key like
    indus.quarter <- with(tfdat,
        paste(as.character(DATE), as.character(HSICIG), sep="."))
and then calls like this:
    split( <various> , indus.quarter[ relevant.subset ] )
you can create:
  - a list of all issuer market caps according to quarter and group,
  - a list of all non-issuer caps (that satisfy your 'since quarter'
    restriction) according to quarter and group,
  - a list of all non-issuer indexes (i.e. row numbers) that satisfy
    that restriction according to quarter and group.
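A minimal sketch of building those three lists, on made-up data. DATE, HSICIG and Market.Cap.13f are the column names used in this thread; the is.issuer flag and all the values are assumptions invented for illustration, and the 'since quarter' restriction is left as a comment because its exact form is not given here.

```r
# Toy data: column names from the thread, values and is.issuer are made up.
tfdat <- data.frame(
  DATE           = c(1, 1, 1, 1, 2, 2, 2),
  HSICIG         = c(10, 10, 10, 10, 10, 10, 10),
  Market.Cap.13f = c(50, 40, 60, 90, 30, 20, 70),
  is.issuer      = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# One key per row: "quarter.industry"
indus.quarter <- with(tfdat,
  paste(as.character(DATE), as.character(HSICIG), sep = "."))

issuers    <- which(tfdat$is.issuer)
nonissuers <- which(!tfdat$is.issuer)  # the 'since quarter' restriction would also go here

# The three lists, each keyed by quarter and industry group
issuer.caps    <- split(tfdat$Market.Cap.13f[issuers],    indus.quarter[issuers])
nonissuer.caps <- split(tfdat$Market.Cap.13f[nonissuers], indus.quarter[nonissuers])
nonissuer.rows <- split(nonissuers,                       indus.quarter[nonissuers])
```

One thing to watch: with a character key, a group that has issuers but no eligible non-issuers simply drops out of the non-issuer lists, so the lists no longer line up. Turning the key into a factor first (indus.quarter <- factor(indus.quarter)) keeps empty groups as zero-length elements.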
Then you write a function that takes the elements of each list for a given
quarter-industry group, looks up the matching non-issuers for each issuer,
and returns their indexes.
findInterval() will allow you to do this lookup for all issuers in one
industry group in a given quarter simultaneously and greatly speed this
process (but you will need to deal with the possible non-uniqueness of the
non-issuer caps - perhaps by adding a tiny jitter() to the values).
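As a rough illustration of the per-group lookup, here is a sketch that assumes the matching rule is "the non-issuer with the smallest cap that is still >= the issuer's cap". That rule, the function name match.peers and its argument names are all assumptions, not something stated in the thread. The left.open argument of findInterval() needs R 3.3.0 or later.

```r
# Hypothetical helper: for each issuer cap, return the row number of the
# non-issuer with the smallest cap >= the issuer's cap (NA if there is none).
match.peers <- function(issuer.cap, peer.cap, peer.row) {
  if (length(peer.cap) == 0L) return(rep(NA_integer_, length(issuer.cap)))
  ord <- order(peer.cap)               # findInterval() needs sorted breakpoints
  sorted.cap <- peer.cap[ord]
  # number of peers whose cap is strictly below each issuer cap
  n.below <- findInterval(issuer.cap, sorted.cap, left.open = TRUE)
  pos <- n.below + 1L                  # first peer with cap >= issuer cap
  pos[pos > length(sorted.cap)] <- NA  # no peer is large enough
  peer.row[ord][pos]
}

# Three issuers looked up at once against three peers in rows 4, 2 and 3
match.peers(c(50, 95, 40), peer.cap = c(90, 40, 60), peer.row = c(4L, 2L, 3L))
```

In this sketch tied peer caps are resolved by whichever tied row order() places first; jitter() on the caps is another way to break ties if a different rule is wanted.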
Then you feed the function and the lists to mapply().
The result is a list of indexes on the original data.frame. You can
unsplit() this if you like, then use those indexes to build your final result.
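The mapply() and unsplit() steps might look roughly like this. Everything in the sketch is made up for illustration: the toy values, the row numbers, the name lookup, and the matching rule (smallest non-issuer cap that is >= the issuer's cap). The key is a factor so that all the lists share the same groups in the same order.

```r
# Toy inputs keyed by "quarter.industry"; all values are hypothetical.
key         <- factor(c("1.10", "1.10", "2.10"))  # one entry per issuer
issuer.rows <- c(1L, 5L, 8L)                       # issuer row numbers
issuer.caps <- split(c(50, 95, 30), key)
peer.caps   <- list(`1.10` = c(40, 60, 90), `2.10` = c(20, 70))
peer.rows   <- list(`1.10` = c(2L, 3L, 4L), `2.10` = c(6L, 7L))

# Hypothetical per-group lookup: row of the smallest peer cap >= issuer cap.
lookup <- function(icap, pcap, prow) {
  ord <- order(pcap)
  pos <- findInterval(icap, pcap[ord], left.open = TRUE) + 1L
  prow[ord][pos]  # a position past the end indexes to NA
}

# One call per quarter-industry group, results kept as a list
matched <- mapply(lookup, issuer.caps, peer.caps, peer.rows, SIMPLIFY = FALSE)

# Put the matches back in the original issuer order
peer.of.issuer <- unsplit(matched, key)
data.frame(issuer.row = issuer.rows, peer.row = peer.of.issuer)
```

SIMPLIFY = FALSE keeps the result as a list even when every group happens to return the same number of matches, which is what unsplit() expects.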
p.s. and if this all seems like too much work, you should at least avoid
needlessly creating data.frames. Specifically, reorder things so that
    industrypeers = <etc>
is only done ONCE for each industry group by quarter combination, and
change stuff like
    nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0
to
    any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
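A small check that the two forms agree, on made-up data. The column 'other' and all the values are invented for illustration; only the names industrypeers, arow and Market.Cap.13f come from the thread.

```r
# Toy stand-ins for the objects in the original loop
industrypeers <- data.frame(Market.Cap.13f = c(40, 60, 90),
                            other          = c("a", "b", "c"))
arow <- data.frame(Market.Cap.13f = 50)

# Builds a throwaway data.frame just to count its rows
slow <- nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

# Same answer from the logical vector alone, no data.frame subsetting
fast <- any(industrypeers$Market.Cap.13f >= arow$Market.Cap.13f)

identical(slow, fast)
```

The two are only interchangeable when the caps contain no NA: subsetting a data.frame with an NA index yields a row of NAs that nrow() counts, while any() returns NA unless some element is TRUE. Something like any(..., na.rm = TRUE) may be closer to what is wanted in that case.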
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901