[R] Improving data processing efficiency

Sat Jun 7 02:23:32 CEST 2008

On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote:

>>  install.packages("profr")
>>  library(profr)
>>  p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
>>  plot(p)
>>
>>  That should at least help you see where the slow bits are.
>>
>>  Hadley
>> 
> so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the 
> biggest timesuckers...
>
> i suppose i'll try using matrices and see how that stacks up (since all my 
> cols are numeric, should be a problem-free approach).
>
> but i'm really wondering if there isn't some neat vectorized approach i could 
> use to avoid at least one of the nested loops...
>

As for a vectorized solution, I'll bet you could do ALL the lookups of 
non-issuers for all issuers with a single call to findInterval() (modulo 
some cleanup afterwards), but the trickery needed to do that would make 
your code a bit opaque.

And in the end I doubt it would beat mapply() (read on...) by enough to 
make it worthwhile.

---

What you are doing is conditional on industry group and quarter.

So using

 	indus.quarter <- with(tfdat,
 		paste(as.character(DATE), as.character(HSICIG), sep="."))

and then calls like this:

 	split( <various> , indus.quarter[ relevant.subset ] )

you can create:

 	a list of all issuer market caps according to quarter and group,

 	a list of all non-issuer caps (that satisfy your 'since quarter'
 	restriction) according to quarter and group,

 	a list of all non-issuer indexes (i.e. row numbers) that satisfy
 	that restriction according to quarter and group
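
A rough sketch of building those three lists, on toy data. DATE, HSICIG and 
Market.Cap.13f are the column names from your code; the is.issuer and 
eligible columns are made up here to stand in for however you flag issuers 
and apply the 'since quarter' restriction.

```r
# Toy data: is.issuer and eligible are hypothetical flag columns.
tfdat <- data.frame(
  DATE           = c(1, 1, 1, 1, 2, 2, 2),
  HSICIG         = c(10, 10, 10, 20, 10, 10, 10),
  Market.Cap.13f = c(100, 90, 120, 50, 200, 210, 180),
  is.issuer      = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  eligible       = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

# One key per row: "quarter.industry"
indus.quarter <- with(tfdat,
  paste(as.character(DATE), as.character(HSICIG), sep = "."))

iss <- tfdat$is.issuer
non <- !tfdat$is.issuer & tfdat$eligible

# Each list has one element per quarter-industry key present in the subset
issuer.caps    <- split(tfdat$Market.Cap.13f[iss], indus.quarter[iss])
nonissuer.caps <- split(tfdat$Market.Cap.13f[non], indus.quarter[non])
nonissuer.rows <- split(which(non), indus.quarter[non])
```

Note that the issuer and non-issuer lists need not have the same names (here 
"1.20" has non-issuers but no issuer), so they have to be lined up by name 
before being processed together.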

Then you write a function that takes the elements of each list for a given 
quarter-industry group, looks up the matching non-issuers for each issuer, 
and returns their indexes.

findInterval() will allow you to do this lookup for all issuers in one 
industry group in a given quarter simultaneously and greatly speed up this 
process (but you will need to deal with the possible non-uniqueness of the 
non-issuer caps - perhaps by adding a tiny jitter() to the values).
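
Something like the following shows the idea for one quarter-industry group. 
The matching rule is an assumption on my part (for each issuer, take the 
non-issuer with the largest cap that does not exceed the issuer's cap), and 
match_below is just a name made up for the sketch; adapt it to your actual 
criterion.

```r
# Hypothetical helper: for each issuer cap, return the row number of the
# non-issuer whose cap is the largest one not exceeding it (NA if none).
match_below <- function(issuer.caps, nonissuer.caps, nonissuer.rows) {
  ord <- order(nonissuer.caps)            # findInterval() needs sorted breaks
  pos <- findInterval(issuer.caps, nonissuer.caps[ord])
  pos[pos == 0] <- NA                     # issuer cap is below every non-issuer
  nonissuer.rows[ord][pos]
}

# Three issuers looked up against three non-issuers in a single call
res <- match_below(c(100, 50, 300), c(120, 90, 210), c(2L, 3L, 6L))
```

With tied non-issuer caps, findInterval() lands on the last of the tied 
values in sorted order, which is why some tie-breaking such as jitter() may 
be wanted.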

Then you feed the function and the lists to mapply().

The result is a list of indexes on the original data.frame. You can 
unsplit() this if you like, then use those indexes to build your final 
"result" data.frame.
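
A minimal sketch of the mapply() and unsplit() steps, again on made-up 
values. The matching function and its rule are hypothetical (same assumption 
as before: largest non-issuer cap not above the issuer's cap), and the 
sketch assumes every issuer group also has non-issuers.

```r
# Issuers in original row order, with their quarter-industry keys
issuer.cap   <- c(100, 200, 130)
issuer.group <- c("1.10", "2.10", "1.10")
issuer.caps  <- split(issuer.cap, issuer.group)

# Hypothetical non-issuer lists, keyed the same way
nonissuer.caps <- list("1.10" = c(90, 120), "2.10" = c(210, 180))
nonissuer.rows <- list("1.10" = c(2L, 3L), "2.10" = c(6L, 7L))

# Hypothetical matching rule, applied to one group at a time
match_below <- function(icaps, ncaps, nrows) {
  ord <- order(ncaps)
  pos <- findInterval(icaps, ncaps[ord])
  pos[pos == 0] <- NA
  nrows[ord][pos]
}

# Line the lists up by name, then process all groups
keys <- names(issuer.caps)
matches <- mapply(match_below,
                  issuer.caps[keys], nonissuer.caps[keys], nonissuer.rows[keys],
                  SIMPLIFY = FALSE)

# Back to one matched row number per issuer, in the original issuer order
matched.rows <- unsplit(matches, issuer.group)
```

matched.rows can then be used to index the original data.frame once, rather 
than subsetting it inside a loop.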

HTH,

Chuck

p.s. and if this all seems like too much work, you should at least avoid 
needlessly creating data.frames. Specifically

reorder things so that

 	   industrypeers = <etc>

is only done ONCE for each industry group by quarter combination and 
change stuff like

nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

to

any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
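
A quick check on toy data that the two forms agree, assuming there are no 
missing caps (the id column and all the values are invented for the 
example):

```r
# Toy stand-ins for the objects in your loop
industrypeers <- data.frame(id = 1:3, Market.Cap.13f = c(90, 120, 210))
arow <- data.frame(id = 9L, Market.Cap.13f = 100)

# Builds a throwaway data.frame just to count its rows
slow <- nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

# Works on the logical vector directly
fast <- any(industrypeers$Market.Cap.13f >= arow$Market.Cap.13f)
```

The second form never goes through '[.data.frame', which is what your 
profile flagged as the main cost.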

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901