[R] Improving data processing efficiency

Daniel Folkinshteyn dfolkins at gmail.com
Fri Jun 6 17:12:32 CEST 2008


Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
> Hi everyone!
> 
> I have a question about data processing efficiency.
> 
> My data are as follows: I have a data set on quarterly institutional 
> ownership of equities; some of them have had recent IPOs, some have not 
> (I have a binary flag set). The total dataset size is 700k+ rows.
> 
> My goal is this: For every quarter since issue for each IPO, I need to 
> find a "matched" firm in the same industry, and close in market cap. So, 
> e.g., for firm X, which had an IPO, i need to find a matched non-issuing 
> firm in quarter 1 since IPO, then a (possibly different) non-issuing 
> firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
> are about 8300 of these).
> 
> Thus it seems to me that I need to be doing a lot of data selection and 
> subsetting, and looping (yikes!), but the result appears to be highly 
> inefficient and takes ages (well, many hours). What I am doing, in 
> pseudocode, is this:
> 
> 1. for each quarter of data, getting out all the IPOs and all the 
> eligible non-issuing firms.
> 2. for each IPO in a quarter, grab all the non-issuers in the same 
> industry, sort them by size, and finally grab a matching firm closest in 
> size (the exact procedure is to grab the closest bigger firm if one 
> exists, and just the biggest available if all are smaller)
> 3. assign the matched firm-observation the same "quarters since issue" 
> as the IPO being matched
> 4. rbind them all into the "matching" dataset.
> 
> The function I currently have is pasted below, for your reference. Is 
> there any way to make it produce the same result but much faster? 
> Specifically, I am guessing eliminating some loops would be very good, 
> but I don't see how, since I need to do some fancy footwork for each IPO 
> in each quarter to find the matching firm. I'll be doing a few things 
> similar to this, so it's somewhat important to up the efficiency of 
> this. Maybe some of you R-fu masters can clue me in? :)
> 
> I would appreciate any help, tips, tricks, tweaks, you name it! :)
> 
> ========== my function below ===========
> 
> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
> quarters_since_issue=40) {
> 
>     result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
> cheaper, so typecast the result to matrix
> 
>     colnames = names(tfdata)
> 
>     quarterends = sort(unique(tfdata$DATE))
> 
>     for (aquarter in quarterends) {
>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
> 
>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
> (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
> (tfdata_quarter$IPO.Flag == 0), ]
>         tfdata_quarter_ipoissuers = tfdata_quarter[ 
> tfdata_quarter$IPO.Flag == 1, ]
> 
>         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>             arow = tfdata_quarter_ipoissuers[i,]
>             industrypeers = tfdata_quarter_fitting_nonissuers[ 
> tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>             industrypeers = industrypeers[ 
> order(industrypeers$Market.Cap.13f), ]
>             if ( nrow(industrypeers) > 0 ) {
>                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
> arow$Market.Cap.13f, ]) > 0 ) {
>                     bestpeer = 
> industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
>                 }
>                 else {
>                     bestpeer = industrypeers[nrow(industrypeers),]
>                 }
>                 bestpeer$Quarters.Since.IPO.Issue = 
> arow$Quarters.Since.IPO.Issue
> 
> #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
> bestpeer$PERMNO] = 1
>                 result = rbind(result, as.matrix(bestpeer))
>             }
>         }
>         #result = rbind(result, tfdata_quarter)
>         print (aquarter)
>     }
> 
>     result = as.data.frame(result)
>     names(result) = colnames
>     return(result)
> 
> }
> 
> ========= end of my function =============
>



More information about the R-help mailing list