[R] Improving data processing efficiency

Daniel Folkinshteyn dfolkins at gmail.com
Fri Jun 6 19:29:56 CEST 2008


Thanks for the tip! I'll try that and see how big a difference it 
makes. If I'm not sure exactly what the final size will be, am I better 
off making it larger and then stripping off the blank rows afterwards, 
or making it smaller and appending the missing rows?
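
For concreteness, the "make it larger, then strip the blank rows" option could 
look something like the toy sketch below (illustrative names only, not code 
from this thread). Over-allocating to an upper bound and trimming once at the 
end is generally the cheaper choice, since each append re-copies the whole object:

    ## toy input: keep only rows with positive x (stand-in for "a match was found")
    set.seed(1)
    m <- cbind(x = rnorm(1000), y = rnorm(1000))

    ## pre-allocate to an upper bound (here nrow(m)) and fill by a running index
    result <- matrix(NA_real_, nrow = nrow(m), ncol = ncol(m))
    k <- 0
    for (i in seq_len(nrow(m))) {
        if (m[i, "x"] > 0) {
            k <- k + 1
            result[k, ] <- m[i, ]
        }
    }

    ## strip the unused rows once, at the end
    result <- result[seq_len(k), , drop = FALSE]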

on 06/06/2008 11:44 AM Patrick Burns said the following:
> One thing that is likely to speed the code significantly
> is if you create 'result' to be its final size and then
> subscript into it.  Something like:
> 
>   result[i, ] <- bestpeer
> 
> (though I'm not sure if 'i' is the proper index).
> 
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
> 
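
A quick toy comparison (not from the thread; timings vary by machine) gives a 
feel for the size of this effect. The grow-by-rbind version re-copies the 
accumulated matrix on every iteration, while the pre-allocated version just 
writes into rows that already exist:

    n      <- 10000
    newrow <- matrix(rnorm(10), nrow = 1)

    ## grow by rbind: copies the accumulated matrix on every pass
    system.time({
        grown <- matrix(nrow = 0, ncol = 10)
        for (i in seq_len(n)) grown <- rbind(grown, newrow)
    })

    ## pre-allocate and fill by row index
    system.time({
        filled <- matrix(NA_real_, nrow = n, ncol = 10)
        for (i in seq_len(n)) filled[i, ] <- newrow
    })
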
> Daniel Folkinshteyn wrote:
>> Anybody have any thoughts on this? Please? :)
>>
>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>> Hi everyone!
>>>
>>> I have a question about data processing efficiency.
>>>
>>> My data are as follows: I have a dataset on quarterly institutional 
>>> ownership of equities; some of the firms have had recent IPOs, some 
>>> have not (marked by a binary flag). The total dataset size is 700k+ rows.
>>>
>>> My goal is this: For every quarter since issue for each IPO, I need 
>>> to find a "matched" firm in the same industry that is close in market 
>>> cap. So, e.g., for firm X, which had an IPO, I need to find a matched 
>>> non-issuing firm in quarter 1 since IPO, then a (possibly different) 
>>> non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing 
>>> firm (there are about 8300 of these).
>>>
>>> Thus it seems to me that I need to do a lot of data selection, 
>>> subsetting, and looping (yikes!), but what I have come up with is 
>>> highly inefficient and takes ages (well, many hours). What I am 
>>> doing, in pseudocode, is this:
>>>
>>> 1. For each quarter of data, extract all the IPOs and all the 
>>> eligible non-issuing firms.
>>> 2. For each IPO in a quarter, grab all the non-issuers in the same 
>>> industry, sort them by size, and take the matching firm closest in 
>>> size (specifically, the closest bigger firm if one exists, and 
>>> otherwise the biggest available if all are smaller).
>>> 3. Assign the matched firm-observation the same "quarters since 
>>> issue" as the IPO being matched.
>>> 4. rbind them all into the "matching" dataset.
>>>
>>> The function I currently have is pasted below, for your reference. Is 
>>> there any way to make it produce the same result but much faster? 
>>> Specifically, I am guessing that eliminating some loops would help a 
>>> lot, but I don't see how, since I need to do some fancy footwork for 
>>> each IPO in each quarter to find the matching firm. I'll be doing a 
>>> few things similar to this, so it's fairly important to improve the 
>>> efficiency here. Maybe some of you R-fu masters can clue me in? :)
>>>
>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>
>>> ========== my function below ===========
>>>
>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
>>>                                                              quarters_since_issue=40) {
>>>
>>>     # rbind on a matrix is cheaper than on a data frame, so build the
>>>     # result as a matrix and convert back at the end
>>>     result = matrix(nrow=0, ncol=ncol(tfdata))
>>>     colnames = names(tfdata)
>>>
>>>     quarterends = sort(unique(tfdata$DATE))
>>>
>>>     for (aquarter in quarterends) {
>>>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>
>>>         # eligible non-issuers and IPO issuers in this quarter
>>>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>             (tfdata_quarter$IPO.Flag == 0), ]
>>>         tfdata_quarter_ipoissuers = tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
>>>
>>>         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>>             arow = tfdata_quarter_ipoissuers[i, ]
>>>
>>>             # same-industry non-issuers, sorted by market cap
>>>             industrypeers = tfdata_quarter_fitting_nonissuers[
>>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>             industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
>>>
>>>             if (nrow(industrypeers) > 0) {
>>>                 # closest bigger firm if one exists, otherwise the biggest available
>>>                 if (nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0) {
>>>                     bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1, ]
>>>                 } else {
>>>                     bestpeer = industrypeers[nrow(industrypeers), ]
>>>                 }
>>>                 bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
>>>                 # tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>                 result = rbind(result, as.matrix(bestpeer))
>>>             }
>>>         }
>>>         # result = rbind(result, tfdata_quarter)
>>>         print(aquarter)
>>>     }
>>>
>>>     result = as.data.frame(result)
>>>     names(result) = colnames
>>>     return(result)
>>> }
>>>
>>> ========= end of my function =============
>>>
>>
>
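
For reference, one way the quoted loop could be restructured: pick each peer 
with which.min()/which.max() instead of sorting the full peer set for every 
IPO, collect the per-quarter matches in a list, and rbind once per quarter 
plus once at the end instead of once per match. The sketch below runs on toy 
data and uses column names that mirror the quoted function; it is an 
illustration, not code from the thread.

    set.seed(1)
    toy <- data.frame(
        DATE                        = rep(1:4, each = 50),
        HSICIG                      = sample(1:5, 200, replace = TRUE),
        IPO.Flag                    = rbinom(200, 1, 0.2),
        Market.Cap.13f              = rlnorm(200),
        Quarters.Since.Latest.Issue = sample(0:80, 200, replace = TRUE),
        Quarters.Since.IPO.Issue    = sample(1:40, 200, replace = TRUE)
    )

    match_one_quarter <- function(q, tfdata, quarters_since_issue = 40) {
        qdat       <- tfdata[tfdata$DATE == q, ]
        nonissuers <- qdat[qdat$Quarters.Since.Latest.Issue > quarters_since_issue &
                           qdat$IPO.Flag == 0, ]
        ipos       <- qdat[qdat$IPO.Flag == 1, ]
        out        <- vector("list", nrow(ipos))
        for (i in seq_len(nrow(ipos))) {
            arow  <- ipos[i, ]
            peers <- nonissuers[nonissuers$HSICIG == arow$HSICIG, ]
            if (nrow(peers) == 0) next
            bigger <- peers[peers$Market.Cap.13f >= arow$Market.Cap.13f, ]
            ## closest bigger peer if one exists, otherwise the biggest available
            best <- if (nrow(bigger) > 0) {
                bigger[which.min(bigger$Market.Cap.13f), ]
            } else {
                peers[which.max(peers$Market.Cap.13f), ]
            }
            best$Quarters.Since.IPO.Issue <- arow$Quarters.Since.IPO.Issue
            out[[i]] <- best
        }
        do.call(rbind, out)      # one rbind per quarter
    }

    ## one final rbind over all quarters, instead of one per match
    matched <- do.call(rbind, lapply(sort(unique(toy$DATE)),
                                     match_one_quarter, tfdata = toy))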


