[R] Improving data processing efficiency

Fri Jun 6 18:05:21 CEST 2008

Its summarized in the last line to r-help.  Note reproducible and
minimal.

On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
> i did! what did i miss?
>
> on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
>>
>> Try reading the posting guide before posting.
>>
>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <dfolkins at gmail.com>
>> wrote:
>>>
>>> Anybody have any thoughts on this? Please? :)
>>>
>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>
>>>> Hi everyone!
>>>>
>>>> I have a question about data processing efficiency.
>>>>
>>>> My data are as follows: I have a data set on quarterly institutional
>>>> ownership of equities; some of them have had recent IPOs, some have not
>>>> (I
>>>> have a binary flag set). The total dataset size is 700k+ rows.
>>>>
>>>> My goal is this: For every quarter since issue for each IPO, I need to
>>>> find a "matched" firm in the same industry, and close in market cap. So,
>>>> e.g., for firm X, which had an IPO, i need to find a matched non-issuing
>>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing
>>>> firm in
>>>> quarter 2 since IPO, etc. Repeat for each issuing firm (there are about
>>>> 8300
>>>> of these).
>>>>
>>>> Thus it seems to me that I need to be doing a lot of data selection and
>>>> subsetting, and looping (yikes!), but the result appears to be highly
>>>> inefficient and takes ages (well, many hours). What I am doing, in
>>>> pseudocode, is this:
>>>>
>>>> 1. for each quarter of data, getting out all the IPOs and all the
>>>> eligible
>>>> non-issuing firms.
>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>>> industry, sort them by size, and finally grab a matching firm closest in
>>>> size (the exact procedure is to grab the closest bigger firm if one
>>>> exists,
>>>> and just the biggest available if all are smaller)
>>>> 3. assign the matched firm-observation the same "quarters since issue"
>>>> as
>>>> the IPO being matched
>>>> 4. rbind them all into the "matching" dataset.
>>>>
>>>> The function I currently have is pasted below, for your reference. Is
>>>> there any way to make it produce the same result but much faster?
>>>> Specifically, I am guessing eliminating some loops would be very good,
>>>> but I
>>>> don't see how, since I need to do some fancy footwork for each IPO in
>>>> each
>>>> quarter to find the matching firm. I'll be doing a few things similar to
>>>> this, so it's somewhat important to up the efficiency of this. Maybe
>>>> some of
>>>> you R-fu masters can clue me in? :)
>>>>
>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>
>>>> ========== my function below ===========
>>>>
>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>> quarters_since_issue=40) {
>>>>
>>>>   result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is
>>>> cheaper, so typecast the result to matrix
>>>>
>>>>   colnames = names(tfdata)
>>>>
>>>>   quarterends = sort(unique(tfdata$DATE))
>>>>
>>>>   for (aquarter in quarterends) {
>>>>       tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>
>>>>       tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>> (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>> (tfdata_quarter$IPO.Flag == 0), ]
>>>>       tfdata_quarter_ipoissuers = tfdata_quarter[
>>>> tfdata_quarter$IPO.Flag
>>>> == 1, ]
>>>>
>>>>       for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>>>           arow = tfdata_quarter_ipoissuers[i,]
>>>>           industrypeers = tfdata_quarter_fitting_nonissuers[
>>>> tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>           industrypeers = industrypeers[
>>>> order(industrypeers$Market.Cap.13f), ]
>>>>           if ( nrow(industrypeers) > 0 ) {
>>>>               if ( nrow(industrypeers[industrypeers$Market.Cap.13f >=
>>>> arow$Market.Cap.13f, ]) > 0 ) {
>>>>                   bestpeer = industrypeers[industrypeers$Market.Cap.13f
>>>>>
>>>>> = arow$Market.Cap.13f, ][1,]
>>>>
>>>>               }
>>>>               else {
>>>>                   bestpeer = industrypeers[nrow(industrypeers),]
>>>>               }
>>>>               bestpeer$Quarters.Since.IPO.Issue =
>>>> arow$Quarters.Since.IPO.Issue
>>>>
>>>> #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
>>>> bestpeer$PERMNO] = 1
>>>>               result = rbind(result, as.matrix(bestpeer))
>>>>           }
>>>>       }
>>>>       #result = rbind(result, tfdata_quarter)
>>>>       print (aquarter)
>>>>   }
>>>>
>>>>   result = as.data.frame(result)
>>>>   names(result) = colnames
>>>>   return(result)
>>>>
>>>> }
>>>>
>>>> ========= end of my function =============
>>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>