[R] Improving data processing efficiency

Gabor Grothendieck ggrothendieck at gmail.com
Fri Jun 6 19:35:48 CEST 2008


I think the posting guide may not be clear enough and have suggested that
it be clarified.  Hopefully this better communicates what is required and why
in a shorter amount of space:

https://stat.ethz.ch/pipermail/r-devel/2008-June/049891.html


On Fri, Jun 6, 2008 at 1:25 PM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
> i thought since the function code (which i provided in full) was pretty
> short, it would be reasonably easy to just read the code and see what it's
> doing.
>
> but ok, so... i am attaching a zip file with a small sample of the data set
> (tab delimited) and the function code (posting guidelines claim that "some
> archive formats" are allowed, i assume zip is one of them...).
>
> would appreciate your comments! :)
>
> on 06/06/2008 12:05 PM Gabor Grothendieck said the following:
>>
>> It's summarized in the last line to r-help.  Note reproducible and
>> minimal.
>>
>> On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
>>>
>>> i did! what did i miss?
>>>
>>> on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
>>>>
>>>> Try reading the posting guide before posting.
>>>>
>>>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
>>>>>
>>>>> Anybody have any thoughts on this? Please? :)
>>>>>
>>>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>>>
>>>>>> Hi everyone!
>>>>>>
>>>>>> I have a question about data processing efficiency.
>>>>>>
>>>>>> My data are as follows: I have a data set on quarterly institutional
>>>>>> ownership of equities; some of them have had recent IPOs, some have not
>>>>>> (I have a binary flag set). The total dataset size is 700k+ rows.
>>>>>>
>>>>>> My goal is this: For every quarter since issue for each IPO, I need to
>>>>>> find a "matched" firm in the same industry, and close in market cap. So,
>>>>>> e.g., for firm X, which had an IPO, i need to find a matched non-issuing
>>>>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing firm
>>>>>> in quarter 2 since IPO, etc. Repeat for each issuing firm (there are about
>>>>>> 8300 of these).
>>>>>>
>>>>>> Thus it seems to me that I need to be doing a lot of data selection and
>>>>>> subsetting, and looping (yikes!), but the result appears to be highly
>>>>>> inefficient and takes ages (well, many hours). What I am doing, in
>>>>>> pseudocode, is this:
>>>>>>
>>>>>> 1. for each quarter of data, getting out all the IPOs and all the
>>>>>>    eligible non-issuing firms.
>>>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>>>>>    industry, sort them by size, and finally grab a matching firm closest
>>>>>>    in size (the exact procedure is to grab the closest bigger firm if one
>>>>>>    exists, and just the biggest available if all are smaller)
>>>>>> 3. assign the matched firm-observation the same "quarters since issue"
>>>>>>    as the IPO being matched
>>>>>> 4. rbind them all into the "matching" dataset.
>>>>>>
>>>>>> The function I currently have is pasted below, for your reference. Is
>>>>>> there any way to make it produce the same result but much faster?
>>>>>> Specifically, I am guessing eliminating some loops would be very good,
>>>>>> but I don't see how, since I need to do some fancy footwork for each IPO
>>>>>> in each quarter to find the matching firm. I'll be doing a few things
>>>>>> similar to this, so it's somewhat important to up the efficiency of this.
>>>>>> Maybe some of you R-fu masters can clue me in? :)
>>>>>>
>>>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>>>
>>>>>> ========== my function below ===========
>>>>>>
>>>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {
>>>>>>
>>>>>>  # rbind for matrix is cheaper, so typecast the result to matrix
>>>>>>  result = matrix(nrow=0, ncol=ncol(tfdata))
>>>>>>
>>>>>>  colnames = names(tfdata)
>>>>>>
>>>>>>  quarterends = sort(unique(tfdata$DATE))
>>>>>>
>>>>>>  for (aquarter in quarterends) {
>>>>>>      tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>>>
>>>>>>      tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>>>>          (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>>>          (tfdata_quarter$IPO.Flag == 0), ]
>>>>>>      tfdata_quarter_ipoissuers = tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
>>>>>>
>>>>>>      # seq_len() rather than 1:nrow(), so a quarter with no IPOs is skipped
>>>>>>      for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>>>>>>          arow = tfdata_quarter_ipoissuers[i,]
>>>>>>          industrypeers = tfdata_quarter_fitting_nonissuers[
>>>>>>              tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>>>          industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
>>>>>>          if ( nrow(industrypeers) > 0 ) {
>>>>>>              if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
>>>>>>                  bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
>>>>>>              }
>>>>>>              else {
>>>>>>                  bestpeer = industrypeers[nrow(industrypeers),]
>>>>>>              }
>>>>>>              bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
>>>>>>
>>>>>>              #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>>>>              result = rbind(result, as.matrix(bestpeer))
>>>>>>          }
>>>>>>      }
>>>>>>      #result = rbind(result, tfdata_quarter)
>>>>>>      print (aquarter)
>>>>>>  }
>>>>>>
>>>>>>  result = as.data.frame(result)
>>>>>>  names(result) = colnames
>>>>>>  return(result)
>>>>>>
>>>>>> }
>>>>>>
>>>>>> ========= end of my function =============
>>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>
>
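
For reference, steps 1-4 of the original post can be sketched without the
per-row loop. This is only a sketch under stated assumptions, not the original
poster's method: the column names are taken from the quoted function, the
helper name match_nonissuers and the toy data are hypothetical, and the key
columns are assumed to contain no missing values. It splits the data once by
(quarter, industry), sorts each pool of non-issuers by market cap, and uses
findInterval() to pick the closest bigger firm for all IPOs in a group at once,
falling back to the biggest firm when all peers are smaller.

```r
# Hypothetical helper; column names come from the function quoted above.
match_nonissuers <- function(tfdata, quarters_since_issue = 40) {
  ipos  <- tfdata[tfdata$IPO.Flag == 1, ]
  peers <- tfdata[tfdata$IPO.Flag == 0 &
                  tfdata$Quarters.Since.Latest.Issue > quarters_since_issue, ]

  # One group per (quarter, industry) pair, built once instead of per IPO.
  ipo_groups  <- split(ipos,  paste(ipos$DATE,  ipos$HSICIG),  drop = TRUE)
  peer_groups <- split(peers, paste(peers$DATE, peers$HSICIG), drop = TRUE)

  matched <- lapply(names(ipo_groups), function(key) {
    pool <- peer_groups[[key]]
    if (is.null(pool)) return(NULL)   # no eligible non-issuer in this group
    ipo  <- ipo_groups[[key]]
    pool <- pool[order(pool$Market.Cap.13f), ]

    # Number of peers strictly smaller than each IPO; the next one up is the
    # closest bigger (or equal) firm. Cap the index at the biggest peer.
    below <- findInterval(ipo$Market.Cap.13f, pool$Market.Cap.13f,
                          left.open = TRUE)
    best  <- pool[pmin(below + 1L, nrow(pool)), ]
    best$Quarters.Since.IPO.Issue <- ipo$Quarters.Since.IPO.Issue
    best
  })

  # Bind once at the end rather than growing the result inside a loop.
  do.call(rbind, matched)
}

# Toy data (made up) to exercise the helper.
toy <- data.frame(
  PERMNO = 1:7,
  DATE = rep(20080331, 7),
  HSICIG = c("A", "A", "A", "A", "A", "B", "B"),
  IPO.Flag = c(1, 1, 0, 0, 0, 1, 0),
  Market.Cap.13f = c(100, 900, 50, 150, 400, 10, 20),
  Quarters.Since.Latest.Issue = c(1, 2, 50, 60, 70, 3, 10),
  Quarters.Since.IPO.Issue = c(1, 2, NA, NA, NA, 3, NA)
)
res <- match_nonissuers(toy)
```

The rows come back ordered by group key rather than in the order the loop
version would produce, and the left.open argument of findInterval() needs
R 3.3.0 or later, so treat this as a starting point to check against the
original function on a small sample.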


