[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Bill Dunlap bill at insightful.com
Sat Jul 29 02:31:44 CEST 2006


On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

> Hi Bill,
>
> > Splus8.0 has something like what you are talking about
> > that provides a fast way to compute
> >     sapply(split(xVector, integerGroupCode), summaryFunction)
> > for some common summary functions.  The 'integerGroupCode'
> > is typically the codes from a factor, but you could compute
> > it in other ways.  It needs to be a "small" integer in
> > the range 1:ngroups (like the 'bin' argument to tabulate).
> > Like tabulate(), which is called from table(), these are
> > meant to be called from other functions that can set up
> > appropriate group codes.  E.g., groupSums or rowSums or
> > fancier things could be based on this.
> >
> > They do not insist you sort the input in any way.  That
> > would really only be useful for group medians and we haven't
> > written that one yet.
>
> The sort is also useful for recoding each group into subgroups based
> on some other numeric vector.  This is the problem I run into when
> trying to build portfolios that can be used as benchmarks for
> long-term stock returns.
>
> Another issue is that recoding a long character vector that I use as
> a sort key (for accessing a subgroup of the data in the data.frame)
> into a set of small integers is not fast.  I can make a fast
> implementation if the data is sorted by the key, but without the
> sort, just converting my sort keys to the required small integer
> codes would be expensive for very long vectors, since the small
> integer codes would have to reflect the order of the data (i.e. be
> increasing subportfolio numbers).

True, but the underlying grouped summary code
shouldn't require you to do the sorting.  If
    codes <- match(char, sort(unique(char)))
is too slow, then you could try sorting the
data set by the 'char' column and doing
    codes <- cumsum(c(TRUE, char[-1] != char[-length(char)]))
(loop over the columns before doing the cumsum if the
key is made up of several columns).
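
For concreteness, a tiny self-contained demo of the two recodings
(toy data; the variable names are made up for illustration):

    char <- c("b", "a", "a", "c", "b", "a")

    ## unsorted data: match each key against the sorted unique keys
    codes1 <- match(char, sort(unique(char)))

    ## pre-sorted data: a new code starts wherever the key changes
    s <- sort(char)
    codes2 <- cumsum(c(TRUE, s[-1] != s[-length(s)]))

    stopifnot(identical(codes2, sort(codes1)))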

If that isn't fast enough, then you could sort
in the C code as well, but I think there could
be lots of cases where that is slower.

I've used this code for out-of-core applications,
where I definitely do not want to sort the whole
dataset.

> More specifically, I am now converting all of my SAS code to R
> code, and the problem is that I have lots of snippets of SAS that
> do the following ...
>
> PROC SORT;
>    BY MDSIZ FSIZ;
>
> /* WRITE OUT THE MIN SIZE CUTOFF VALUES */
> PROC UNIVARIATE NOPRINT;
>     VAR FSIZ;
>     BY MDSIZ;
>     OUTPUT OUT=TMPS1 MIN=XMIN;
>
> where my sort key MDSIZ is a character string that concatenates the
> month-ending date (MD) and a particular firm's size portfolio
> (SIZ).  I want to find the cutoff points (the minima) for each of
> the portfolios at every month-end date across all traded firms.
>
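For reference, the per-group minima from that PROC UNIVARIATE step
can be had in base R without presorting.  A sketch, assuming a data
frame 'd' with columns MDSIZ and FSIZ:

    ## minimum FSIZ within each MDSIZ group, keyed by group
    xmin <- tapply(d$FSIZ, d$MDSIZ, min)
    tmps1 <- data.frame(MDSIZ = names(xmin), XMIN = as.vector(xmin))
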
>
> >
> > The typical prototype is
> >
> >> igroupSums
> > function(x, group = NULL, na.rm = F, weights = NULL,
> >          ngroups = if (is.null(group)) 1
> >                    else max(as.integer(group), na.rm = T))
> >
> > and the currently supported summary functions are
> >    mean : igroupMeans
> >    sum : igroupSums
> >    prod : igroupProds
> >    min : igroupMins
> >    max : igroupMaxs
> >    range : igroupRanges
> >    any : igroupAnys
> >    all : igroupAlls
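
For readers without S-PLUS, here is a pure-R reference implementation
of that contract for sums (a guess at the single-pass algorithm; the
real versions run at C level and also handle na.rm and weights, which
this sketch omits):

    igroupSums0 <- function(x, group, ngroups = max(group)) {
        out <- numeric(ngroups)
        ## one pass over x: add each element into its group's slot
        for (k in seq_along(x))
            out[group[k]] <- out[group[k]] + x[k]
        out
    }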
>
> SAS is similar in that it also has a specific list of functions you
> can request, including all of the basic statistics from a PROC
> UNIVARIATE (higher-moment measures such as skewness and kurtosis,
> robust statistics, and even statistical test results for each coded
> subgroup), and the nice thing is that all combinations can be done
> with one call.
>
> To do that, SAS does require the presorting, but it runs really
> fast even for super long vectors with lots of sort keys.
>
> Similarly, the next snippet of code will take the file, resort it
> by the portfolio key and then by the market-to-book ratio (MTB) for
> all trading firms for all monthly periods since 1980, and then
> split each size portfolio for each month-ending date into 5 equal
> portfolios based on market-to-book ratios (thus the need for the
> sort).  SAS returns a coded integer vector PMTB (integers 1 to 5,
> with 1 for the smallest MTB and 5 for the largest) repeated for
> each subgroup of MDSIZ.  PMTB matches the original vector in
> length and therefore fits right into the data frame.
>
> /* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
> PROC SORT;
>    BY MDSIZ MTB;
> PROC RANK GROUPS=5 OUT=TMPS0;
>     VAR MTB;
>     RANKS PMTB;
>     BY MDSIZ;
>
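A rough base-R equivalent of that PROC RANK step, as a sketch (the
data frame 'd' and its columns MTB and MDSIZ are assumed names, and
SAS's tie handling is only approximated):

    ## rank MTB within each MDSIZ group, then bin the ranks into
    ## five nearly equal groups (1 = smallest MTB, 5 = largest)
    quintile <- function(v)
        as.integer(cut(rank(v, ties.method = "first"),
                       breaks = 5, labels = FALSE))
    d$PMTB <- ave(d$MTB, d$MDSIZ, FUN = quintile)
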
> The problem of assigning elements of a long data vector to
> portfolios and subportfolios, based on cutoff values in specific
> data columns that must be calculated at each step rather than being
> fixed or hardcoded, is one that finance runs into regularly (and
> therefore I run into it).
>
> So by sorting I could handle the need for "small integer" recoding
> and the small integers would have meaning (i.e. higher values will
> represent larger MTB firms, etc).
>
> That just leaves the problem of calculating stats on short
> sub-sequences of a longer vector.
>
> > They are fast:
> >
> >> x<-runif(2e6)
> >> i<-rep(1:1e6, 2)
> >> sys.time(sx <- igroupSums(x,i))
> >    [1] 0.66 0.67
> >> length(sx)
> >    [1] 1000000
> >
> > On that machine R takes 44 seconds to go the lapply/split
> > route:
> >
> >> unix.time(unlist(lapply(split(x,i), sum)))
> >    [1] 43.24  0.78 44.11  0.00  0.00
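
In base R, rowsum() gives a comparable grouped sum at C level and
avoids the split() overhead.  A quick sketch (timings not reproduced
here):

    ## sum x within the groups coded in i; returns a one-column
    ## matrix with one row per distinct group code
    sx <- rowsum(x, i)

tapply(x, i, sum) also works, but it goes through split() internally.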
>
>
> Yes!  That is exactly what I need.
>
> Are there plans for adding something like that to R?
>
> Thanks,
>
> Kevin
>
>

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."


