[Rd] RE: [R] Benchmarking R, why sort() is so slow?

Martin Maechler Martin Maechler <maechler@stat.math.ethz.ch>
Fri, 27 Apr 2001 17:54:15 +0200


>>>>> "TL" == Thomas Lumley <tlumley@u.washington.edu> writes:

    TL> On Fri, 27 Apr 2001, Philippe Grosjean wrote:
    >> I suspect it should be a question of algorithm choice. May be sort() is
    >> important enough, including for large datasets, to place some improvement in
    >> the "to do" list for a further version...?
    >> 

    TL> Actually, I'm not convinced on the speed issue. Shellsort works
    TL> very well for vectors of lengths that R can reasonably
    TL> handle. Using a million record dataset in R you really can't get
    TL> much done in the 20seconds that a really good sorting routine might
    TL> save you.

    TL> Heapsort might be worth trying -- it has good worst-case behaviour
    TL> and is about half as fast as the typical case of quicksort.  Merge
    TL> sort might also be nice because it's stable: it leaves ties in the
    TL> order it found them, but it needs extra space. Still, I don't think
    TL> it's a high optimisation priority.

I tend to disagree.
For me, R is not just for data analysis.
It's my compute engine for almost everything, nowadays.
I definitely have wanted to add better sort() routines to R,
and I think R's default should be to use some  O(n log(n)) versions, at
least for larghish n.
20 years ago (or so), ``the'' good thing 
 {{for in-memory sorting; note that we might consider a different algorithm 
   for the case of a really large array where [i] is really disk access}}
was Singleton's Quickersort
  |  R.C. Singleton 'Revised Quickersort'
  |  (CACM Algorithm 347, Sept.1968)

(quicker than quicksort), a version of which (that does additionally return
 the sorted index) I had re-programmed in Pascal and then C.
I can put it up for ftp if there's interest.

The GSL (GNU scientific library) has sorting routines, too,
all of them heapsort.

>From an outdated manual,
     http://sources.redhat.com/gsl/ref/gsl-ref_17.html#SEC253

>> Sorting

>> This chapter describes functions for sorting data, both directly and
>> indirectly (using an index). All the functions use the heapsort
>> algorithm. Heapsort is an O(N \log N) algorithm which operates
>> in-place. It does not require any additional storage and provides
>> consistent performance. The running time for its worst-case (ordered
>> data) which is not significantly different from the average and best
>> cases. Note that the heapsort algorithm does not preserve the relative
>> ordering of equal elements -- it is an unstable sort. However the
>> resulting order of equal elements will be consistent across different
>> platforms when using these functions.

>> Sorting objects
>> 
>> The following function provides a simple alternative to the standard
>> library function qsort. It is intended for systems lacking qsort, not as
>> a replacement for it. The function qsort should be used whenever
>> possible, as it will be faster and can provide stable ordering of equal
>> elements. Documentation for qsort is available in the GNU C Library
>> Reference Manual.

I.e., they seem to say one should really use  qsort() whenever that is
available... hmm,
could we use  configure  to find if qsort() is available?

Martin

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._