[R] bigest part of vector

Tue Feb 24 23:34:32 CET 2009

Well, yes to all -- that's why I pointed this out. Sorting is (depending on
details) >= O(n log(n)) while the quantile algorithm uses R's partial sort
option, which I think would be O(n) for a single quantile. For n of a few
thousand or so, the difference shouldn't be noticeable. For large n, of
course it will be ...

Ties can inherently make any way of doing it ambiguous.

order() is a different kettle of fish, as it can simultaneously sort on many
columns; so "surprise" that it takes longer seems inappropriate (but perhaps
I shouldn't say that, as my comment presumes insight into the psychology of
"surprise" that I don't have).

-- Bert

-----Original Message-----
From: macrakis at gmail.com [mailto:macrakis at gmail.com] On Behalf Of Stavros
Macrakis
Sent: Tuesday, February 24, 2009 2:12 PM
To: Peterko
Cc: Bert Gunter; r-help at r-project.org
Subject: Re: [R] bigest part of vector

On Tue, Feb 24, 2009 at 3:01 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> Nothing wrong with prior suggestions, but strictly speaking, (fully)
sorting
> the vector is unnecessary.
>
> y[y > quantile(y, 1- p/length(y))]
>
> will do it without the (complete) sort. (But sorting is so efficient
anyway,
> I don't think you could notice any difference).

R uses an efficient quantile calculation, so it is significantly
faster for large data sets:

> big <- rnorm(1e7)
> system.time(res<-big[big>=quantile(big,(length(big)-1)/length(big))])
   user  system elapsed
   0.56    0.14    0.70
> system.time(res<-big[big>=quantile(big,(length(big)-100)/length(big))])
   user  system elapsed
   0.75    0.10    0.84
> system.time(res<-big[big>=quantile(big,(length(big)-10000)/length(big))])
   user  system elapsed
   0.61    0.08    0.68
> system.time(res<-big[big>=quantile(big,1/2)])
   user  system elapsed
   1.08    0.08    1.17

> system.time(res<-sort(big))
   user  system elapsed
   4.67    0.03    4.72
> system.time(res<-sort(big)[round(length(big)/2):length(big)])
   user  system elapsed
   4.71    0.10    4.82

Surprisingly, perhaps, "order" is much slower than "sort":

> big <- rnorm(1e7)
> system.time(res<-order(big))
   user  system elapsed
  21.07    0.05   21.14

And you do need to be careful about your handling of ties:

> test <- c(1,2,3,4,4,4)
> test[test>=quantile(test,5/6)]
[1] 4 4 4
> test[test>=quantile(test,6/6)]
[1] 4 4 4

Hope this helps.

        -s