[R] Find number of elements less than some number: Elegant/fastsolution needed

Ray Brownrigg Ray.Brownrigg at ecs.vuw.ac.nz
Fri Apr 15 04:45:27 CEST 2011


Ravi:

Well, that's a challenge!

Here is a solution which is faster than Bill's as long as length(y) << length(x).

f3 <- function(x, y) length(y)*ecdf(y)(x)

Then using your test on my workstation, I get for f2():
> tt
[1]  0.761  1.071  1.329  1.565  1.894  3.865 11.824

and for f3():
> tt
[1]  0.717  0.882  1.010  1.161  1.389  3.080 15.530

I suspect ecdf() is a little more intuitive than findInterval().

Hope this helps,
Ray Brownrigg

On Fri, 15 Apr 2011, Ravi Varadhan wrote:
> Bill's code is insanely fast!
> 
> f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
> 
>  n1 <- 1e07
> n2 <- 10^c(1,2,3,4,5,6,7)
> tt <- rep(NA, 7)
> x <- rnorm(n1)
> for (i in 1:length(n2)){
> y <- runif(n2[i])
> tt[i] <- system.time(a1 <- f2(x, y))[3]
> }
> 
> > tt
> 
> [1]  0.70  0.86  1.03  1.28  1.54  4.99 12.07
> 
> 
> I would be surprised if this can be beaten even in C/C++.
> 
> Ravi.
> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf
> Of Douglas Bates [bates at stat.wisc.edu] Sent: Thursday, April 14, 2011 6:22
> PM
> To: William Dunlap
> Cc: r-help at r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel
> Subject: Re: [R] Find number of elements less than some number:
> Elegant/fastsolution needed
> 
> My colleague Sunduz Keles once mentioned a similar problem to me.  She
> had a large sample from a reference distribution and a test sample
> (both real-valued in her case) and she wanted, for each element of the
> test sample, the proportion of the reference sample that was less than
> the element.  It's a type of empirical p-value calculation.
> 
> I forget the exact sizes of the samples (do you remember, Sunduz?) but
> they were in the tens of thousands or larger.  Solutions in R tended
> to involve comparing every element in the test sample to every element
> in the reference sample but, of course, that is unnecessary.  If you
> sort both samples then you can start the comparisons for a particular
> element of the test sample at the element of the reference sample
> where the last comparison failed.
> 
> I was able to write a very short C++ function using the Rcpp package
> that provided about a 1000-fold increase in speed relative to the best
> I could do in R.  I don't have the script on this computer so I will
> post it tomorrow when I am back on the computer at the office.
> 
> Apologies for cross-posting to the Rcpp-devel list but I am doing so
> because this might make a good example of the usefulness of Rcpp and
> inline.
> 
> On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
> >> Sent: Thursday, April 14, 2011 12:35 PM
> >> To: r-help at r-project.org
> >> Subject: [R] Find number of elements less than some number:
> >> Elegant/fastsolution needed
> >> 
> >> Take vector x and a subset y:
> >> 
> >> x=1:10
> >> 
> >> y=c(4,5,7,9)
> >> 
> >> For each value in 'x', I want to know how many elements in
> >> 'y' are less than 'x'.
> >> 
> >> An example would be:
> >> 
> >> sapply(x,FUN=function(i) {length(which(y<i))})
> >> 
> >>  [1] 0 0 0 0 1 2 2 3 3 4
> >> 
> >> But this solution is far too slow when x and y have lengths
> >> in the millions.
> >> 
> >> I'm certain an elegant (and computationally efficient)
> >> solution exists, but I'm in the weeds at this point.
> > 
> > If x is sorted findInterval(x, y) would do it for <= but you
> > want strict <.  Try
> > 
> >  f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
> > 
> > where your version is
> > 
> >  f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
> > 
> > E.g.,
> > 
> >  > x <- sort(sample(1e6, size=1e5))
> >  > y <- sample(1e6, size=1e4, replace=TRUE)
> >  > system.time(r0 <- f0(x,y))
> >  > 
> >     user  system elapsed
> >    
> >    7.900   0.310   8.211
> >    
> >  > system.time(r2 <- f2(x,y))
> >  > 
> >     user  system elapsed
> >    
> >    0.000   0.000   0.004
> >    
> >  > identical(r0, r2)
> >  
> >  [1] TRUE
> > 
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> > 
> >> Any help is much appreciated.
> >> 
> >> Kevin
> >> 
> >> University of Manchester
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> Take two vectors x and y, where y is a subset of x:
> >> 
> >> x=1:10
> >> 
> >> y=c(2,5,6,9)
> >> 
> >> If y is removed from x, the original x values now have a new
> >> placement (index) in the resulting vector (new):
> >> 
> >> new=x[-y]
> >> 
> >> index=1:length(new)
> >> 
> >> The challenge is: How can I *quickly* and *efficiently*
> >> deduce the new 'index' value directly from the original 'x'
> >> value -- using only 'y' as an input?
> >> 
> >> In practice, I have very large matrices containing the 'x'
> >> values, and I need to convert them to the corresponding
> >> 'index' if the 'y' values are removed.
> >> 
> >>       [[alternative HTML version deleted]]



More information about the R-help mailing list