R: Find Interval Numbers or Indices

findInterval {base}

R Documentation

Find Interval Numbers or Indices

Description

Given a vector of non-decreasing breakpoints in vec, find the interval containing each element of x; i.e., if i <- findInterval(x,v), then for each index j in x, v_{i_j} \le x_j < v_{i_j + 1}, where v_0 := -\infty, v_{N+1} := +\infty, and N <- length(v). At the two boundaries, the returned index may differ by 1, depending on the optional arguments rightmost.closed and all.inside.
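
The defining inequality can be checked directly. The following is a minimal sketch, not part of the original examples: the helper vector vv (an illustrative name) pads the breakpoints with -Inf and +Inf so that vv[k + 1] plays the role of v_k for k = 0, ..., N+1.

```r
v <- c(5, 10, 15)
x <- c(2, 5, 7.5, 10, 14.9, 15, 18)

i <- findInterval(x, v)

# Pad the breakpoints: vv[k + 1] corresponds to v_k, with v_0 = -Inf
# and v_{N+1} = +Inf.
vv <- c(-Inf, v, Inf)

lower <- vv[i + 1]   # v_{i_j}
upper <- vv[i + 2]   # v_{i_j + 1}

# Each x_j lies in the half-open interval [v_{i_j}, v_{i_j + 1}).
stopifnot(all(lower <= x & x < upper))
```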

Usage

findInterval(x, vec, rightmost.closed = FALSE, all.inside = FALSE,
             left.open = FALSE, checkSorted = TRUE, checkNA = TRUE)

Arguments

x

numeric.

vec

numeric, sorted (weakly) increasingly, of length N, say.

rightmost.closed

logical; if true, the rightmost interval, vec[N-1] .. vec[N], is treated as closed; see below.

all.inside

logical; if true, the returned indices are coerced into 1,...,N-1, i.e., 0 is mapped to 1 and N to N-1.

left.open

logical; if true, all the intervals are open at the left and closed at the right; in the formulas below, \le should be swapped with < (and > with \ge), and rightmost.closed means ‘leftmost is closed’. This may be useful, e.g., in survival analysis computations.

checkSorted

logical indicating if vec should be checked, i.e., is.unsorted(vec) is asserted to be false. Setting this to FALSE skips the check, gaining speed, but may return nonsense results if vec is not sorted.

checkNA

logical indicating if each x[i] should be checked as with is.na(.). Setting this to FALSE when x[] contains NAs may result in platform-dependent nonsense.
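
To make the boundary arguments concrete, here is a small sketch comparing the default with rightmost.closed, all.inside, and left.open on the same input. The data and the variable names are chosen purely for illustration.

```r
v <- c(5, 10, 15)            # N = 3 breakpoints
x <- c(3, 5, 10, 15, 20)

default   <- findInterval(x, v)                           # 0 1 2 3 3
closed    <- findInterval(x, v, rightmost.closed = TRUE)  # 0 1 2 2 3
inside    <- findInterval(x, v, all.inside = TRUE)        # 1 1 2 2 2
left_open <- findInterval(x, v, left.open = TRUE)         # 0 0 1 2 3

print(cbind(x, default, closed, inside, left_open))
```

Note that rightmost.closed only changes the result for x equal to max(vec), whereas all.inside also remaps values lying outside range(vec).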

Details

The function findInterval finds, for each element of x, its interval index in another vector, vec, where the latter must be non-decreasing. While this is trivial in principle, being equivalent to apply( outer(x, vec, `>=`), 1, sum), the internal algorithm uses interval search, ensuring O(n \log N) complexity where n <- length(x) (and N <- length(vec)). For (almost) sorted x, it will be even faster, basically O(n).
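
The equivalence with the naive outer-product formulation can be verified on a small input. This is only a sketch with made-up data; the naive version needs O(n N) time and memory, so it is useful as a check rather than as a replacement.

```r
x   <- c(1.5, 3.2, 0.5, 10, 3)
vec <- c(1, 2, 3, 4)

# Naive approach: for each x value, count the breakpoints that are <= it.
naive <- apply(outer(x, vec, `>=`), 1, sum)

# Interval search as performed by findInterval().
fast <- findInterval(x, vec)

stopifnot(all(naive == fast))
```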

This is the same computation as for the empirical distribution function, and indeed, findInterval(t, sort(X)) is identical to n F_n(t; X_1,\dots,X_n) where F_n is the empirical distribution function of X_1,\dots,X_n.
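
The relation to the empirical distribution function can be illustrated with stats::ecdf. The sample X and the evaluation points tt below are arbitrary illustrative choices.

```r
set.seed(1)
X  <- rnorm(20)
tt <- c(-3, -0.5, 0, 0.5, 3)
n  <- length(X)

# Number of observations <= t, for each t in tt.
counts <- findInterval(tt, sort(X))

# The empirical distribution function of X, evaluated at tt.
Fn <- stats::ecdf(X)

# n * F_n(t) and the counts agree (up to floating point representation).
stopifnot(all.equal(counts / n, Fn(tt)))
```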

When rightmost.closed = TRUE, the result for x[j] = vec[N] (= \max vec) is N - 1, as for all other values in the last interval.
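
One situation where this matters is histogram-like binning, where a value equal to the largest breakpoint should fall into the last bin. The following sketch assumes hypothetical breaks and data and uses tabulate() to count the values per bin.

```r
breaks <- c(0, 10, 20, 30)        # three bins: [0,10), [10,20), [20,30]
x <- c(0, 5, 10, 29.9, 30)

# With the default, x = 30 gets index N = 4, i.e. beyond the last bin.
open_idx <- findInterval(x, breaks)                           # 1 1 2 3 4

# With rightmost.closed = TRUE, x = 30 falls into the last bin.
bin <- findInterval(x, breaks, rightmost.closed = TRUE)       # 1 1 2 3 3

# Count the values per bin.
counts <- tabulate(bin, nbins = length(breaks) - 1)           # 2 1 2
```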

left.open = TRUE is occasionally useful, e.g., for survival data. For (anti-)symmetry reasons, it is equivalent to using “mirrored” data, i.e., the following is always true:

    identical(
          findInterval( x,  v,      left.open= TRUE, ...) ,
      N - findInterval(-x, -v[N:1], left.open=FALSE, ...) )

where N <- length(vec) as above.

Value

vector of length length(x) with values in 0:N (and NA) where N <- length(vec), or values coerced to 1:(N-1) if and only if all.inside = TRUE (equivalently coercing all x values inside the intervals). Note that NAs are propagated from x, and Inf values are allowed in both x and vec.
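
A short sketch of the possible return values, including NA propagation and infinite inputs; the vectors here are illustrative only.

```r
v <- c(0, 1, 2)                   # N = 3
x <- c(-Inf, NA, 0.5, Inf)

res        <- findInterval(x, v)                     # 0 NA 1 3
res_inside <- findInterval(x, v, all.inside = TRUE)  # 1 NA 1 2

# Infinite breakpoints are allowed as well.
v_inf   <- c(-Inf, 0, Inf)
res_inf <- findInterval(c(-1, 1), v_inf)             # 1 2
```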

Author(s)

Martin Maechler

Examples

x <- 2:18
v <- c(5, 10, 15) # create two bins [5,10) and [10,15)
cbind(x, findInterval(x, v))

N <- 100
X <- sort(round(stats::rt(N, df = 2), 2))
tt <- c(-100, seq(-2, 2, length.out = 201), +100)
it <- findInterval(tt, X)
tt[it < 1 | it >= N] # only first and last are outside range(X)
## Suppressing the checks is faster *BUT* dangerous, unless
## you *know* that X is sorted and tt contains no NAs:
stopifnot(identical(it,
                    findInterval(tt, X, checkSorted=FALSE, checkNA=FALSE)))

##  'left.open = TRUE' means  "mirroring" :
N <- length(v)
stopifnot(identical(
                  findInterval( x,  v,  left.open=TRUE) ,
              N - findInterval(-x, -v[N:1])))

[Package base version 4.5.0 Index]