[Rd] choose(n,k) when n is almost integer

Wed Feb 3 10:47:31 CET 2010

On Tue, Feb 02, 2010 at 01:37:46PM +0100, Petr Savicky wrote:
> I would like to add some more information concerning the patch C
> to the function choose() proposed in the email
>   https://stat.ethz.ch/pipermail/r-devel/2009-December/056177.html
> 
> The patch uses transformations of choose(n, k), which are described in
>   http://www.cs.cas.cz/~savicky/R-devel/formulas_choose.pdf
> 
> The accuracy of the modified function choose(n, k) may be verified
> on randomly generated examples using Rmpfr package
>   http://cran.at.r-project.org/web/packages/Rmpfr/index.html
> and the script
>   http://www.cs.cas.cz/~savicky/R-devel/test_choose_2.R

Let me add an explanation of the script test_choose_2.R.

The script generates a vector of random real numbers n, which are from
a continuous distribution, but concentrate near integer values. The
original implementation of choose(m, k) considers m as an integer, if
abs(m - round(m)) <= 1e-7. The vector n is generated so that the
probability of abs(n[i] - round(n[i])) <= 1e-7 is approximately 0.1.
The distribution of n[i] - round(n[i]) is symmetric around 0, so we
get both n[i], which are close to an integer from below and from above.
On the other hand, the probability of abs(n[i] - round(n[i])) >= 0.3 is
approximately 0.1083404, so there are also numbers n[i], which are not
close to an integer.

The script calculates choose(n, k) for k in 0:209 (an ad hoc upper bound)
1. using the modified choose(n, k) from patch C
2. using the expression n(n-1)...(n-k+1)/(1 2 ... k) with Rmpfr numbers
   of precision 100 bits.
The relative difference of these two results is computed and its maximum
over all n[i] and k from 0 to a given bound is reported. The bounds on k are
chosen to be the numbers, whose last digit is 9, since the algorithm in choose(n,k)
is different for k <= 29 and k >= 30.

An upper bound of the relative rounding error of a single operation with
Rmpfr numbers of precision 100 bits is (1 + 2^-100). Hence, an upper bound
on the total relative error of n(n-1)...(n-k+1)/(1 2 ... k) is
(1 + 2^-100)^(2*209) \approx (1 + 2 * 209 * 2^-100) \approx 1 + 3.297e-28.
This is a negligible error compared to the errors reported by test_choose_2.R,
so the reported errors are the errors of the patched choose(n, k).

The errors reported by test_choose_2.R with patched choose(n,k) are in
a previous email.

Running test_choose_2.R with unpatched R version 2.11.0 (2010-02-01 r51089)
produces larger errors.

  > source("test_choose_2.R")
  k <= 9  max rel err = Inf 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = 0.1111111 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = Inf 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = 8.383718e-08 
  k <= 19  max rel err = 1.226306e-07 
  k <= 29  max rel err = 1.469050e-07 
  Error: segfault from C stack overflow

The Inf relative errors occur in cases, where choose(n, k) calculates 0,
but the correct result is not 0.

The stack overflow error is sometimes generated due to an infinite sequence
of transformations 
  choose(n, k) -> choose(n, n-k) -> choose(n, round(n-k))
which occur if k = 30 and n = 60 - eps. The reason for the transformation
  choose(n, k) -> choose(n, n-k)
is that
  k >= k_small_max = 30
  n is treated as an integer in R_IS_INT(n)
  n-k < k_small_max
So, choose(n, n-k) is called. There, we determine that n-k is almost an integer and
since n-k is the second argument of choose(n,k), it is explicitly rounded to an integer.
Since n = 2*k - eps, we have round(n-k) = round(k - eps) = k. The result is that
we again call choose(n, k) and this repeats until the stack overflow.

For example
  > choose(60 - 1e-9, 30)
  Error: segfault from C stack overflow

Besides patch C, which corrects this stack overflow, also the previous
patches A, B from 
  https://stat.ethz.ch/pipermail/r-devel/2009-December/056154.html
correct this, but have lower accuracy.

Petr Savicky.