[Rd] choose(n,k) when n is almost integer

Sat Dec 19 22:02:49 CET 2009

{I've changed the subject;  it's really no longer about
 lchoose()'s definition}

>>>>> Petr Savicky <savicky at cs.cas.cz>
>>>>>     on Fri, 18 Dec 2009 00:14:49 +0100 writes:

    > On Thu, Dec 17, 2009 at 03:10:49PM +0100, Martin Maechler wrote:
    > [...]
    MM> This, of course, is an even more compelling reason to implement
    MM> the change of return  log(abs(choose(.,.)),
    MM> and at the moment, I'd even plan to "backport" that to R "2.10.1
    MM> patched", as the current behavior is clearly bugous.
    >> 
    >> This has now happened;
    >> svn revisions  50775 and 50776  {R-devel and R-patched}.

    > Thank you.

    > When preparing a test, whether lchoose(n, k) \approx log(abs(choose(n, k))), it
    > appeared that there is a minor problem also in choose(n, k), when n is very close
    > to k, but not equal.

    > options(digits=15)
    > n <- 5 + (-15:15)*1e-8
    > cbind(n, choose(n, 5))
    > #                n                  
    > #  [1,] 4.99999985 0.999999657500042
    > #  [2,] 4.99999986 0.999999680333370
    > #  [3,] 4.99999987 0.999999703166698
    > #  [4,] 4.99999988 0.999999726000027
    > #  [5,] 4.99999989 0.999999748833356
    > #  [6,] 4.99999990 0.999999771666685
    > #  [7,] 4.99999991 0.000000000000000
    > #  [8,] 4.99999992 0.000000000000000
    > #  [9,] 4.99999993 0.000000000000000
    > # [10,] 4.99999994 0.000000000000000
    > # [11,] 4.99999995 0.000000000000000
    > # [12,] 4.99999996 0.000000000000000
    > # [13,] 4.99999997 0.000000000000000
    > # [14,] 4.99999998 0.000000000000000
    > # [15,] 4.99999999 0.000000000000000
    > # [16,] 5.00000000 1.000000000000000
    > # [17,] 5.00000001 5.000000000000000
    > # [18,] 5.00000002 5.000000000000000
    > # [19,] 5.00000003 5.000000000000000
    > # [20,] 5.00000004 5.000000000000000
    > # [21,] 5.00000005 5.000000000000000
    > # [22,] 5.00000006 5.000000000000000
    > # [23,] 5.00000007 5.000000000000000
    > # [24,] 5.00000008 5.000000000000000
    > # [25,] 5.00000009 5.000000000000000
    > # [26,] 5.00000010 1.000000228333353
    > # [27,] 5.00000011 1.000000251166690
    > # [28,] 5.00000012 1.000000274000027
    > # [29,] 5.00000013 1.000000296833365
    > # [30,] 5.00000014 1.000000319666704
    > # [31,] 5.00000015 1.000000342500042

    > All the values in the second column should be close to 1, but some are 0 and
    > some are 5. The reason seems to be in the line 112 of src/nmath/choose.c,
    > which is
    > if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */
    > If n is not exactly an integer, then k becomes also non-integer. 
    > Since the code relies on k being an exact integer, we get an error
    > as follows.

    > If n = k - eps for a small positive eps, then the next line 113
    > if (k <  0) return 0.;
    > determines the output since k is now - eps.

    > If n = k + eps for a small positive eps, then the line 116
    > r = n;
    > determines the output, since now k = eps, so the condition j <= k is not 
    > satisfied already at the beginning of the next cycle.

    > I suggest two solutions, a more conservative one and a less conservative one.

    > A more conservative solution is to replace the line 112 by
    > if(n-k < k && n >= 0 && R_IS_INT(n)) k = floor(n - k + 0.5); /* <- Symmetry */
    > This yields the following in the same test as above.

    > #                n                  
    > #  [1,] 4.99999985 0.999999657500042
    > #  [2,] 4.99999986 0.999999680333370
    > #  [3,] 4.99999987 0.999999703166698
    > #  [4,] 4.99999988 0.999999726000027
    > #  [5,] 4.99999989 0.999999748833356
    > #  [6,] 4.99999990 0.999999771666685
    > #  [7,] 4.99999991 1.000000000000000
    > #  [8,] 4.99999992 1.000000000000000
    > #  [9,] 4.99999993 1.000000000000000
    > # [10,] 4.99999994 1.000000000000000
    > # [11,] 4.99999995 1.000000000000000
    > # [12,] 4.99999996 1.000000000000000
    > # [13,] 4.99999997 1.000000000000000
    > # [14,] 4.99999998 1.000000000000000
    > # [15,] 4.99999999 1.000000000000000
    > # [16,] 5.00000000 1.000000000000000
    > # [17,] 5.00000001 1.000000000000000
    > # [18,] 5.00000002 1.000000000000000
    > # [19,] 5.00000003 1.000000000000000
    > # [20,] 5.00000004 1.000000000000000
    > # [21,] 5.00000005 1.000000000000000
    > # [22,] 5.00000006 1.000000000000000
    > # [23,] 5.00000007 1.000000000000000
    > # [24,] 5.00000008 1.000000000000000
    > # [25,] 5.00000009 1.000000000000000
    > # [26,] 5.00000010 1.000000228333353
    > # [27,] 5.00000011 1.000000251166690
    > # [28,] 5.00000012 1.000000274000027
    > # [29,] 5.00000013 1.000000296833365
    > # [30,] 5.00000014 1.000000319666704
    > # [31,] 5.00000015 1.000000342500042

    > So, the extreme values 0 and 5 do not occur, but there is an interval
    > of constant values and on both ends of this interval, there is a jump
    > much larger than machine accuracy.

    > A less conservative solution (a patch is in an attachment) replaces
    > lines 110 - 121 by

    > if (k < k_small_max) {
    > int j;
    > if(n-k < k && n >= 0 && (n == floor(n))) k = n-k; /* <- Symmetry */
    > if (k <  0) return 0.;
    > if (k == 0) return 1.;
    > /* else: k >= 1 */
    > r = n;
    > for(j = 2; j <= k; j++) {
    > r *= (n-j+1);
    > r /= j;
    > }
    > return r;
    > }

    > Here, the symetry is used only for exact integers. The symetry is
    > valid only for integers, since choose(n, k) is a polynomial of
    > degree k. So, for different values of k, we get a different functions
    > on any interval of positive length.

    > The division at the line
    > r /= j;
    > always produces an exact integer, since during the whole cycle, r is always
    > either an integer binomial coefficient or its integer multiple. So, the
    > division has an integer result and there is no rounding error unless the
    > intermediate result does not fit into 53 bit precison.

    > Using the above code, we get

    > options(digits=15)
    > n <- 5 + (-15:15)*1e-8
    > cbind(n, choose(n, 5))
    > #                n                  
    > #  [1,] 4.99999985 0.999999657500042
    > #  [2,] 4.99999986 0.999999680333370
    > #  [3,] 4.99999987 0.999999703166698
    > #  [4,] 4.99999988 0.999999726000027
    > #  [5,] 4.99999989 0.999999748833356
    > #  [6,] 4.99999990 0.999999771666685
    > #  [7,] 4.99999991 0.999999794500014
    > #  [8,] 4.99999992 0.999999817333345
    > #  [9,] 4.99999993 0.999999840166677
    > # [10,] 4.99999994 0.999999863000008
    > # [11,] 4.99999995 0.999999885833339
    > # [12,] 4.99999996 0.999999908666670
    > # [13,] 4.99999997 0.999999931500002
    > # [14,] 4.99999998 0.999999954333335
    > # [15,] 4.99999999 0.999999977166667
    > # [16,] 5.00000000 1.000000000000000
    > # [17,] 5.00000001 1.000000022833334
    > # [18,] 5.00000002 1.000000045666667
    > # [19,] 5.00000003 1.000000068500001
    > # [20,] 5.00000004 1.000000091333336
    > # [21,] 5.00000005 1.000000114166671
    > # [22,] 5.00000006 1.000000137000006
    > # [23,] 5.00000007 1.000000159833342
    > # [24,] 5.00000008 1.000000182666680
    > # [25,] 5.00000009 1.000000205500016
    > # [26,] 5.00000010 1.000000228333353
    > # [27,] 5.00000011 1.000000251166690
    > # [28,] 5.00000012 1.000000274000027
    > # [29,] 5.00000013 1.000000296833365
    > # [30,] 5.00000014 1.000000319666704
    > # [31,] 5.00000015 1.000000342500042

    > Within the chosen accuracy 1e-8, there is no visible jump.
    > With 1e-15, we get

    > options(digits=15)
    > n <- 5 + (-15:15)*1e-15
    > cbind(n, choose(n, 5))
    > #                      n                  
    > #  [1,] 4.99999999999998 0.999999999999966
    > #  [2,] 4.99999999999999 0.999999999999967
    > #  [3,] 4.99999999999999 0.999999999999970
    > #  [4,] 4.99999999999999 0.999999999999972
    > #  [5,] 4.99999999999999 0.999999999999976
    > #  [6,] 4.99999999999999 0.999999999999978
    > #  [7,] 4.99999999999999 0.999999999999980
    > #  [8,] 4.99999999999999 0.999999999999982
    > #  [9,] 4.99999999999999 0.999999999999984
    > # [10,] 4.99999999999999 0.999999999999986
    > # [11,] 4.99999999999999 0.999999999999988
    > # [12,] 5.00000000000000 0.999999999999990
    > # [13,] 5.00000000000000 0.999999999999994
    > # [14,] 5.00000000000000 0.999999999999996
    > # [15,] 5.00000000000000 0.999999999999998
    > # [16,] 5.00000000000000 1.000000000000000
    > # [17,] 5.00000000000000 1.000000000000002
    > # [18,] 5.00000000000000 1.000000000000004
    > # [19,] 5.00000000000000 1.000000000000006
    > # [20,] 5.00000000000000 1.000000000000010
    > # [21,] 5.00000000000001 1.000000000000012
    > # [22,] 5.00000000000001 1.000000000000014
    > # [23,] 5.00000000000001 1.000000000000016
    > # [24,] 5.00000000000001 1.000000000000018
    > # [25,] 5.00000000000001 1.000000000000020
    > # [26,] 5.00000000000001 1.000000000000022
    > # [27,] 5.00000000000001 1.000000000000024
    > # [28,] 5.00000000000001 1.000000000000028
    > # [29,] 5.00000000000001 1.000000000000031
    > # [30,] 5.00000000000001 1.000000000000032
    > # [31,] 5.00000000000002 1.000000000000035

    > So, the jump in the exactly integer value n[16] == 5 is comparable
    > to machine accuracy.

    > I would be pleased to provide more tests, if some of the above solutions
    > or their modifications can be accepted.

    > Petr Savicky.

    [.......................]

Thanks a lot, Petr, for your explorations.
I agree that at least something like your conservative change
should happen.

Personally, I'd agree with you and prefer the "progressive"
(non-conservative) change,
basically getting rid of  all  R_IS_INT(n)
tests, which seem kludgey.

OTOH, the R core members who did start using R_IS_INT(n)
must have had good reasons with another set examples.

Have you tried 'make check-devel' (or 'check-all') also with the 
progressive change?

I'll hope to get back to this on Monday.
Martin