[Rd] Problems in d-p-q-r functions (was PR#8528, but unrelated)
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Feb 7 09:52:43 CET 2006
For the record, some of these claims are untrue:
> df(0, 2, 2)
[1] 1
> df(0, 1.3, 2)
[1] Inf
> x <- 1e-170
> pbeta(x, x, x)
[1] 0.5
qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is
qnbinom(1E-300,0.000002,10000000000)
Here we can guess at what you meant, maybe correctly. There were comments
in the source code about needing a better search, and I have recently
implemented one. So
> qnbinom(1e-10, 1e3, 1e-7) # instant
[1] 8117986721
> qnbinom(0.5, 10000000000, 0.000000002)
[1] 5e+18
> qnbinom(1e-300, 10000000000, 0.000002)
[1] 4.998138e+15
seem to be solved.
There were two problems with dbeta, one easily overcome (f underflows) the
other fundamental to the way dbinom_raw is computed (n*p can underflow).
What I cannot see is why a formula which worked correctly in this region
was replaced by one that did not. It is precisely in order not to
generate such errors that I used TOMS 708 only in the area where the
existing algorithm is problematic. (It may be better elswehere, but I did
not have the time to do the requisite analysis. It seems neither do some
other people: I would prefer not to spend the time to clear up after such
unneeded changes.)
On pt, thank you for the report. pt(x, df=1) is not interesting for
|x| > 1e150, but it is for smaller values of df and those were
underflowing. It is easy to use an asymptotic formula to regain the
accuracy.
R was potentially generating reports of lack of convergence and loss of
accuracy in quite a number of its algorithms, but for reasons unknown to
me these were being buried (ML_ERROR did nothing, and has not for a very
long time). It's a matter of debate whether in some of these it would be
better to return NaN as well, but warnings should have been generated (and
now are).
As for `panic' (your word: why is it 'panic' to submit a correct bug
report?), a major platform returning +Inf for a log probability is very
bad news, as is another failing a regression test by getting NaN for a
probability which is 0.5.
On Sat, 28 Jan 2006 IandJMSmith at aol.com wrote:
> On 23/02/05 I suggested that given R had included TOMS 708 to correct for t=
> he=20
> poor performance of pbeta, TOMS 654 should be included to fix all the pgamm=
> a=20
> problems. I was slightly surprised to find Morten's code had been included=
> =20
> instead 2 days later. I noticed but did not worry that the reference to me =
> had=20
> been removed.=20
>
> The derivation of the asymptotic expansion for the gamma distribution used =
> by=20
> Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm=
> =20
> It is fairly easy to understand and find error bounds for and hence include=
> =20
> sensibly in an algorithm to calculate pgamma.
>
> The basis and accuracy of the some of the algorithms I use is discussed in=
> =20
> http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute =
> error=20
> in the log of the probability gives a good indication of the accuracy of yo=
> ur=20
> answer. In the least extreme example you consider=20
> (pgamma(0.9*1e25,1e25,log=3DT)the absolute error would be about 5360515 and=
> if you exponentiated the result=20
> the relative error would be about 10 to the power 2328042. So the answer yo=
> u=20
> wish to calculate is K times 10 to the power -2.32804237034994E+22, where K=
> is=20
> somewhere between 10 to the power plus or minus 2328042. In other words whe=
> n=20
> you make the changes to correct this problem, your calculation will still=
> =20
> return values with no real meaning but at least users might be aware of thi=
> s which=20
> would be no bad thing! For me this answer is possibly so meaningless that N=
> an=20
> would be preferable.
>
> I did mention to Morten that I had updated my code but I believed that for=
> =20
> Gnumeric he was quite satisfied with what he had. If you look at the VBA co=
> de at=20
> http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I=
> =20
> made to stop the overflow problems you seem to be worried about. My code fo=
> r the=20
> pdf of the gamma distribution still fails for shape parameters > 2e307 due =
> to=20
> multiplication of the shap parameter by 2pi. The code for dgamma will have =
> the=20
> same problem unless it is hidden by use of an 80 or more bit floating point=
> =20
> processor. The code for the asymptotic expansion for the gamma distribution=
> =20
> seems to be fine for any number, excluding silly ones like Nan and Inf. Ind=
> eed it=20
> takes the difference from the mean as a parameter and if you supply an=20
> accurate value you get a sensible answer as mentioned in=20
> http://members.aol.com/iandjmsmith/Accuracy.htm
>
> I do not share your apparent sense of panic on this matter. I have no=20
> problems with error signals like NaNs because it is obvious to the user tha=
> t things=20
> have gone wrong. Inaccurate answers when the user has no reason to expect t=
> hem=20
> are usually far more difficult to spot and in many cases the results are ju=
> st=20
> believed. That for me is a serious problem. I think you will find that the=
> =20
> pgamma code of 2.0.0 did not work for small shape parameters (similar to th=
> e=20
> problems exhibited by pbeta still for small parameters see PR#2894), was=20
> inaccurate for large shape parameters (> 1e5) when it resorted to the norma=
> l=20
> approximation and was pretty slow in between. Indeed, the normal approximat=
> ion was the=20
> cause of PR#7307.
>
>
> I don't understand your comments about=20
> "pt_ =3D -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) =3D -x * log((lambda-x=
> )/x) -=20
> (lambda-x)=20
> and naively assumes that this is small enough to use a power series expansi=
> on=20
> in 1/x with coefficients as powers of pt_. To make matters worse, consider =
> =E2=80=A6"
> In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't thin=
> k=20
> it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If =
> |x|=20
> < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses=20
> log1p(1+x)-x and for other values it uses a continued fraction which essent=
> ially=20
> evaluates more of the same series used when |x| < .01.
>
> Your comments about replacing logspace_add with logspace_sub with simpler=
> =20
> code which works at first sight to be a very sensible improvement. However,=
> I=20
> would be a bit nervous that lnd-lnp could be very large and the exp of it c=
> ould=20
> return infinity. I'm sure this can be accounted for in the code and lnp +=
> =20
> log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly.
>
> I am not responsible for the code for calculating the logs of probabilities=
> =20
> but I seem to remember that the extremely poor performance of the algorithm=
> s in=20
> R2.0.0 with logged probabilities was one of the reasons Morten became=20
> interested in changing the pgamma code (see PR#7307). I have had a quick lo=
> ok and=20
> once the corections mentioned above are made it should be giving nonsense a=
> nswers=20
> with no difficulty.
>
>
> Unfortunately there are still a few examples of sloppy coding and accuracy=
> =20
> errors remaining in R.
>
> The non-central distribution functions have horrible 1- cancellation errors=
> =20
> associated with them (see PR#7099) and separate code is required for the tw=
> o=20
> tails of the distributions to get round the problem.
>
> The fix for PR#8251 is a kludge and just moves the inaccuracies to examples=
> =20
> with higher non-centrality parameters.
>
> pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit=20
> implementations. pcauchy works but I believe the pt function is also suppos=
> ed to work for=20
> non integral degrees of freedom so making it work one degree of freedom via=
> =20
> pcauchy is hardly much use.
>
> qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the
parameters,=
> =20
> qnbinom can be made very slow indeed. I do not think there is anything wron=
> g with=20
> the Cornish-Fisher expansion. It just seems that it is not always very good=
> =20
> for the Negative Binomial distribution. In the example above, the initial=
> =20
> approximation is out by 2e6.
>
> A slightly different problem which can cause qnbinom and qbinom to go into=
> =20
> infinite loops is when the q-value is too big. For example=20
> qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 app=
> rox but the code works=20
> with values where one of the statements y :=3D y +1 or y =3D y - 1; is exec=
> uted but=20
> does not alter the value of y.
>
> df(0,2,2,FALSE) should be 1 not 0
> df(0,df,2,FALSE) should be infinity for df < 2 not 0
> dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0
>
> Presumably R also has similar problems with the pbeta function. As I recall=
> =20
> the TOMS 708 code was pretty much included without edits and therefore didn=
> 't=20
> calculate logs of probabilities except by calculating the probability and t=
> hen=20
> logging it. I assumed this was why it was not used for small shape paramete=
> rs=20
> where the current code does not work, although it did not seem logical to m=
> e.=20
> Of course, my memory is not what it was but if that is the case and there a=
> re=20
> problems with modifying the TOMS code, you could try an
asymptotic expansio=
> n=20
> based on http://members.aol.com/iandjmsmith/BinomialApprox.htm
>
> This response has been very rushed. I do not write well when I have plenty =
> of=20
> time and I felt I had so many different things to say so I apologise if it =
> is=20
> all a bit of a jumble.=20
>
> Ian Smith
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel
mailing list