[Rd] Inconsistent behavior of stats::bw.nrd() and stats::bw.nrd0()

Thu Feb 24 11:44:26 CET 2022

>>>>> Noah Greifer 
>>>>>     on Wed, 23 Feb 2022 11:21:18 -0500 writes:

    > Hello R-devel,

    > I noticed an inconsistency in stats::bw.nrd() and stats::bw.nrd0, two
    > functions used to compute the bandwidth for densities. According to the
    > documentation,

    > "bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a
    > Gaussian kernel density estimator. It defaults to 0.9 times the minimum of
    > the standard deviation and the interquartile range divided by 1.34 times
    > the sample size to the negative one-fifth power (= Silverman's ‘rule of
    > thumb’, Silverman (1986, page 48, eqn (3.31))) unless the quartiles
    > coincide when a positive result will be guaranteed.

    > bw.nrd is the more common variation given by Scott (1992), using factor
    > 1.06."

    > This implies the result of bw.nrd() should simply be 1.06/.9 times the
    > result of bw.nrd0(). However, these functions are coded quite differently
    > and, in particular, respond to situations where the data has an IQR of 0
    > differently. The source of bw.nrd0 is

    > function (x)
    > {
    >     if (length(x) < 2L)
    >         stop("need at least 2 data points")
    >     hi <- sd(x)
    >     if (!(lo <- min(hi, IQR(x)/1.34)))
    >         (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    >     0.9 * lo * length(x)^(-0.2)
    > }
    > 
    > and the source of bw.nrd is
    > 
    > function (x)
    > {
    >     if (length(x) < 2L)
    >         stop("need at least 2 data points")
    >     r <- quantile(x, c(0.25, 0.75))
    >     h <- (r[2L] - r[1L])/1.34
    >     1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
    > }
    > 

> Importantly, when the IQR of the input is 0, bw.nrd0() falls back onto the
> standard deviation, guaranteeing the positive result described in the
> documentation. Whereas, bw.nrd() produces a result of 0. I am not sure
> which result is more desirable, but it would seem to me that they should at
> least be consistent. See examples below:

> > x <- c(1,1,1,1,1000)
> > stats::bw.nrd(x)
> [1] 0
> > stats::bw.nrd0(x)
> [1] 291.4265

> Noah Greifer

This is for historical (and copyright?) reasons only/mostly:

At the time  R 1.0.0 was released (on Feb. 29, 2000 -- a date not
       	       	     	 	   existing in MS Windows 3.11)
there were no bw.*() functions
and density() already contained

    if (missing(bw))
      bw <-
        if(missing(width)) {
            hi <- sd(x)
            if(!(lo <- min(hi, IQR(x)/1.34)))# qnorm(.75) - qnorm(.25) = 1.34898
                (lo <- hi) || (lo <- abs(x[1])) || (lo <- 1.)
            adjust * 0.9 * lo * N^(-0.2)
        } else 0.25 * width

whereas in Nov 1999, it still was

    if (missing(bw))
	bw <-
	    if(missing(width))
		adjust * 0.9 * min(sd (x), IQR(x)/1.34) * N^(-0.2)
	    else 0.25 * width

(which actually *was* Silverman's rule of thumb): He, as Scott
 and all math-statisticians who ever published about the
 problem, was/were never interested in the fact that a good algorithm
 must even work in extreme cases ..)

As a matter of fact, svn trunk rev 6994 (Dec 11, 1999) was a
large "branch update" which changed the 'bw' computation to the
more robust one (never giving 0, BTW, even when sd(x) == 0, not
just IQR(.) == 0).

So, we had already made sure that the default bandwidth 'bw' was
really numerically robust and would always give a positive result.

Then, about 2 years later, with svn r16938, 2001-11-28
by Prof Brian Ripley with message  'add bandwidth-selectors to density()'
he brought in all (I think) of the current bw.*() choices
actually from the MASS book's S/R code by Venables & Ripley,
including the most traditional  bw.nrd() one, as they were
already described in the book;
and kept the default bw for density() as previously,
now modularized in the bw.nrd0() function.

Also note that the documentation (the help page) is more or less
precise here, since as I said above, Scott did neither consider
dealing with the not so uncommon case of IQR() == 0.

But you are right that the help page could be more precise here
and make sure that the robustness applies only to the nrd0, not
to nrd.
I'd  think we'd accept a patch proposal for the
src/library/stats/man/bandwidth.Rd  help file, making this more
unambigous.

Best,
Martin

Martin Maechler
ETH Zurich  and  R Core team