[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!

Wed May 8 18:36:08 CEST 2019

On May 8, 2019, at 05:46 EDT, Martin Maechler wrote:
> [ __ to R-help __ -- here diverted to R-devel on purpose]

Thank you very much. I posted it to R-help because I was not sure I would be able to post to R-devel (I will try it here).

> What we should *not* do, I think,  is to change the default behavior, even though 'na.rm=FALSE' *is* the default in many other R functions, including your example mean().

I completely agree that the default behavior should not change.

> ... it seems your "wishlist item" should be fulfillable without too much more effort.

Thank you so much for investing that time and effort.

> ... we should eventually also think about what  'na.rm=FALSE'  should/would mean for the degree 3 interpolation splines provided by spline() and splinefun().

I think that is somewhat less clear. The advantage of constant and linear interpolation is that the interpolation is local. So if one value is NA, you can have the interpolation function be NA only in the small interval that is affected by that.
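
A minimal sketch of the locality argument for the linear case. The helper name interp_local_na is hypothetical and not part of base R; it assumes x is strictly increasing with no NAs, and it returns NA outside range(x) in the spirit of rule = 1. It is only one possible reading of "NA in the affected interval", not a proposal for how approx() should be implemented.

```r
# Hypothetical helper (not in base R): linear interpolation that
# returns NA only where the result depends on a missing y value.
interp_local_na <- function(x, y, xout) {
  stopifnot(!anyNA(x), !anyNA(xout),
            !is.unsorted(x, strictly = TRUE),
            length(x) == length(y))
  n <- length(x)
  # Index i of the interval [x[i], x[i + 1]] containing each xout
  i <- findInterval(xout, x, rightmost.closed = TRUE)
  out <- rep(NA_real_, length(xout))
  ok <- i >= 1 & i < n            # outside range(x) stays NA
  j <- i[ok]
  w <- (xout[ok] - x[j]) / (x[j + 1] - x[j])
  out[ok] <- (1 - w) * y[j] + w * y[j + 1]
  # A query that sits exactly on a data point depends only on that
  # point, so take its y directly (avoids 0 * NA giving NA).
  m <- match(xout, x)
  hit <- !is.na(m)
  out[hit] <- y[m[hit]]
  out
}

x <- 1:5
y <- c(1, 2, NA, 4, 5)
xout <- c(1.5, 2, 2.5, 3, 3.5, 4, 4.5)
res_local <- interp_local_na(x, y, xout)

# For comparison: drop the NA point first, then interpolate across the gap
keep <- !is.na(y)
res_drop <- approx(x[keep], y[keep], xout)$y
```

Here res_local is NA only for queries strictly inside (2, 4) and at x = 3 itself, while res_drop interpolates straight across the gap.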

For splines (at least if one requires a continuous second derivative), the model is not local. I think that spline interpolation is a case where it might make sense to simply delete the points with NA values.
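
A small illustration of the non-locality, under assumed toy data (sin(x) on 1:10, chosen only for the example). Perturbing one data point leaves linear interpolation unchanged away from that point, but changes a natural cubic spline everywhere. The last lines show the "simply delete the NA points" approach done by hand before calling splinefun().

```r
x  <- 1:10
y  <- sin(x)
y2 <- y
y2[5] <- y2[5] + 1          # perturb a single data point

xout <- 1.5                 # query point far from x[5]

# Linear interpolation: only the two intervals touching x[5] change
lin_diff <- approx(x, y2, xout)$y - approx(x, y, xout)$y

# Natural cubic spline: the perturbation is felt at every x
spl_diff <- splinefun(x, y2, method = "natural")(xout) -
            splinefun(x, y,  method = "natural")(xout)

# Deleting NA points explicitly before fitting the spline
y_na <- y
y_na[5] <- NA
keep <- !is.na(y_na)
f  <- splinefun(x[keep], y_na[keep], method = "natural")
f5 <- f(5)                  # finite value, filled in from the neighbours
```

lin_diff is exactly zero, whereas spl_diff is small but nonzero, so there is no natural "small interval" in which to confine an NA for the spline.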

> Assume one y missing, say y[k] and we don't want to just drop the (x[k],y[k]).  This then is in some sense equivalent to having _two_ separate sets of interpolation points:
> 1)  (x[i], y[i]),  i = 1..(k-1)
> 2)  (x[i], y[i]),  i = (k+1)..n

I am not sure that interpreting missing values as "internal endpoints" is necessarily the best way to look at it. For example, one might want to use different values of the "rule" parameter for internal endpoints than for the "real" endpoints. And as you point out, a gap in the middle is off the right end of the left set, and the left end of the right set, so which value of "rule" should be used?
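
To make the ambiguity concrete, here is a sketch with assumed toy data that splits the points at a missing y[k] and queries inside the gap. It relies only on the documented "rule" argument of approx() (rule = 1 returns NA outside the data range, rule = 2 returns the value at the nearest data extreme).

```r
x <- 1:5
y <- c(1, 2, NA, 4, 5)
k <- 3                       # position of the missing value
n <- length(x)
gap <- 2.5                   # a query point inside the gap

left  <- 1:(k - 1)
right <- (k + 1):n

# rule = 2: each set extends its own nearest endpoint into the gap
from_left  <- approx(x[left],  y[left],  xout = gap, rule = 2)$y
from_right <- approx(x[right], y[right], xout = gap, rule = 2)$y

# rule = 1: both sets report NA in the gap
na_left  <- approx(x[left],  y[left],  xout = gap, rule = 1)$y
na_right <- approx(x[right], y[right], xout = gap, rule = 1)$y
```

With rule = 2 the two sets disagree in the gap (from_left is 2, from_right is 4), which is exactly the question of which value of "rule" should win.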

And spline interpolation often does special things at the endpoints, like imposing a zero second derivative. That might not necessarily be what you want at an internal gap.
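
As an illustration with assumed toy data (y = x^3 on 1:7), the sketch below compares the second derivative at x = 3 for a natural spline fitted to all points against one fitted only to the left set 1..3, as if x = 3 were an endpoint next to a gap. It uses the deriv argument of the function returned by splinefun().

```r
x <- 1:7
y <- x^3

# Natural spline through all points: x = 3 is an interior knot
f_all  <- splinefun(x, y, method = "natural")

# Natural spline through the left set only: x = 3 becomes an endpoint
f_left <- splinefun(x[1:3], y[1:3], method = "natural")

d2_all  <- f_all(3,  deriv = 2)   # clearly nonzero
d2_left <- f_left(3, deriv = 2)   # forced to (numerically) zero
```

Treating the gap as an endpoint forces the curvature to zero there, which the full fit does not do.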

> *) E.g., what should happen with  na.rm=FALSE and NA's in x[] ?

NA's in x[] are a really tricky question, and I don't think there is a good answer. For an NA in y[] with x[] not NA, the statement is "The y value at this particular location is missing or undefined, so let's mark as undefined only the interpolated values that depend explicitly on the missing value." An NA in x[], especially if the corresponding y[] is not NA, means "There is an additional data point, but we have no idea where it is. So any of our interpolated values anywhere might be wrong."

I think the only appropriate behavior is either (a) to remove points for which x[] is NA (if na.rm=TRUE), or (b) to set all interpolated values to NA if any x[] is NA (if na.rm=FALSE). Or one could throw an error, which is sort of the same as (b) but can get annoying for large data sets.
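
Options (a) and (b) could be sketched as a thin wrapper around approx(). The name approx_x_na and its na.rm argument are hypothetical, used only for illustration, and the sketch deals only with NAs in x[], not in y[].

```r
# Hypothetical wrapper (not in base R) showing options (a) and (b)
# for NAs in x[]; NAs in y[] are not handled here.
approx_x_na <- function(x, y, xout, na.rm = TRUE) {
  bad <- is.na(x)
  if (any(bad)) {
    if (!na.rm) {
      # (b) the location of some point is unknown, so nothing is trusted
      return(rep(NA_real_, length(xout)))
    }
    # (a) drop the points whose location is unknown
    x <- x[!bad]
    y <- y[!bad]
  }
  approx(x, y, xout)$y
}

x <- c(1, NA, 3)
y <- c(10, 20, 30)
res_a <- approx_x_na(x, y, xout = 2, na.rm = TRUE)
res_b <- approx_x_na(x, y, xout = 2, na.rm = FALSE)
```

In this example res_a is 20 (interpolated from the two remaining points) and res_b is NA.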

Handling missing values in general is a hard problem, with more aspects than can be built into an interpolation function. With na.rm=TRUE, the mean() function assumes that missing values have the same mean as the rest of the data set; the sum() function assumes that missing values are zero (so they do not contribute to the sum).
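
The implicit assumptions can be checked directly with a toy vector (values chosen only for illustration): removing the NA gives the same result as imputing the mean of the rest for mean(), and as imputing zero for sum().

```r
v <- c(1, 2, NA, 3)

m <- mean(v, na.rm = TRUE)
s <- sum(v, na.rm = TRUE)

# Same mean as if the NA were replaced by the mean of the other values
v_mean <- replace(v, is.na(v), m)
same_mean <- isTRUE(all.equal(mean(v_mean), m))

# Same sum as if the NA were replaced by zero
v_zero <- replace(v, is.na(v), 0)
same_sum <- isTRUE(all.equal(sum(v_zero), s))
```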

The current behavior of the interpolation functions, which would be the proposed default case of na.rm=TRUE, in effect assumes that the missing values are whatever is consistent with the neighboring values. With na.rm=FALSE, they would keep the points with missing values, but set to NA only the interpolated values that depend on them locally.

                                    --Robert Almgren
--
Quantitative Brokers         http://www.quantitativebrokers.com
