[R] lm() silently drops NAs

Sat Jul 30 21:06:25 CEST 2016

>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>>     on Tue, 26 Jul 2016 23:30:31 +0200 writes:

    >> On 26 Jul 2016, at 22:26 , Hadley Wickham
    >> <h.wickham at gmail.com> wrote:
    >> 
    >> On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler
    >> <maechler at stat.math.ethz.ch> wrote:
    >>> 
    > ...
    >> To me, this would be the most sensible default behaviour,
    >> but I realise it's too late to change without breaking
    >> many existing expectations.

    > Probably.

Well, ... yes, Hadley's proposal would change too many default outputs.
Still, I'd tend to think we should at least contemplate changing
the default (mildly).  Before doing that, we could at least provide one
or a few new  na.* functions which people could try using as their own
defaults, using options(na.action = *) __ or __ as I read in the
white book (see below) by

      attr(<data.frame>, "na.action") <-  na.<myfavorite>

That is actually something that I have not seen advertised much,
but seems very reasonable to me.

It often does make sense to set a dataset-specific NA handling.
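
A minimal sketch of the two mechanisms just mentioned, assuming the
made-up toy data frame `d` below. It relies on model.frame() looking first
at the "na.action" attribute of the data and then at
getOption("na.action") when no na.action argument is passed:

```r
# Toy data with one missing response value (made up for illustration).
d <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

# 1) Session-wide default via options(); restored right after the fit.
old <- options(na.action = "na.exclude")
fit_opt <- lm(y ~ x, data = d)
options(old)

# 2) Dataset-specific default via an attribute on the data frame.
d2 <- d
attr(d2, "na.action") <- na.exclude
fit_attr <- lm(y ~ x, data = d2)

# Both fits record that rows were handled by na.exclude, not na.omit.
class(fit_opt$na.action)   # "exclude"
class(fit_attr$na.action)  # "exclude"
```

Note that the attribute travels with the data frame object, so it is worth
checking that it survives whatever subsetting or merging is done before
the model is fitted.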

I agree (with Hadley's implicit statement) that 'na.exclude' is
often much more reasonable than 'na.omit' and that was the
reason it was invented *after* the white book, IIRC, for S-PLUS,
not S; and quite advertized at the time (at least by some people
to whom I listened ...)

If we changed the default na.action, we would probably not want
to go as far as a version of Hadley's (corrected in 2nd e-mail)

   na.warn <- function(object, ...) {
     missing <- !complete.cases(object)
     if (any(missing)) {
       warning("Dropping ", sum(missing), " rows with missing values",
               call. = FALSE)
     }
     na.exclude(object, ...)
   }

[the warning probably needs to be a bit smarter: you don't always have
 "rows", sometimes it's "entries" in a vector]

but *providing* something like the above, possibly rather named
along the line of

	'na.warn.and.keep'

and also providing a  'na.warn.and.omit'  (which I would call
'na.warn', as after all it would be very close to the current R
default, na.omit, but would warn additionally).
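
A rough sketch of what such a pair could look like. Neither function
exists in R; the names are the ones floated above, and the helper
`warn_if_missing` is purely hypothetical. It also picks "entries" rather
than "rows" when the object has no dimensions:

```r
# Hypothetical helper: warn about missing values, choosing the noun
# ("rows" vs "entries") from the shape of the object.
warn_if_missing <- function(object) {
  n_missing <- sum(!complete.cases(object))
  if (n_missing > 0) {
    unit <- if (is.null(dim(object))) "entries" else "rows"
    warning("Dropping ", n_missing, " ", unit, " with missing values",
            call. = FALSE)
  }
  invisible(n_missing)
}

# Warn, then behave like the current default na.omit.
na.warn.and.omit <- function(object, ...) {
  warn_if_missing(object)
  na.omit(object, ...)
}

# Warn, then behave like na.exclude (NAs are kept in residuals etc.).
na.warn.and.keep <- function(object, ...) {
  warn_if_missing(object)
  na.exclude(object, ...)
}
```

Either one could then be tried per call, e.g.
lm(y ~ x, data = d, na.action = na.warn.and.keep), or as a session
default via options(na.action = "na.warn.and.keep").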

    > Re. the default choice, my recollection is that at the
    > time the only choices available were na.omit and na.fail.

    > S-PLUS was using na.fail for all the usual good
    > reasons, but from a practical perspective, the consequence
    > was that, since almost every data set has NA values, you
    > got an error unless you added na.action=na.omit to every
    > single lm() call. And habitually typing na.action=na.omit
    > doesn't really solve any of the issues with different
    > models being fit to different subsets and all that. 

yes, indeed; you are right:

- The white book (p. 112-113) indeed explains that 'na.fail' is
  (i.e., was there) the default.

- an important example of where na.omit was not good enough at
  the time was stepwise regression (where the number of complete
  cases may depend on the current subset of predictor
  variables), and also bootstrapping and cross-validation.
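
To illustrate the stepwise point with made-up data (the data frame and
variable names below are assumptions for the example only): with the
default na.omit, two candidate models can silently be fitted to
different numbers of rows.

```r
# Toy data: x2 has two missing values, x1 and y have none.
d <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 8.0),
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(0.5, NA, 1.4, 2.2, NA, 3.1, 3.3, 4.0)
)

fit1 <- lm(y ~ x1, data = d)       # uses all 8 rows
fit2 <- lm(y ~ x1 + x2, data = d)  # silently uses only 6 rows

nobs(fit1)  # 8
nobs(fit2)  # 6

# One way to make the candidates comparable: restrict to the rows
# that are complete for *all* variables under consideration first.
d_cc <- d[complete.cases(d), ]
fit1_cc <- lm(y ~ x1, data = d_cc)
fit2_cc <- lm(y ~ x1 + x2, data = d_cc)
nobs(fit1_cc) == nobs(fit2_cc)  # TRUE
```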

    > So the
    > rationale for doing it differently in R was that it was
    > better to get some probably meaningful output rather than
    > to be certain of getting nothing. And, that was what the
    > mainstream packages of the time were doing.

    >> On a related note, I've never really understood why it's
    >> called na.exclude - from my perspective it causes the
    >> _inclusion_ of missing values in the
    >> predictions/residuals.

    > I think the notion is that you exclude them from the
    > analysis, but keep them around for the other purposes.
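
A small sketch of that distinction, again with a made-up data frame `d`:
the fitted coefficients are the same under na.omit and na.exclude, but
only na.exclude pads residuals (and fitted values) back to the length of
the original data.

```r
# Toy data with one missing response value (made up for illustration).
d <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# The incomplete row is excluded from the analysis in both cases.
all.equal(coef(fit_omit), coef(fit_excl))  # TRUE

# But only na.exclude keeps a slot for it in the extracted results.
length(residuals(fit_omit))  # 5
length(residuals(fit_excl))  # 6
is.na(residuals(fit_excl)[3])

# So the residuals line up with the original rows.
d$res <- residuals(fit_excl)
```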

    > -pd

    >> 
    >> Thanks for the (as always!) informative response, Martin.
    >> 
    >> Hadley
    >> 
    >> -- 
    >> http://hadley.nz
    >> 

    > -- 
    > Peter Dalgaard, Professor, Center for Statistics,
    > Copenhagen Business School Solbjerg Plads 3, 2000
    > Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23
    > Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com