[R] missing values in logistic regression

Fri Oct 29 12:48:37 CEST 2004

On 29 Oct 2004, Avril Coghlan wrote:

> Dear R help list,
> 
>    I am trying to do a logistic regression
> where I have a categorical response variable Y
> and two numerical predictors X1 and X2. There
> are quite a lot of missing values for predictor X2.
> e.g.,
> 
> Y     X1   X2
> red   0.6  0.2    *
> red   0.5  0.2    *
> red   0.5  NA
> red   0.5  NA
> green 0.2  0.1    *
> green 0.1  NA
> green 0.1  NA
> green 0.05 0.05   *
> 
> 
> I am wondering whether I can combine X1 and X2 in
> a logistic regression to predict Y, using
> all the data for X1, even though there are NAs in
> the X2 data?
> 
> Or do I have to take only the cases for which
> there is data for both X1 and X2? (marked
> with *s above)

You need to either

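1) Train separate models for Y | X1 and Y | X1, X2, and for each case use 
the one that matches the predictors available (the second when X2 is 
observed, the first when it is missing).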

2) Produce an imputation model for X2 | X1, and use multiple imputation.

Given that the predictors look like scores in [0, 1], the mix package (as 
suggested by PD) is not likely to be appropriate, but e.g. a 2D kde 
(kernel density estimate) fit may well be.
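
For option 1, a minimal sketch in base R might look like the following. 
The data are simulated purely for illustration (they are not the 
poster's data), and the helper name predict_either is hypothetical.

```r
set.seed(1)

# Simulated stand-in for the poster's data: scores in [0, 1], NAs in X2.
n  <- 200
X1 <- runif(n)
X2 <- pmin(pmax(0.5 * X1 + rnorm(n, sd = 0.15), 0), 1)
Y  <- factor(ifelse(runif(n) < plogis(-2 + 3 * X1 + 2 * X2), "red", "green"))
X2[sample(n, 80)] <- NA
d  <- data.frame(Y, X1, X2)

# Model for Y | X1: uses every row, since X1 is always observed.
fit1  <- glm(Y ~ X1, family = binomial, data = d)

# Model for Y | X1, X2: with the default na.action (na.omit), glm()
# silently drops the rows where X2 is NA.
fit12 <- glm(Y ~ X1 + X2, family = binomial, data = d)

# Predict P(Y == "red") with whichever model matches the predictors
# available for each case.
predict_either <- function(newdata) {
  p <- predict(fit1, newdata, type = "response")
  has_x2 <- !is.na(newdata$X2)
  p[has_x2] <- predict(fit12, newdata[has_x2, , drop = FALSE],
                       type = "response")
  p
}

p <- predict_either(d)
```

Here nobs(fit1) and nobs(fit12) show how many cases each model actually 
used, which is a quick check that the second fit is complete-case only.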

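For option 2, one way to sketch multiple imputation from a 2D kde of 
(X1, X2) in base R is shown below. Everything here is an assumption made 
for illustration: the simulated data, the product-Gaussian kernel with 
bw.nrd() bandwidths, the clipping to [0, 1], the number of imputations m, 
and the helper name draw_x2. Sampling X2 | X1 from such a kde amounts to 
picking a complete case with weight given by the kernel in X1, then 
jittering its X2 by the kernel in X2.

```r
set.seed(2)

# Simulated stand-in for the poster's data: scores in [0, 1], NAs in X2.
n  <- 200
X1 <- runif(n)
X2 <- pmin(pmax(0.5 * X1 + rnorm(n, sd = 0.15), 0), 1)
Y  <- factor(ifelse(runif(n) < plogis(-2 + 3 * X1 + 2 * X2), "red", "green"))
X2[sample(n, 80)] <- NA
d  <- data.frame(Y, X1, X2)

obs <- !is.na(d$X2)
h1  <- bw.nrd(d$X1[obs])  # kernel bandwidth in the X1 direction
h2  <- bw.nrd(d$X2[obs])  # kernel bandwidth in the X2 direction

# One draw from the kde-implied conditional density of X2 given X1 = x1:
# choose a donor with weight dnorm((x1 - X1_i) / h1), then add N(0, h2^2)
# noise to the donor's X2. The draw is clipped to [0, 1].
draw_x2 <- function(x1, donors) {
  w <- dnorm(x1, mean = donors$X1, sd = h1)
  i <- sample(nrow(donors), 1, prob = w)
  min(max(rnorm(1, mean = donors$X2[i], sd = h2), 0), 1)
}

m <- 20  # number of imputed data sets (an arbitrary choice)
fits <- lapply(seq_len(m), function(k) {
  # Bootstrap the complete cases so that the imputations carry some of
  # the uncertainty in the imputation model itself.
  donors <- d[obs, ][sample(sum(obs), replace = TRUE), ]
  dk <- d
  dk$X2[!obs] <- vapply(dk$X1[!obs], draw_x2, numeric(1), donors = donors)
  glm(Y ~ X1 + X2, family = binomial, data = dk)
})

# Pool with Rubin's rules: average the estimates, and combine the
# within-imputation (W) and between-imputation (B) variances.
est       <- sapply(fits, coef)
W         <- rowMeans(sapply(fits, function(f) diag(vcov(f))))
B         <- apply(est, 1, var)
pooled    <- rowMeans(est)
pooled_se <- sqrt(W + (1 + 1 / m) * B)
```

The sketch follows the X2 | X1 formulation above. If the aim is to 
estimate the coefficients rather than to predict for new cases, usual 
multiple-imputation practice would also condition the imputation on Y.
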
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595