[R] logistic regression in an incomplete dataset

Tue Apr 6 00:19:04 CEST 2010

Dear Emmanuel,

Thank you.

Yes I broadly agree with what you say.
I think ML is a better strategy than complete-case analysis, because I think
its estimates will be more robust.
For unbiased estimates, I think
  ML requires the data to be MAR,
  complete-case analysis requires the data to be MCAR.
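
The MCAR/MAR distinction can be illustrated with a small simulation. This is
only a sketch on made-up data: the variable names, sample size and missingness
probabilities are assumptions, and a simple regression adjustment on the
observed covariate stands in for a full likelihood-based analysis.

```r
# Simulated data (all names and parameters here are illustrative assumptions).
set.seed(1)
n  <- 10000
x1 <- rnorm(n)                 # always observed
x2 <- x1 + rnorm(n)            # partially observed; its true mean is 0

# MCAR: every x2 value has the same 50% chance of being missing.
miss_mcar <- rbinom(n, 1, 0.5) == 1

# MAR: the chance that x2 is missing depends only on the observed x1.
miss_mar <- rbinom(n, 1, plogis(2 * x1)) == 1

mean_mcar <- mean(x2[!miss_mcar])   # complete cases under MCAR
mean_mar  <- mean(x2[!miss_mar])    # complete cases under MAR

# Under MAR, modelling x2 from the observed x1 and averaging the
# predictions over all cases corrects the complete-case estimate.
fit      <- lm(x2 ~ x1, subset = !miss_mar)
mean_adj <- mean(predict(fit, newdata = data.frame(x1 = x1)))

print(c(mcar = mean_mcar, mar = mean_mar, mar_adjusted = mean_adj))
```

In this setup the complete-case mean should stay close to the truth under
MCAR and drift away from it under MAR, while the adjusted estimate should
come back close to the truth.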

Anyway I would have thought ML could be done without resorting to Multiple
Imputation, but I'm at the edge of my knowledge here.

Thanks once again,

regards
Desmond

From: Emmanuel Charpentier <charpent <at> bacbuc.dyndns.org>
Subject: Re: logistic regression in an incomplete dataset
Newsgroups: gmane.comp.lang.r.general
Date: 2010-04-05 19:58:20 GMT

Dear Desmond,

a somewhat analogous question has been posed recently (about 2 weeks
ago) on the sig-mixed-model list, and I tried (in two posts) to give
some elements of information (and some bibliographic pointers). To
summarize tersely:

- a model of "information missingness" (i.e. *why* are some data
missing?) is necessary to choose the right measures to take. Two
special cases (Missing At Random and Missing Completely At Random) allow
for (semi-)automated compensation. See literature for further details.

- complete-case analysis may give seriously weakened and *biased*
results. Pairwise-complete-case analysis is usually *worse*.

- simple imputation leads to underestimated variances and might also
give biased results.

- multiple imputation is currently thought of as a good way to alleviate
missing data if you have a missingness model (or can honestly bet on
MCAR or MAR), and if you properly combine the results of your
imputations.

- A few missing data packages exist in R to handle this case. My personal
selection at this point would be mice, mi, Amelia, and possibly mitools,
but none of them is fully satisfying (in particular, accounting for a
random effect needs special handling all the way in all packages...).

- An interesting alternative is to write a full probability model (in
BUGS for example) and use Bayesian estimation; in this framework,
missing data are "naturally" modeled in the model used for analysis.
However, this might entail a *large* amount of work, be difficult and not
always succeed (numerical difficulties). Furthermore, the results of a
Bayesian analysis might not be what you seek...
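
The multiple-imputation point above (impute several times, then combine
properly) can be sketched in base R. Everything below is an assumption made
for illustration: the data are simulated, the imputation model is a plain
linear regression refitted on a bootstrap sample, and the combination step is
Rubin's rules written out by hand. Packages such as mice or mitools automate
these steps and should be preferred for real work.

```r
# Simulated data; every name and parameter below is an illustrative assumption.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 1.0 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Make x2 missing at random: missingness depends on observed x1 and y only.
dat$x2[rbinom(n, 1, plogis(-1 + x1 + y)) == 1] <- NA

m    <- 20                       # number of imputed datasets
obs  <- which(!is.na(dat$x2))
mis  <- which(is.na(dat$x2))
ests <- matrix(NA_real_, m, 3)   # one row of coefficients per imputation
vars <- matrix(NA_real_, m, 3)   # matching squared standard errors

for (i in seq_len(m)) {
  # Bootstrap the observed rows so imputation-model uncertainty is propagated.
  boot    <- dat[sample(obs, replace = TRUE), ]
  imp_fit <- lm(x2 ~ x1 + y, data = boot)
  # Draw imputations from the fitted model, adding residual noise.
  filled <- dat
  filled$x2[mis] <- predict(imp_fit, newdata = dat[mis, ]) +
    rnorm(length(mis), sd = sigma(imp_fit))
  # Analysis model, fitted on the completed dataset.
  fit <- glm(y ~ x1 + x2, family = binomial(logit), data = filled)
  ests[i, ] <- coef(fit)
  vars[i, ] <- diag(vcov(fit))
}

# Rubin's rules: pooled estimate, within and between variance, total variance.
pooled  <- colMeans(ests)
within  <- colMeans(vars)
between <- apply(ests, 2, var)
total   <- within + (1 + 1 / m) * between

result <- data.frame(term = names(coef(fit)), estimate = pooled,
                     se = sqrt(total))
print(result)
```

The total variance adds the between-imputation spread to the average
within-imputation variance, which is what single imputation leaves out.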

HTH,

					Emmanuel Charpentier

On Monday 05 April 2010 at 11:34 +0100, Desmond Campbell wrote:
> Dear all,
>
> I want to do a logistic regression.
> So far I've only found out how to do that in R, in a dataset of complete
> cases.
> I'd like to do logistic regression via max likelihood, using all the
> study cases (complete and incomplete). Can you help?
>
> I'm using glm() with family=binomial(logit).
> If any covariate in a study case is missing then the study case is
> dropped, i.e. it is doing a complete cases analysis.
> As a lot of study cases are being dropped, I'd rather it did maximum
> likelihood using all the study cases.
> I tried setting glm()'s na.action to NULL, but then it complained about
> NA's present in the study cases.
> I've about 1000 unmatched study cases and less than 10 covariates so
> could use unconditional ML estimation (as opposed to conditional ML
> estimation).
>
> regards
> Desmond
>
>
> --
> Desmond Campbell
> UCL Genetics Institute
> D.Campbell at ucl.ac.uk
> Tel. ext. 020 31084006, int. 54006
>
>
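
The complete-case behaviour of glm() described in the original question can
be checked directly. This is a minimal sketch on simulated data; the dataset,
variable names and number of missing values are assumptions chosen only for
illustration.

```r
# Small simulated dataset; names and values are illustrative assumptions.
set.seed(7)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.5 * dat$x2))
dat$x2[sample(n, 60)] <- NA          # 60 study cases lose a covariate

# na.omit is the usual default; it is spelled out here in case
# options(na.action) has been changed in the session.
fit <- glm(y ~ x1 + x2, family = binomial(logit), data = dat,
           na.action = na.omit)

n_complete <- sum(complete.cases(dat))
n_used     <- nobs(fit)                   # rows that entered the fit
n_dropped  <- length(na.action(fit))      # rows removed because of NAs

print(c(complete = n_complete, used = n_used, dropped = n_dropped))
```

Comparing nobs(fit) with nrow(dat) is a quick way to see how many study
cases a fit has silently discarded.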