[R] logistic regression for a data set with perfect separation

David Firth david.firth at nuffield.oxford.ac.uk
Wed Sep 10 20:39:39 CEST 2003

```On Wednesday, Sep 10, 2003, at 18:50 Europe/London, Christoph Lehmann
wrote:

> Dear R experts
>
> I have the following data
>           V1 V2
> 1 -5.8000000  0
> 2 -4.8000000  0
> 3 -2.8666667  0
> 4 -0.8666667  0
> 5 -0.7333333  0
> 6 -1.6666667  0
> 7 -0.1333333  1
> 8  1.2000000  1
> 9  1.3333333  1
>
> and I want to know, whether V1 can predict V2: of course it can, since
> there is a perfect separation between cases 1..6 and 7..9
>
> How can I test, whether this conclusion (being able to assign an
> observation i to class j, only knowing its value on Variable V1)  holds
> also for the population, our data were drawn from?

For this you really need more data.  The only way you'll ever be able
to reject that hypothesis is by finding the V2 pattern 010 or 101 in
the sample once it is ordered by V1.  And if you find such a pattern
then you can reject with certainty.

>
> That is, which inference procedure is recommended? Logistic regression,
> for obvious reasons, makes no sense.

Not so obvious to me!  Logistic regression still makes sense, but care
is needed in the method of estimation/inference.  The maximum
likelihood solution in the above case is a model which says V2 is 1
with certainty at some values of V1, and is zero with certainty at
other values; and that seems an unwarranted inference with so little
data.  That's a criticism of maximum likelihood, rather than a
criticism of logistic regression.  (Think about the more extreme
situation of tossing a coin once: if a head is observed, the ML
solution is that the coin lands heads with certainty, i.e. that there
is no chance of tails.)

There are alternative (Bayesian and pseudo-Bayesian) methods of
inference which can yield more sensible answers in general.  [One such
is implemented in package brlr ("bias reduced logistic regression") on
CRAN.]  To "test" the hypothesis described above, though, with the data
you have, would seem to require a fully Bayesian analysis whose
conclusions would depend strongly on the prior probability attached to
the hypothesis.  i.e. you need more data...

I hope that helps in some way!

Regards,
David

>
> Many thanks for your help
>
> Christoph
> --
> Christoph Lehmann <christoph.lehmann at gmx.ch>

```
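The points made in the reply can be checked numerically with a short sketch. It is not part of the original post and is written in plain Python (standard library only) rather than R. The helper names (`label_changes`, `log_likelihood`, `ml_estimate`, `posterior_mean_uniform_prior`) are hypothetical. Fixing the cut point midway between the two groups and using a uniform prior for the coin are assumptions made purely for illustration. The sketch shows three things: the sample has no 010 or 101 pattern when ordered by V1, the logistic log-likelihood keeps rising towards 0 as the slope grows (so no finite ML estimate exists), and a simple Bayesian estimate for the one-toss coin stays away from certainty.

```python
import math

# (V1, V2) pairs copied from the question.
data = [
    (-5.8000000, 0), (-4.8000000, 0), (-2.8666667, 0),
    (-0.8666667, 0), (-0.7333333, 0), (-1.6666667, 0),
    (-0.1333333, 1), (1.2000000, 1), (1.3333333, 1),
]


def label_changes(pairs):
    """Count how often V2 switches value once the cases are ordered by V1.

    0 or 1 switches is consistent with a single cut point on V1;
    2 or more means a 010 or 101 pattern is present.
    """
    labels = [y for _, y in sorted(pairs)]
    return sum(a != b for a, b in zip(labels, labels[1:]))


def log_likelihood(pairs, cut, slope):
    """Logistic log-likelihood with P(V2 = 1) = 1 / (1 + exp(-slope * (V1 - cut)))."""
    total = 0.0
    for x, y in pairs:
        eta = slope * (x - cut)
        z = eta if y == 1 else -eta  # log P(observed y) = -log(1 + exp(-z))
        if z >= 0:
            total += -math.log1p(math.exp(-z))
        else:
            total += z - math.log1p(math.exp(z))  # same value, avoids overflow
    return total


def ml_estimate(heads, tosses):
    """Maximum likelihood estimate of P(heads)."""
    return heads / tosses


def posterior_mean_uniform_prior(heads, tosses):
    """Posterior mean of P(heads) under a uniform Beta(1, 1) prior (an assumption)."""
    return (heads + 1) / (tosses + 2)


# Illustrative cut point: midway between the largest V1 with V2 = 0
# and the smallest V1 with V2 = 1.
cut = (max(x for x, y in data if y == 0) + min(x for x, y in data if y == 1)) / 2

print("label changes in V1 order:", label_changes(data))
for slope in (1, 10, 100):
    print("slope", slope, "log-likelihood", log_likelihood(data, cut, slope))
print("coin, one head in one toss: ML", ml_estimate(1, 1),
      "posterior mean", posterior_mean_uniform_prior(1, 1))
```

Because every case lies on the correct side of the cut, each increase in the slope pushes all fitted probabilities closer to 0 or 1, so the likelihood never reaches a maximum at a finite slope. This is the behaviour the reply attributes to maximum likelihood rather than to logistic regression itself.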