# [R] logistic regression for a data set with perfect separati

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Wed Sep 10 21:24:06 CEST 2003

```
On 10-Sep-03 Christoph Lehmann wrote:
> I have the following data
>           V1 V2
> 1 -5.8000000  0
> 2 -4.8000000  0
> 3 -2.8666667  0
> 4 -0.8666667  0
> 5 -0.7333333  0
> 6 -1.6666667  0
> 7 -0.1333333  1
> 8  1.2000000  1
> 9  1.3333333  1
>
> and I want to know, whether V1 can predict V2: of course it can, since
> there is a perfect separation between cases 1..6 and 7..9
>
> How can I test, whether this conclusion (being able to assign an
> observation i to class j, only knowing its value on Variable V1)  holds
> also for the population, our data were drawn from?
>
> Means, which inference procedure is recommended? Logistic regression,
> for obvious reasons makes no sense.

This is not so much an R question, nor really a "which procedure"
question, since standard procedures are not usually equipped to deal
with such situations (beyond telling you in some way that the situation
is "out of bounds").

However, you can certainly investigate it by writing little R programs
to look at it from various points of view.

Let 'm' denote the location parameter and 's' the scale parameter of
the CDF (e.g. a logistic function) which models the probability that
V2=1 given V1.

For a start, clearly the supremum of the likelihood is 1, approached as
s -> 0 with m any value between -0.7333.. and -0.1333..

You can investigate the variation of the likelihood as m and s vary
by evaluating expressions like

m<-(-.07);s<-1.0;L<-plogis((V1-m)/s);2*sum(V2*log(L)+(1-V2)*log(1-L))
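
# A minimal sketch (not part of the original message) that wraps the
# expression above in a function, so that it can be evaluated repeatedly.
# V1 and V2 are the data from the quoted message; 'loglik2' and 'profile_m'
# are hypothetical helper names introduced here for illustration only.
V1 <- c(-5.8, -4.8, -2.8666667, -0.8666667, -0.7333333, -1.6666667,
        -0.1333333, 1.2, 1.3333333)
V2 <- c(0, 0, 0, 0, 0, 0, 1, 1, 1)

# Twice the log-likelihood of the logistic model with location m, scale s.
# log.p = TRUE avoids log(0) when the fitted probabilities are extreme.
loglik2 <- function(m, s) {
  z <- (V1 - m) / s
  2 * sum(V2 * plogis(z, log.p = TRUE) +
          (1 - V2) * plogis(z, lower.tail = FALSE, log.p = TRUE))
}

# For a fixed s > 0, maximise over m numerically. The search interval
# (the range of V1) is an assumption, not something the post prescribes.
profile_m <- function(s) {
  optimize(function(m) loglik2(m, s), interval = range(V1), maximum = TRUE)
}

loglik2(-0.1, 0.8)   # twice the log-likelihood at one (m, s) pair
profile_m(0.8)       # $maximum is the best m, $objective the maximised value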

For instance, for any value of s>0, find the value of m which maximises
this. Then you can get an indication about your question by looking
for the value of s such that this maximised value (with sign changed)
is just on (say) the 5% point of a chisq[df=1]; my gropings suggest
that s=0.8, m=(-0.1) (approx). This gives you a pair (m,s) which is
just consistent with your data by this criterion. What, for instance,
is the probability for any value of V1 that V2=1/0?

E.g. for m=-0.1,s=0.8, consider the range -2 <= x <=2 (step=0.1):
m<-(-0.10);s<-0.8;x<-0.1*(-20:20);L<-plogis((x-m)/s);L
 0.08509905 0.09534946 0.10669059 0.11920292 0.13296424 0.14804720
 0.16451646 0.18242552 0.20181322 0.22270014 0.24508501 0.26894142
 0.29421497 0.32082130 0.34864514 0.37754067 0.40733340 0.43782350
 0.46879063 0.50000000 0.53120937 0.56217650 0.59266660 0.62245933
 0.65135486 0.67917870 0.70578503 0.73105858 0.75491499 0.77729986
 0.79818678 0.81757448 0.83548354 0.85195280 0.86703576 0.88079708
 0.89330941 0.90465054 0.91490095 0.92414182 0.93245331

so that P(V2=1) can be substantial (>0.1) for V1 as low as -1.8,
and P(V2=0) likewise for V1 as high as +1.6; yet this (m,s) is not
inconsistent with your data. So, as regards your question, it would
seem that your data do not support the generalisation to the population.
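
# A hedged sketch (not from the original message) of the search described
# earlier: for each s, maximise twice the log-likelihood over m, then look
# for the s at which that maximum, with sign changed, equals the 5% point
# of chisq[df=1]. 'max2ll', 's_star' and 'm_star' are hypothetical names,
# and the search intervals passed to optimize() and uniroot() are assumptions.
V1 <- c(-5.8, -4.8, -2.8666667, -0.8666667, -0.7333333, -1.6666667,
        -0.1333333, 1.2, 1.3333333)
V2 <- c(0, 0, 0, 0, 0, 0, 1, 1, 1)

# Maximise 2*log-likelihood over m for a fixed scale s.
max2ll <- function(s) {
  f <- function(m) {
    z <- (V1 - m) / s
    2 * sum(V2 * plogis(z, log.p = TRUE) +
            (1 - V2) * plogis(z, lower.tail = FALSE, log.p = TRUE))
  }
  optimize(f, interval = range(V1), maximum = TRUE)
}

crit <- qchisq(0.95, df = 1)          # upper 5% point of chisq[df=1]

# Solve  -(maximised value) = crit  for s, i.e. max2ll(s)$objective + crit = 0.
s_star <- uniroot(function(s) max2ll(s)$objective + crit,
                  interval = c(0.1, 5))$root
m_star <- max2ll(s_star)$maximum
c(m = m_star, s = s_star)             # compare with the rough values above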

And so on; you can plot things out, etc. You can do a simulation study:
for a given (m,s), say the pair above, and a set of V1 values like those
which you have, what is the probability that you get a set of results
(V2) which show "perfect separation"?:-- find what proportion of
simulations satisfy

max(V1[V2==0]) < min(V1[V2==1])
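
# A minimal simulation sketch (not part of the original message) for one
# (m, s) pair. Assumptions made here: a logistic model, the V1 values from
# the quoted message, an arbitrary seed and number of simulations, and the
# choice to count a result as "perfect separation" only when both classes
# occur. 'separated' is a hypothetical name.
V1 <- c(-5.8, -4.8, -2.8666667, -0.8666667, -0.7333333, -1.6666667,
        -0.1333333, 1.2, 1.3333333)
m <- -0.1; s <- 0.8
p <- plogis((V1 - m) / s)             # P(V2=1) at each observed V1

set.seed(1)                           # arbitrary, for reproducibility
nsim <- 10000
separated <- replicate(nsim, {
  V2sim <- rbinom(length(V1), size = 1, prob = p)
  # require both classes, then check that every 0 lies below every 1
  any(V2sim == 0) && any(V2sim == 1) &&
    max(V1[V2sim == 0]) < min(V1[V2sim == 1])
})
mean(separated)                       # estimated P(perfect separation)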

Explore a grid of (m,s) values and estimate this proportion; smooth the
estimates and plot a contour diagram ... and so on!
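
# A rough sketch (not from the original message) of the grid exploration:
# estimate the proportion of "perfectly separated" simulations on a small
# (m, s) grid and draw contours of the raw estimates. The grid limits,
# the number of simulations and the helper name 'sep_prob' are all
# assumptions; the smoothing step is left out to keep the sketch short.
V1 <- c(-5.8, -4.8, -2.8666667, -0.8666667, -0.7333333, -1.6666667,
        -0.1333333, 1.2, 1.3333333)

sep_prob <- function(m, s, nsim = 1000) {
  p <- plogis((V1 - m) / s)
  mean(replicate(nsim, {
    y <- rbinom(length(V1), size = 1, prob = p)
    # both classes present, and every 0 below every 1
    any(y == 0) && any(y == 1) && max(V1[y == 0]) < min(V1[y == 1])
  }))
}

set.seed(1)                                   # arbitrary seed
m_grid <- seq(-2, 1, by = 0.5)
s_grid <- seq(0.2, 2, by = 0.3)
# Vectorize() lets outer() call sep_prob once per (m, s) combination.
est <- outer(m_grid, s_grid, Vectorize(sep_prob))
contour(m_grid, s_grid, est, xlab = "m", ylab = "s")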

Use R as a tool for questions like this, and do not necessarily expect to
find a procedure which is tailor-made for (e.g.) this particular question!

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 10-Sep-03                                       Time: 20:24:06
------------------------------ XFMail ------------------------------

```