[R] OT: (quasi-?) separation in a logistic GLM

Tue Dec 16 13:31:32 CET 2008

dear Gavin,
I do not know whether such comment may be still useful..

Why are you unsure about quasi-separation?
I think that it is quite evident in the plot

plot(analogs ~ Dij, data = dat)

Also it may be useful to see the plot of the monotone (profile) deviance 
(or the log-lik) for the coef of Dij,

xval<-seq(-20,0,l=50)
ll<-vector(length=50)
for(i in 1:length(xval)){
mod <- glm(analogs ~ offset(xval[i]*Dij), data = dat, family = binomial)
ll[i]<-mod$dev
}

plot(xval, ll)

Hope this helps you,

vito

Gavin Simpson ha scritto:
> Dear List,
> 
> Apologies for this off-topic post but it is R-related in the sense that
> I am trying to understand what R is telling me with the data to hand.
> 
> ROC curves have recently been used to determine a dissimilarity
> threshold for identifying whether two samples are from the same "type"
> or not. Given the bashing that ROC curves get whenever anyone asks about
> them on this list (and having implemented the ROC methodology in my
> analogue package) I wanted to try directly modelling the probability
> that two sites are analogues for one another for given dissimilarity
> using glm().
> 
> The data I have then are a logical vector ('analogs') indicating whether
> the two sites come from the same vegetation and a vector of the
> dissimilarity between the two sites ('Dij'). These are in a csv file
> currently in my university web space. Each 'row' in this file
> corresponds to single comparison between 2 sites.
> 
> When I analyse these data using glm() I get the familiar "fitted
> probabilities numerically 0 or 1 occurred" warning. The data do not look
> linearly separable when plotted (code for which is below). I have read
> Venables and Ripley's discussion of this in MASS4 and other sources that
> discuss this warning and R (Faraway's Extending the Linear Model with R
> and John Fox's new Applied Regression, Generalized Linear Models, and
> Related Methods, 2nd Ed) as well as some of the literature on Firth's
> bias reduction method. But I am still somewhat unsure what
> (quasi-)separation is and if this is the reason for the warnings in this
> case.
> 
> My question then is, is this a separation issue with my data, or is it
> quasi-separation that I have read a bit about whilst researching this
> problem? Or is this something completely different?
> 
> Code to reproduce my problem with the actual data is given below. I'd
> appreciate any comments or thoughts on this.
> 
> #### Begin code snippet ################################################
> 
> ## note data file is ~93Kb in size
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/dat.csv"))
> head(dat)
> ## fit model --- produces warning
> mod <- glm(analogs ~ Dij, data = dat, family = binomial)
> ## plot the data
> plot(analogs ~ Dij, data = dat)
> fit.mod <- fitted(mod)
> ord <- with(dat, order(Dij))
> with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))
> 
> #### End code snippet ##################################################
> 
> Thanks in advance
> 
> Gavin

-- 
====================================
Vito M.R. Muggeo
Dip.to Sc Statist e Matem `Vianelli'
Università di Palermo
viale delle Scienze, edificio 13
90128 Palermo - ITALY
tel: 091 6626240
fax: 091 485726/485612
http://dssm.unipa.it/vmuggeo