[R] modeling binary response variables

Tue Jul 15 03:34:48 CEST 2008

Wait, are the proportions (probabilities) based on discrete data, or are
they truly continuous? If the latter, then beta regression might be more
appropriate (e.g. package betareg). If the former, include the sample
size for each proportion in the call to glm using the weights= argument.
Or set the data up so you have a column of numbers of "successes" and a
column of "failures" and use the cbind() notation below. Multiplying your
proportion by an arbitrary large number is a bad idea because you are in
effect inventing a sample size, and so fudging the precision of the
proportion estimates.
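
A minimal sketch of the two specifications, assuming you know the number
of trials behind each proportion. The data are simulated and every name
here (dat, x, n, succ, prop) is made up for illustration, not taken from
Kevin's file:

```r
# Simulated example data: all names and values are illustrative assumptions.
set.seed(1)
dat <- data.frame(x = seq(-2, 2, length.out = 20),
                  n = rep(c(10, 25, 40, 15), 5))   # trials behind each proportion
dat$succ <- rbinom(nrow(dat), size = dat$n, prob = plogis(0.5 + 1.2 * dat$x))
dat$prop <- dat$succ / dat$n

# (1) proportion as the response, sample size supplied through weights=
fit.w <- glm(prop ~ x, family = binomial, weights = n, data = dat)

# (2) two-column response: successes and failures
fit.c <- glm(cbind(succ, n - succ) ~ x, family = binomial, data = dat)

# Compare the coefficient estimates from the two specifications
all.equal(coef(fit.w), coef(fit.c))
```

Both forms carry the real sample sizes into the fit, which is what the
round(Y*100) trick cannot do.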

HTH,

Simon.

On Mon, 2008-07-14 at 18:07 -0700, Daniel Malter wrote:
> Hi Kevin, you mean an s-shaped relationship of a variable with your response?
> So you have a response that is strictly constrained to the interval [0,1],
> and these limits are not due to truncation or censoring (i.e. your response
> variable is truly a proportion).
> 
> This sounds like a good application for a binomial model as fitting a linear
> model may give you a fit outside the limits of the interval that you are
> allowed to observe (0,1). The binomial logit (or probit, or cloglog) fixes
> that issue.
> 
> Since you have a proportion (the probability of success), you have something
> between 0 and 1. I suggest you transform it by multiplying the proportion
> by, say, 100 (or 1000) and rounding the result to the nearest integer. Say
> Y is currently your proportion: do new.Y=round(Y*100). Then you create the
> number of observations that make up the counter-probability of your
> observation: counter.Y=100-new.Y.
> 
> Then you can run the binomial as follows:
> 
> reg=glm(cbind(new.Y,counter.Y)~predictors,family=binomial) ##runs the regression
> summary(reg) ##shows the summary output of your regression
> fitted(reg) ##shows the predicted values given your data matrix and
>             ##your estimated model
> 
> You will want to check a.) whether you need a binomial (if your
> probabilities are actually reasonably distributed in a much smaller interval
> than 0,1, then you may be okay with a linear model).
> b.) if a binomial is more appropriate, you will want to check whether your
> data are overdispersed. Look at whether the residual deviance in the
> summary of your model is about equal to the residual degrees of freedom. If
> it is much larger, use family quasibinomial instead of family binomial when
> fitting the model.
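
On the overdispersion check: one common rough diagnostic is to compare the
residual deviance (or the Pearson chi-square) with the residual degrees of
freedom. A sketch with simulated, deliberately overdispersed data; the
names n, x, p, succ, fit and fit.q are all made up for illustration:

```r
# Simulated, deliberately overdispersed data; all names are illustrative assumptions.
set.seed(42)
n    <- rep(30, 40)                            # trials per observation
x    <- seq(-2, 2, length.out = 40)
p    <- plogis(0.3 + x + rnorm(40, sd = 1))    # extra noise on the logit scale
succ <- rbinom(40, size = n, prob = p)

fit <- glm(cbind(succ, n - succ) ~ x, family = binomial)

# Rough dispersion estimates: values well above 1 suggest overdispersion
deviance(fit) / df.residual(fit)
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Refit with quasibinomial: same coefficients, standard errors
# scaled by sqrt(dispersion)
fit.q <- glm(cbind(succ, n - succ) ~ x, family = quasibinomial)
summary(fit.q)$dispersion
```

The quasibinomial fit estimates the dispersion from the Pearson residuals
rather than fixing it at 1.
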
> 
> Best,
> Daniel
> 
> 
> 
> Kevin J Emerson wrote:
> > 
> > R-devotees,
> > 
> > I have a question about modeling in the case where the response variable
> > is binary.
> > 
> > I have a case where I have a response variable that is the probability of
> > success, and four descriptor variables. The response has a sigmoid
> > relationship with one of the variables. I would like to test for the
> > effect of the various descriptor variables on the percentage success of
> > the binary trait.
> > I have looked at glm with family = "binomial" but am not sure I totally
> > understand its use (and therefore am not sure it is the appropriate test)
> > and am looking for two things: (1) is glm with family = 'binomial' the
> > right way to do this, and (2) are there any good references on how it
> > works.
> > I have posted a plot of a sample of the data I am looking at as well as
> > the sample data used to generate the plots.
> > 
> > Sample Plot: http://www.uoregon.edu/~kemerson/tmp/plot.pdf
> > Sample Data: http://www.uoregon.edu/~kemerson/tmp/data.csv
> > 
> > Response variable is percent.dev (se2.dev are the errors from binomial
> > estimates given probability and number of samples).
> > 
> > Descriptor variables are num.days, ppd, temp, and pop.  
> > 
> > Any help would be greatly appreciated.
> > 
> > Cheers,
> > Kevin Emerson
> > 
> > 
> > ====================================
> > Kevin J. Emerson
> > Bradshaw - Holzapfel Lab
> > 1210 University of Oregon
> > Eugene, OR, 97403
> > email: kemerson at uoregon.edu
> > web: http://evodevo.uoregon.edu/people/emerson.html
> > 
> 
-- 
Simon Blomberg, BSc (Hons), PhD, MAppStat. 
Lecturer and Consultant Statistician 
Faculty of Biological and Chemical Sciences 
The University of Queensland 
St. Lucia Queensland 4072 
Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
http://www.uq.edu.au/~uqsblomb
email: S.Blomberg1_at_uq.edu.au

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

The combination of some data and an aching desire for 
an answer does not ensure that a reasonable answer can 
be extracted from a given body of data. - John Tukey.