[R] variable types - logistic regression
Joshua Wiley
jwiley.psych at gmail.com
Fri Nov 25 23:33:44 CET 2011
Hi Ben,
The following is oversimplified but hopefully helpful. Regression
only works with numbers, so the trick is how to convert non-numeric
data into meaningful numbers. For so-called continuous data (the type
you get from running rnorm(100)), nothing needs to be done. For other
data (e.g., what you get from sample(1:5, 100, replace = TRUE)), the
values may not be truly continuous, but they are often treated as
such. This type is particularly common in the social sciences, where
questionnaires and surveys ask participants to rate things on 1 to 5,
1 to 7, or similar scales.
When you move on to data that are not really continuous and that you
do not want to treat as such (say first, second, third place), some
scheme has to be used to convert them. Most commonly, contrasts are
used, so that certain levels are contrasted with others. In R, for
ordered factors, the default contrasts are orthogonal polynomials.
For example, the contrasts for the first, second, third place example
can be inspected with:
contrasts(factor(1:3, ordered = TRUE))
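As a rough illustration, the call above can be run and checked as in the sketch below. The object name cm is only a placeholder, and the values in the comments are rounded.

```r
# Polynomial contrasts for a three-level ordered factor
# (first, second, third place).
cm <- contrasts(factor(1:3, ordered = TRUE))
print(cm)
# The .L column is roughly -0.71, 0, 0.71 (a straight line);
# the .Q column is roughly 0.41, -0.82, 0.41 (a U shape).

# Three levels give two contrast columns.
ncol(cm)

# The two columns are orthogonal: their cross-product is
# zero up to floating-point error.
sum(cm[, ".L"] * cm[, ".Q"])
```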
.L and .Q stand for linear and quadratic, respectively. For k
levels, there will be k - 1 contrast columns. This relaxes the
linearity assumption applied to continuous data by testing the effects
of first-order, second-order, etc. polynomials. If the data have no
meaningful order, say explaining levels of Red Bull consumption by
college major, the default contrasts applied by R are "dummy codes"
(treatment contrasts). This picks one group (the first level) as the
referent and compares each of the other groups with it. For
example, suppose we had a small sample of only three college majors:
contrasts(factor(1:3))
Here 1 is the reference group; the first contrast tests the effect of
being in group 2 versus group 1, and the second tests group 3 versus
group 1.
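To make the dummy coding concrete, here is a small sketch with made-up data. The variable major and its level names are hypothetical, and the sketch assumes R's default contrast options have not been changed.

```r
# Hypothetical data: three unordered college majors.
major <- factor(c("art", "biology", "chemistry", "biology", "art"))

# Default contrasts for an unordered factor are treatment
# ("dummy") codes; the first level ("art") is the referent
# and is coded as all zeros.
print(contrasts(major))

# model.matrix() shows the 0/1 columns a regression would
# actually use for this factor.
mm <- model.matrix(~ major)
print(colnames(mm))
```

If a different referent is wanted, relevel(major, ref = "biology") is one way to change it before fitting.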
All of these work with logistic regression, or any flavour of
generalized linear model (via glm() and other functions). In many
regards, the treatment of predictors in logistic regression is no
different from basic linear regression (ordinary least squares [OLS]).
The logistic (logit) link operates on the outcome, not the predictors.
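A minimal sketch of a model mixing the predictor types discussed above follows. The data are simulated and every variable name is invented for illustration. Note that factor() accepts ordered = TRUE, whereas as.factor() has no such argument.

```r
set.seed(1)
n <- 200

# Simulated, purely illustrative data; all names are made up.
mydata <- data.frame(
  x_cont = rnorm(n),                        # continuous
  x_rate = sample(1:5, n, replace = TRUE),  # 1-5 rating, kept numeric
  x_ord  = factor(sample(c("first", "second", "third"), n, replace = TRUE),
                  levels = c("first", "second", "third"),
                  ordered = TRUE),          # ordered factor
  x_nom  = factor(sample(c("art", "biology", "chemistry"), n,
                         replace = TRUE))   # nominal factor
)
mydata$y <- rbinom(n, size = 1, prob = plogis(0.5 * mydata$x_cont))

# The formula is written exactly as it would be for lm();
# only the family (and hence the link on the outcome) differs.
fit <- glm(y ~ x_cont + x_rate + x_ord + x_nom,
           data = mydata, family = binomial(link = "logit"))

# Numeric predictors get one coefficient each; the ordered
# factor gets .L and .Q terms; the nominal factor gets one
# dummy per non-reference level.
print(names(coef(fit)))
```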
That said, some special considerations do come into play. You need
some variability on all of your predictors. In OLS with a truly
continuous outcome, if you have a two-level nominal predictor with
some people in each level, it is unlikely that everyone in a given
cell would have the same value. However, with a 0/1 outcome and a 0/1
predictor, it may be that in one particular cell everyone has a 0 (or
everyone has a 1) for the outcome, which can be problematic for
estimation purposes (this is often called separation).
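One simple way to look for the empty-cell problem described above is to cross-tabulate the predictor and the outcome before fitting. The sketch below uses tiny hypothetical vectors x and y.

```r
# Hypothetical 0/1 predictor and 0/1 outcome in which
# nobody with x == 1 has y == 0.
x <- c(0, 0, 0, 0, 1, 1, 1, 1)
y <- c(0, 1, 0, 1, 1, 1, 1, 1)

# Cross-tabulate predictor by outcome.
tab <- table(x = x, y = y)
print(tab)

# Any zero cell is a warning sign worth investigating
# before trusting estimates from glm().
print(any(tab == 0))
```

With data like these, glm() will typically still return estimates, but they tend to be very large with huge standard errors, so the table is worth checking first.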
What sorts of data are you dealing with? Is just entering the
variables or using factor() not doing what you expect with some? I
have not looked at the web page you referenced much but if you have an
example type of data you feel is not covered or would like more fully
covered, feel free to email me off list and I can add an example to
the page.
Cheers,
Josh
On Fri, Nov 25, 2011 at 2:09 PM, Ben quant <ccquant at gmail.com> wrote:
> Hello,
>
> Is there an example out there that shows how to treat each of the predictor
> variable types when doing logistic regression in R? Something like this:
>
> glm(y~x1+x2+x3+x4, data=mydata, family=binomial(link="logit"),
> na.action=na.pass)
>
> I'm drawing mostly from:
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> ...but there are only two types of variable in the example given. I'm
> wondering if the answer is that easy or if I have to consider more with
> different types of variables. It seems like as.factor() is doing a lot of
> the organization for me.
>
> I will need to understand how to perform logistic regression in R on all
> data types all in the same model (potentially).
>
> As it stands, I think I can solve all of my data type issues with:
>
> as.factor(x,ordered=T) ...for all discrete ordinal variables
> as.factor(x, ordered=F) ...for all discrete nominal variables
> ...and do nothing for everything else.
>
> I'm pretty sure it's not that simple because of some other posts I've seen,
> but I haven't seen a post that discusses ALL data types in logistic
> regression.
>
> Here is what I think will work at this point:
>
> glm(y ~ **all_other_vars + as.factor(disc_ord_var,ordered=T) +
> as.factor(disc_nom_var,ordered=F), data=mydata,
> family=binomial(link="logit"), na.action=na.pass)
>
> I'm also looking for any best practices help as well. I'm new'ish to
> R...and oddly enough I haven't had the pleasure of doing much regression
> in R yet.
>
> Regards,
>
> Ben
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/