[R] reference category for factor in regression

Marc Schwartz marc_schwartz at comcast.net
Mon Jan 19 17:51:14 CET 2009


Jos,

See ?relevel for information on how to reorder the levels of a factor,
while being able to specify the reference level.
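
As a rough sketch of what relevel() does (the age bands below are made up
for illustration and are not your actual categories):

```r
# Hypothetical age bands; these labels are assumptions, not the real data
AGE <- factor(c("18-29", "30-44", "45-59", "60+", "30-44", "18-29"))

# By default the levels are sorted, so "18-29" comes first and is the reference
levels(AGE)

# Move "45-59" to the front; the remaining levels keep their relative order
AGE2 <- relevel(AGE, ref = "45-59")
levels(AGE2)   # "45-59" "18-29" "30-44" "60+"
```

Note that relevel() returns a new factor, so the result has to be assigned
before it is used in the model.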

Basically, the first level of the factor is taken as the reference. If
you want to utilize a different ordering, as an alternative to the
above, simply use:

  AGE <- factor(AGE, levels = c(FirstLevel, SecondLevel, ...))
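
For instance, a small sketch with simulated data (the level names, the
sample size and the VOTE values are all invented here, and the default
treatment contrasts are assumed):

```r
set.seed(1)  # simulated data, for illustration only
age_raw <- sample(c("young", "middle", "old"), 200, replace = TRUE)
VOTE    <- rbinom(200, 1, 0.5)

# Default: levels are sorted alphabetically, so "middle" would be the reference
levels(factor(age_raw))

# Explicit ordering: "young" is listed first and becomes the reference
AGE <- factor(age_raw, levels = c("young", "middle", "old"))
fit <- glm(VOTE ~ AGE, family = binomial(link = "logit"))

# The reference level gets no coefficient of its own
names(coef(fit))   # "(Intercept)" "AGEmiddle" "AGEold"
```

The level that does not show up among the coefficient names is the one
being used as the reference.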

BTW, you might want to review Frank Harrell's page on why categorizing a
continuous variable is not a good idea:

  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous

HTH,

Marc Schwartz


on 01/19/2009 09:52 AM Jos Elkink wrote:
> Hi Thierry,
> 
> Thanks for your quick answer. The problem is not so much the LABOUR
> variable, however, but the AGE variable, which consists of about 5
> categories for which I do indeed not create separate dummy variables.
> But R does not behave as expected when deciding on which dummy to use
> as reference category ...
> 
> Jos
> 
> On Mon, Jan 19, 2009 at 2:37 PM, ONKELINX, Thierry
> <Thierry.ONKELINX at inbo.be> wrote:
>> Dear Jos,
>>
>> In R you don't need to create your own dummy variables. Just create a
>> factor variable LABOUR (with two levels) and rerun your model. Then you
>> should be able to calculate all coefficients.
>>
>> HTH,
>>
>> Thierry
>>
>> ----------------------------------------------------------------------------
>> ir. Thierry Onkelinx
>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>> and Forest
>> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
>> methodology and quality assurance
>> Gaverstraat 4
>> 9500 Geraardsbergen
>> Belgium
>> tel. + 32 54/436 185
>> Thierry.Onkelinx at inbo.be
>> www.inbo.be
>>
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to
>> say what the experiment died of.
>> ~ Sir Ronald Aylmer Fisher
>>
>> The plural of anecdote is not data.
>> ~ Roger Brinner
>>
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of
>> data.
>> ~ John Tukey
>>
>> -----Original message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On behalf of Jos Elkink
>> Sent: Monday 19 January 2009 15:16
>> To: r-help at r-project.org
>> Subject: [R] reference category for factor in regression
>>
>> Hi all,
>>
>> I am struggling with a strange issue in R that I have not encountered
>> before and I am not sure how to resolve this.
>>
>> The model looks like this, with all irrelevant variables left out:
>>
>> LABOUR - a dummy variable
>> NONLABOUR = 1 - LABOUR
>> AGE - a categorical variable / factor
>> VOTE - a dummy variable
>>
>> glm(VOTE ~ 0 + LABOUR + NONLABOUR + LABOUR : AGE + NONLABOUR : AGE,
>> family=binomial(link="logit"))
>>
>> In other words, a standard interaction model, but I want to know the
>> intercepts and coefficients for each of the two cases (LABOUR and
>> NONLABOUR), instead of getting coefficients for the differences as in
>> a normal interaction model.
>>
>> But the strange thing is, for the two occurrences of the AGE variable,
>> it makes a different choice as to which AGE category to leave out of
>> the regression. The cross-table of AGE with LABOUR does not have empty
>> cells.
>>
>> Anyone any idea what might be going wrong? Or what I could do about
>> this?
>>
>> Thanks in advance for any help!
>>
>> Regards,
>>
>> Jos
>>



