[R] Dummy variables or factors?

Wed Oct 21 05:51:27 CEST 2009

Oh dear, that doesn't look right at all.  I shall have a think about
what I did wrong and maybe follow my own advice and consult the doco
myself!

On Oct 21, 2:45 pm, andrew <andrewjohnro... at gmail.com> wrote:
> The following is *significantly* easier than adding dummy variables
> by hand. The dummy variable approach gives exactly the same fit as
> the factor method, but possibly with a different baseline.
>
> Basically, you might want to read the lm help and possibly consult a
> stats book on how the design matrix is constructed in both cases.
>
> > xF <- factor(1:10)
> > N <- 1000
> > xFs <- sample(x = xF, size = N, replace = TRUE)
> > yFs <- rnorm(N, mean = as.numeric(xFs))
> > lm(yFs ~ xFs)
>
> Call:
> lm(formula = yFs ~ xFs)
>
> Coefficients:
> (Intercept)         xFs2         xFs3         xFs4         xFs5
>      0.7845       1.1620       2.1474       3.1391       4.2183
>        xFs6         xFs7         xFs8         xFs9        xFs10
>      5.2621       6.0814       7.4170       8.2193       9.2987
>
> > lm(yFs ~ diag(10)[,1:9][xFs,])
>
> Call:
> lm(formula = yFs ~ diag(10)[, 1:9][xFs, ])
>
> Coefficients:
>             (Intercept)  diag(10)[, 1:9][xFs, ]1  diag(10)[, 1:9][xFs, ]2
>                  10.083                   -9.299                   -8.137
> diag(10)[, 1:9][xFs, ]3  diag(10)[, 1:9][xFs, ]4  diag(10)[, 1:9][xFs, ]5
>                  -7.151                   -6.160                   -5.080
> diag(10)[, 1:9][xFs, ]6  diag(10)[, 1:9][xFs, ]7  diag(10)[, 1:9][xFs, ]8
>                  -4.037                   -3.217                   -1.882
> diag(10)[, 1:9][xFs, ]9
>                  -1.079
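A quick way to check the "same answer, different baseline" claim is to compare fitted values rather than coefficients. In the output quoted above, the factor fit uses level 1 as its baseline and the diag() fit uses level 10; the numbers are consistent with that, since 0.7845 + 9.2987 is about 10.083, the intercept of the second fit. The sketch below reruns the quoted simulation; the seed and the object names fit_factor, fit_dummy and fit_ref10 are my own assumptions, not part of the original post.

```r
set.seed(1)                                  # assumed seed, only for reproducibility
xF  <- factor(1:10)
N   <- 1000
xFs <- sample(xF, N, replace = TRUE)
yFs <- rnorm(N, mean = as.numeric(xFs))

fit_factor <- lm(yFs ~ xFs)                  # treatment contrasts, baseline = level 1
D          <- diag(10)[, 1:9][xFs, ]         # hand-made dummies, baseline = level 10
fit_dummy  <- lm(yFs ~ D)

# Identical fitted values: the two parameterisations describe the same model
same_fit <- isTRUE(all.equal(unname(fitted(fit_factor)),
                             unname(fitted(fit_dummy))))

# Moving the factor baseline to level 10 reproduces the dummy coefficients
fit_ref10 <- lm(yFs ~ relevel(xFs, ref = "10"))
same_coef <- isTRUE(all.equal(unname(coef(fit_ref10)),
                              unname(coef(fit_dummy))))

print(c(same_fit = same_fit, same_coef = same_coef))
```

If both flags come back TRUE, the only difference between the two calls is which level the intercept stands for.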
>
> On Oct 21, 9:44 am, David Winsemius <dwinsem... at comcast.net> wrote:
>
> > On Oct 20, 2009, at 4:00 PM, Luciano La Sala wrote:
>
> > > Dear R-people,
>
> > > I am analyzing epidemiological data with GLMMs fitted using lmer (lme4
> > > package). I usually explore the assumption of linearity of continuous
> > > variables in the logit of the outcome by creating 4 categories of  
> > > the variable, performing a bivariate logistic regression, and then  
> > > plotting the coefficients of each category against their mid points.  
> > > That gives me a pretty good idea about the linearity assumption and  
> > > possible departures from it.
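For readers who have not seen the check described in the paragraph above, here is a minimal sketch of it on simulated data. Everything in it is an assumption for illustration: the covariate x, the simulated outcome y, the use of quartiles as the four cut points, and a plain glm() rather than a mixed model.

```r
set.seed(1)                                  # assumed seed
n <- 2000
x <- runif(n, 0, 10)                         # hypothetical continuous covariate
y <- rbinom(n, 1, plogis(-2 + 0.4 * x))      # simulated outcome, linear in the logit

# Four categories, cut at the quartiles of x (one possible choice of cut points)
breaks <- quantile(x, probs = seq(0, 1, 0.25))
x_cat  <- cut(x, breaks = breaks, include.lowest = TRUE)
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # category midpoints

fit <- glm(y ~ x_cat, family = binomial)

# Log-odds of each category relative to the first (0 for the baseline itself)
log_odds <- c(0, unname(coef(fit)[-1]))

plot(mids, log_odds, type = "b",
     xlab = "Category midpoint", ylab = "Log-odds relative to first category")
```

A roughly straight line in the plot is what one would hope to see if the logit is linear in x; with only four points it is a coarse diagnostic.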
>
> > > I know of people who create 0/1 dummy variables in order to relax
> > > the linearity assumption. However, I've read that dummy variables
> > > are never needed (nor desirable) in R! Instead, one should use
> > > factor variables, which are much easier to work with than dummy
> > > variables; the model itself will create the necessary dummy
> > > variables.
>
> > > Having said that, if my data violate the linearity assumption, does
> > > using a factor for the variable in question help overcome the
> > > lack of linearity?
>
> > No. If done by dividing into a small number of categories after
> > looking at the data, it merely creates other (and probably more
> > severe) problems. If you are in the unusual (although desirable)
> > position of having a large number of events across the range of the
> > covariates in your data, you may be able to cut your variable into
> > quintiles or deciles and analyze the resulting factor, but the
> > preferred approach would be to fit a regression spline of sufficient
> > complexity.
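As a rough illustration of the regression-spline suggestion above, the sketch below fits a natural cubic spline with ns() from the splines package (which ships with R) and compares it to a straight-line term. The simulated data, the choice df = 4, and the use of glm() instead of a mixed model are all assumptions made to keep the example small; they are not taken from the thread.

```r
library(splines)                             # part of the standard R distribution

set.seed(1)                                  # assumed seed
n <- 2000
x <- runif(n, 0, 10)                         # hypothetical continuous covariate
# Simulated outcome that is deliberately non-linear (U-shaped) in the logit
y <- rbinom(n, 1, plogis(-2 + 0.15 * (x - 5)^2))

fit_linear <- glm(y ~ x, family = binomial)
fit_spline <- glm(y ~ ns(x, df = 4), family = binomial)   # df = 4 is an arbitrary choice

# The straight line is nested in the natural spline, so a likelihood-ratio
# test can indicate whether the extra flexibility is needed
lrt <- anova(fit_linear, fit_spline, test = "Chisq")
print(lrt)
```

The covariate stays continuous here, so no cut points have to be chosen after looking at the data.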
>
> > > Thanks in advance.
>
> > --
>
> > David Winsemius, MD
> > Heritage Laboratories
> > West Hartford, CT
>
> > ______________________________________________
> > R-h... at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.