[R] Multinomial Analysis

soeren.vogel at eawag.ch
Wed Dec 15 18:44:00 CET 2010


I want to analyse data with an unordered, multi-level outcome variable, y. I am asking for the appropriate method (or R procedure) to use for this analysis.

> N <- 500
> set.seed(1234)
> data0 <- data.frame(y = as.factor(sample(LETTERS[1:3], N, repl = T, 
+     prob = c(10, 12, 14))), x1 = sample(1:7, N, repl = T, prob = c(8, 
+     8, 9, 15, 9, 9, 8)), x2 = sample(1:7, N, repl = T, prob = 7:1), 
+     x3 = round(rnorm(N, 0, 1), 0) + 4)

In [1], the use of "mlogit" (from the package of the same name) is described. "mlogit" produces nice output, but for some reason it does not accept the "." operator in my formula. In addition, it requires me to reshape my data frame before running the analysis:

> library(mlogit)
> data1 <- mlogit.data(data0, varying = NULL, choice = "y", shape = "wide")
> summary(mod1 <- mlogit(y ~ 1 | ., data = data1, reflevel = "A"))

Error in terms.formula(object) : '.' in formula and no 'data' argument

> form2 <- paste("y", " ~ 1 | ", paste(names(data0)[-1], sep = "", 
+     collapse = " + "), sep = "", collapse = "")
> summary(mod2 <- mlogit(as.formula(form2), data = data1, reflevel = "A"))

Call:
mlogit(formula = as.formula(form2), data = data1, reflevel = "A", 
    method = "nr", print.level = 0)

Frequencies of alternatives:
    A     B     C 
0.282 0.338 0.380 

nr method
3 iterations, 0h:0m:0s 
g'(-H)^-1g = 8.8E-07 
gradient close to zero 

Coefficients :
         Estimate Std. Error t-value Pr(>|t|)
altB     0.269645   0.569943  0.4731   0.6361
altC     0.642654   0.551250  1.1658   0.2437
altB:x1  0.018484   0.061061  0.3027   0.7621
altC:x1  0.075489   0.059837  1.2616   0.2071
altB:x2 -0.102349   0.068194 -1.5008   0.1334
altC:x2 -0.031934   0.065491 -0.4876   0.6258
altB:x3  0.034394   0.115189  0.2986   0.7653
altC:x3 -0.137739   0.112124 -1.2284   0.2193

Log-Likelihood: -542.05
McFadden R^2:  0.0065888 
Likelihood ratio test : chisq = 7.1903 (p.value=0.30361)
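Since "." is rejected, the paste() construction above can be wrapped in a small helper that assembles the two-part formula from the column names. This is only a sketch; mlogit_form() is a hypothetical name, not part of the mlogit package:

```r
## Build an mlogit-style two-part formula "y ~ 1 | x1 + x2 + ..." from a data
## frame, using every column except the response as a predictor.
## (mlogit_form is a hypothetical helper, not part of mlogit.)
mlogit_form <- function(response, data) {
  rhs <- paste(setdiff(names(data), response), collapse = " + ")
  as.formula(paste(response, "~ 1 |", rhs))
}

## Tiny example data frame with the same column layout as data0 above.
d <- data.frame(y = factor(c("A", "B")), x1 = 1:2, x2 = 2:1, x3 = 0:1)
mlogit_form("y", d)
```

The resulting formula can then be passed to mlogit() together with the reshaped data, avoiding the manual paste()/collapse dance for wide data frames.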

This is no big problem if the data frame holds only a few variables; however, it can easily become confusing as the number of variables in the data frame increases. Alternatively, I can use the function "multinom" (package "nnet"), as described in [2] and [3]:

> library(nnet)
> summary(mod3 <- multinom(y ~ ., data = data0, na.action = na.omit, 
+     Hess = T, model = T))

# weights:  15 (8 variable)
initial  value 549.306144 
iter  10 value 542.075434
final  value 542.046320 
converged
Call:
multinom(formula = y ~ ., data = data0, na.action = na.omit, 
    Hess = T, model = T)

Coefficients:
  (Intercept)         x1          x2          x3
B   0.2696466 0.01848389 -0.10234898  0.03439345
C   0.6426552 0.07548893 -0.03193395 -0.13773888

Std. Errors:
  (Intercept)         x1         x2        x3
B   0.5699428 0.06106070 0.06819403 0.1151887
C   0.5512500 0.05983652 0.06549051 0.1121242

Residual Deviance: 1084.093 
AIC: 1100.093 

The estimates produced by this function are the same as in the "mlogit" analysis above, and in principle I should be able to calculate the z and fit statistics by hand. However, this seems more long-winded than using the "mlogit" function.
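For what it is worth, the by-hand calculation is only a couple of lines: the summary() of a multinom() fit exposes the coefficient and standard-error matrices, from which the usual Wald z statistics and two-sided p-values follow. A sketch, regenerating data0 exactly as at the top of this post (trace = FALSE merely silences the iteration log):

```r
## A sketch: recover Wald z statistics and two-sided p-values "by hand"
## from a multinom() fit. data0 is regenerated as at the top of the post.
library(nnet)

set.seed(1234)
N <- 500
data0 <- data.frame(
  y  = factor(sample(LETTERS[1:3], N, replace = TRUE, prob = c(10, 12, 14))),
  x1 = sample(1:7, N, replace = TRUE, prob = c(8, 8, 9, 15, 9, 9, 8)),
  x2 = sample(1:7, N, replace = TRUE, prob = 7:1),
  x3 = round(rnorm(N, 0, 1), 0) + 4
)

mod3 <- multinom(y ~ ., data = data0, Hess = TRUE, trace = FALSE)
s <- summary(mod3)

z <- s$coefficients / s$standard.errors  # Wald z statistics
p <- 2 * pnorm(-abs(z))                  # two-sided p-values

round(z, 3)
round(p, 3)
```

These are the same Wald statistics that mlogit() prints directly, just computed from the two matrices that summary.multinom() returns.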

Alternatively, I may describe my results in the following way: given that a person knows B and C (which is the case for all my cases), what drives some to use B, and what drives others to use C? Basically, I am wondering whether this analysis can be done with a set of two parallel binary logistic regressions, one describing the step from A to B and another describing the step from A to C, keeping the predictors the same in both regressions.
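For concreteness, that two-regression idea can be sketched with glm(): a B-vs-A fit and a C-vs-A fit, each restricted to the cases involving just those two outcome levels. This is an illustration under that assumption, not a recommendation; because each fit discards the third category, the estimates will be close to, but not identical to, the joint multinomial ones:

```r
## A sketch of the "two parallel binary logits" idea: B-vs-A and C-vs-A,
## each fitted only on the cases with those two outcome levels.
## data0 is regenerated as at the top of the post.
set.seed(1234)
N <- 500
data0 <- data.frame(
  y  = factor(sample(LETTERS[1:3], N, replace = TRUE, prob = c(10, 12, 14))),
  x1 = sample(1:7, N, replace = TRUE, prob = c(8, 8, 9, 15, 9, 9, 8)),
  x2 = sample(1:7, N, replace = TRUE, prob = 7:1),
  x3 = round(rnorm(N, 0, 1), 0) + 4
)

fit_AB <- glm(I(y == "B") ~ x1 + x2 + x3, family = binomial,
              data = subset(data0, y %in% c("A", "B")))
fit_AC <- glm(I(y == "C") ~ x1 + x2 + x3, family = binomial,
              data = subset(data0, y %in% c("A", "C")))

coef(fit_AB)  # compare with the B row of the multinom coefficients
coef(fit_AC)  # compare with the C row
```

Each binary fit uses less of the data than the joint model, which is the usual statistical objection to this approach (consistent but less efficient estimates).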

More generally, I am not sure whether every multinomial logistic model could not just as well be analysed with consecutive binary logistic regressions. In my opinion this would have two advantages: first, model statistics and fit measures are much better described for binary models than for multinomial ones (see [1] at the bottom), and second, binary models are much easier to communicate. But, analogously to multiple mean comparisons, I am wondering whether I would run into any traps, and I have no idea what those traps might look like.

I would greatly appreciate any advice.

Thanks, Sören

[1] http://www.ats.ucla.edu/stat/r/dae/mlogit.htm

[2] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer.

[3] http://www.stat.washington.edu/quinn/classes/536/S/multinomexample.html
