Packages that we will use for data prep, IRT-M estimation, and data visualization:
## Data prep:
library(tidyverse) # version: tidyverse_2.0.0
library(dplyr) #version: dplyr_1.1.4
library(stats) # base R package; version tied to your R installation
library(fastDummies) # version: fastDummies_1.7.3
library(reshape2) #version: reshape2_1.4.4
## IRT-M estimation:
#devtools::install_github("dasiegel/IRT-M")
library(IRTM) #version 1.00
## Results visualization:
library(ggplot2) # version: ggplot2_3.4.4
library(ggridges) #version: ggridges_0.5.6
library(RColorBrewer) #version: RColorBrewer_1.1-3
library(ggrepel) # version: ggrepel_0.9.5
In this document we walk through the use of the IRT-M package.
While the IRT-M framework is case agnostic, this vignette focuses on a single use case in which a hypothetical research team seeks empirical support for the hypothesis that anti-immigration attitudes in Europe are associated with perceptions of cultural, economic, and security threats. We illustrate data preparation, IRT-M estimation, visualization, and analysis with a small synthetic dataset (N = 3000) based on Eurobarometer 94.3. The real data can be accessed via: .
This example illustrates one of the strengths of the IRT-M model; namely, the (very common) situation in which researchers have a theoretical question and related data that do not directly address the substantive question of interest. In this case, we work through a research question derived from the literature on European attitudes towards immigration. The threat response hypothesis is supported by the literature but was not a specific focus of the 2020-2021 Eurobarometer wave. Consequently, the survey did not directly ask questions about the three threat dimensions of interest. However, the survey contained several questions adjacent to threat perception. We can use these questions to build an M-Matrix and estimate latent threat dimensions.
Regardless of the use case, there are three steps to preparing data for the IRT-M package:
gfortran is necessary to build the package. gfortran can be readily downloaded, and the installed version can be checked: on Windows, enter “gfortran --version” at the command line; on Mac, enter “which gfortran” in the terminal. After gfortran has been successfully installed, it is important to also have GCC (the GNU Compiler Collection) installed.
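The checks described above can be run directly from a terminal; a minimal POSIX-shell sketch (the guards simply print a notice instead of erroring when a compiler is absent):

```shell
# Report the installed gfortran version, or a notice if it is missing
command -v gfortran >/dev/null 2>&1 && gfortran --version || echo "gfortran not found"
# GCC should also be present
command -v gcc >/dev/null 2>&1 && gcc --version || echo "gcc not found"
```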
Formatting the Data
IRT-M requires that users convert categorical variables into a numeric format, with one response loaded into each input object (i.e., question). The most straightforward way to do this is to use a library, such as fastDummies, to expand the entire instrument into one-hot (binary) encoding. This is also a good opportunity to export an empty data frame with the list of question codes, for coding the M-Code object. Doing so reduces the likelihood of formatting mismatches that are tedious to fix. We have also found it worthwhile to insert a column of human-readable notes next to the question codes. This step adds some overhead but, much like commenting code in general, is invaluable for debugging and analysis.
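As a sketch of this workflow (the toy survey, dimension names, and output file name here are illustrative, not from the package):

```r
library(fastDummies)

# Toy survey with two categorical questions (illustrative data)
survey <- data.frame(qa1 = factor(c(1, 2, 3, 2)),
                     qa2 = factor(c(2, 2, 1, 3)))

# One-hot encode: every response level becomes its own binary column
Y <- dummy_cols(survey, remove_selected_columns = TRUE)

# Export an empty template for coding the M-Matrix,
# with a human-readable notes column beside the question codes
M_template <- data.frame(question = colnames(Y),
                         notes    = "",
                         D1 = NA, D2 = NA, D3 = NA)
write.csv(M_template, "M_template.csv", row.names = FALSE)
```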
The core step in using the IRT-M package is to map the underlying theoretical constructs of interest into an ‘M-Matrix’ of dimension loadings. To develop the loadings, we go through every question in the “test” object (here: every question in the survey) and decide whether the question relates to one of our hypothesized theoretical dimensions. For those questions that we believe load on a theoretical dimension of interest, we code whether we expect them to load positively (meaning that an affirmative response implies more of the dimension) or negatively. We code a positive loading as \(1\) in the M-Matrix and a negative loading as \(-1\). We can also denote that a question has no relationship with the theoretical dimension of interest, in which case we code \(0\) for that dimension. If we are unsure, leaving the entry as \(NA\) lets the model learn the loading during the estimation step.
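For instance, a hand-coded M-Matrix for the three threat dimensions might look like the following (the question codes are hypothetical placeholders):

```r
# 1 = positive loading, -1 = negative loading, 0 = no relationship,
# NA = unsure; the model learns NA loadings during estimation
M <- data.frame(
  question = c("qa5_1", "qa5_2", "qa6_3"),  # hypothetical question codes
  cultural = c( 1, -1, NA),
  economic = c( 0,  1, NA),
  security = c(NA,  0,  1)
)
```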
To begin, we reformat the data so that each possible answer becomes a separate binary question (one-hot encoding). In preparing the data, we use the dummy_cols() utility from the fastDummies package. Finally, we rename the new binary dataframe Y to underscore that this is the observed data that we will be modeling. Please ensure that the dataPath variable is adjusted for your local file structure.
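A minimal sketch of this step (the file name assigned to dataPath is a placeholder for your local copy of the data):

```r
library(fastDummies)

dataPath <- "data/synthetic_eurobarometer.csv"  # adjust for your machine
raw <- read.csv(dataPath, stringsAsFactors = TRUE)

# One-hot encode every categorical question and drop the originals,
# keeping only the binary indicator columns as the observed data Y
Y <- dummy_cols(raw, remove_selected_columns = TRUE)
```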