--- title: "Incorporating Bayesian Statistics" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Incorporating Bayesian Statistics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette shows how to incorporate Bayesian modeling using the package [brms](https://paul-buerkner.github.io/brms/) in specr. When we fit a model with `brms`, the package actually calls Rstan in the background, which, in turn, is an R interface to the statistical programming language `Stan`. Stan is built in the programming language C++ and models have to be compiled using C++ to be run. This is all taken care of by brms, so you just need to run `brm(…)` to fit any model. `brms` uses a syntax for specifying model formulae that is based on the syntax of the commonly known lme4 package (for multilevel modeling). The comparatively easy syntax of `brms` is then converted into Stan code automatically. That said, since the models have to be compiled in C++, you need to set up your computer so that it can actually use C++. This has to be done only once, before installing brms. Unfortunately, as almost always, how you install C++ depends on your operating system. Check [this vignette](https://github.com/masurp/workshop_bayes/blob/main/exercises/Exercise_02.md) to if you have problems installing brms. We are also loading the library `furrr` as we want to parallelize the computations. ```r library(tidyverse) library(specr) library(brms) library(broom.mixed) library(ggridges) ``` ## Setting up custom functions Although you can simply pass the function `brm` to `setup` and run `specr()` with `workers = 1` without any specific custum function, it usually make sense to set up some custom model fitting and extraction functions to make sure `specr()` does exactly what we want. For example, I would suggest to create a custom model fitting function and specify relevant parameters for the Bayesian models. Here, I switch off parallel processing as I want to do this via `specr()` (i.e., setting `cores = 1` in `brm`). I also reduce the number of chains to 1 to speed up the fitting process (but we can bump this up if we want). This custom function also allows me to load `brms` itself and `broom.mixed`, which provides the tidy function to extract parameters from the `brm.fit` object. Furthermore, I also create a custom fit extract function. Most importantly, I add the full `brms.fit` object to the `glance` output. This way, we can later access the entire object in order to e.g., extract posterior distributions or other aspects of the Bayesian models. 
```r
# Customized fitting function; not necessary if standard values are to be used
brm_new <- function(formula, data) {
  require(brms)      # make sure brms is available (also on parallel workers)
  brm(formula, data,
      cores = 1,     # no within-model parallelization; specr() handles that
      silent = 2,    # don't print progress
      refresh = 0,   # remove message (not necessary, but nicer in the output)
      iter = 1000)   # I just set this here to reduce computing time, can be higher, of course
}

# New fun2, the fit extraction function
glance_brm <- function(x) {
  require(broom.mixed)        # provides the glance() method for brmsfit objects
  fit2 <- broom::glance(x)
  fit2$full_model <- list(x)  # add full model
  return(fit2)
}

# Setting up specifications
specs <- setup(data = example_data,
               x = c("x1", "x2", "x3"),
               y = c("y1", "y2"),
               model = "brm_new",
               controls = c("c1", "c2"),
               fun2 = glance_brm)

# Check specs
summary(specs)
#> Setup for the Specification Curve Analysis
#> -------------------------------------------
#> Class: specr.setup -- version: 1.0.0
#> Number of specifications: 24
#> 
#> Specifications:
#> 
#>   Independent variable:     x1, x2, x3
#>   Dependent variable:       y1, y2
#>   Models:                   brm_new
#>   Covariates:               no covariates, c1, c2, c1 + c2
#>   Subsets analyses:         all
#> 
#> Function used to extract parameters:
#> 
#>   function(x) broom::tidy(x, conf.int = TRUE)
#> 
#> 
#> Head of specifications table (first 6 rows):
#> # A tibble: 6 × 6
#>   x     y     model   controls      subsets formula          
#>   <chr> <chr> <chr>   <chr>         <chr>   <glue>           
#> 1 x1    y1    brm_new no covariates all     y1 ~ x1 + 1      
#> 2 x1    y1    brm_new c1            all     y1 ~ x1 + c1     
#> 3 x1    y1    brm_new c2            all     y1 ~ x1 + c2     
#> 4 x1    y1    brm_new c1 + c2       all     y1 ~ x1 + c1 + c2
#> 5 x1    y2    brm_new no covariates all     y2 ~ x1 + 1      
#> 6 x1    y2    brm_new c1            all     y2 ~ x1 + c1     
```

## Estimating the models

Because we are estimating 24 Bayesian models, this may take a while. Here, the models are fitted sequentially; a sketch of how to distribute the specifications across several cores follows at the end of this section.

```r
results <- specr(specs)
```

Now we can again summarize and plot our results.

```r
summary(results)
#> Results of the specification curve analysis
#> -------------------
#> Technical details:
#> 
#>   Class:                        specr.object -- version: 1.0.0
#>   Cores used:                   1
#>   Duration of fitting process:  9.25 sec elapsed
#>   Number of specifications:     24
#> 
#> Descriptive summary of the specification curve:
#> 
#>  median  mad   min  max   q25  q75
#>    0.16 0.25 -0.34 0.62 -0.07 0.25
#> 
#> Descriptive summary of sample sizes: 
#> 
#>  median  min  max
#>    1000 1000 1000
#> 
#> Head of the specification results (first 6 rows):
#> 
#> # A tibble: 6 × 18
#>   x     y     model   controls      subsets formula       effect component group estimate std.error
#>   <chr> <chr> <chr>   <chr>         <chr>   <glue>        <chr>  <chr>     <chr>    <dbl>     <dbl>
#> 1 x1    y1    brm_new no covariates all     y1 ~ x1 + 1   fixed  cond      <NA>      0.62      0.04
#> 2 x1    y1    brm_new c1            all     y1 ~ x1 + c1  fixed  cond      <NA>      0.6       0.04
#> 3 x1    y1    brm_new c2            all     y1 ~ x1 + c2  fixed  cond      <NA>      0.61      0.04
#> 4 x1    y1    brm_new c1 + c2       all     y1 ~ x1 + c…  fixed  cond      <NA>      0.59      0.04
#> 5 x1    y2    brm_new no covariates all     y2 ~ x1 + 1   fixed  cond      <NA>     -0.33      0.04
#> 6 x1    y2    brm_new c1            all     y2 ~ x1 + c1  fixed  cond      <NA>     -0.34      0.04
#> # … with 7 more variables: conf.low <dbl>, conf.high <dbl>, fit_algorithm <chr>,
#> #   fit_pss <dbl>, fit_nobs <int>, fit_sigma <dbl>, fit_full_model <list>

plot(results, choices = c("x", "y", "controls"))
```

*(Plot: specification curve across the choices of x, y, and controls.)*

Again, the coefficients are median estimates from the posterior, and the intervals are the respective 95% HDIs.

## Inspecting specific models

Because we kept the entire brms models in our results data set, we can explore specific models using the `pull()` function.
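Before we do, one note on runtime: as mentioned above, the fitting process can be parallelized across specifications rather than within `brm`, which is where `furrr` comes into play. A minimal sketch, assuming a machine with at least four available cores (see `?specr` for the details of the `workers` argument):

```r
# Hypothetical parallel run: each worker fits a subset of the 24 specifications.
# The require() calls inside brm_new() and glance_brm() ensure that brms and
# broom.mixed are loaded on every worker.
results <- specr(specs, workers = 4)
```

With that noted, let's pull the full model objects out of the results.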
```r
# Pull the full models from the list column
models <- results %>%
  as_tibble %>%
  pull(fit_full_model)

# Summarize, e.g., the first model
summary(models[[1]])
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: y1 ~ x1 + 1 
#>    Data: data (Number of observations: 1000) 
#>   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
#>          total post-warmup draws = 2000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept    -1.10      0.04    -1.17    -1.02 1.00     1915     1356
#> x1            0.62      0.04     0.55     0.69 1.00     1900     1631
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     1.14      0.03     1.09     1.18 1.00     2903     1542
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
```

From these model fit objects, we can also extract the posterior distributions for the parameter of interest and store them in one data frame.

```r
posteriors <- results %>%
  as_tibble %>%
  pull(fit_full_model) %>%
  map(function(x) as_tibble(as_draws_df(x))[, 2] %>%  # second column = coefficient of the focal predictor
        gather(key, value)) %>%
  bind_rows(.id = "id")

# First 10 draws from the first specification (predictor `x1`)
head(posteriors, n = 10)
#> # A tibble: 10 × 3
#>    id    key   value
#>    <chr> <chr> <dbl>
#>  1 1     b_x1  0.642
#>  2 1     b_x1  0.655
#>  3 1     b_x1  0.647
#>  4 1     b_x1  0.650
#>  5 1     b_x1  0.611
#>  6 1     b_x1  0.736
#>  7 1     b_x1  0.565
#>  8 1     b_x1  0.624
#>  9 1     b_x1  0.602
#> 10 1     b_x1  0.611
```

## Plotting posterior distributions for all specifications

Using the `ggridges` package, we can plot the posterior distributions of the different specifications.

```r
posteriors %>%
  mutate(id = factor(id, levels = 1:24)) %>%
  ggplot(aes(x = value, y = id, fill = key)) +
  geom_density_ridges() +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "Posterior", y = "Specifications", fill = "")
```

*(Plot: ridgeline plot of the posterior distributions, one ridge per specification.)*

Of course, the posterior distributions are not sorted in any way. With a bit of data wrangling, we can create a specification curve consisting of posterior distributions.
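Before building the full figure, here is a quick sketch of how the reported estimates relate to these draws: we can compute the median and a central 95% interval per specification directly from the `posteriors` data frame (note that specr reports HDIs, which can differ slightly from the quantile intervals used here):

```r
# Median and central 95% quantile interval of each specification's posterior
posteriors %>%
  group_by(id) %>%
  summarize(median    = median(value),
            conf.low  = quantile(value, 0.025),
            conf.high = quantile(value, 0.975))
```

Now to the sorted specification curve itself.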
```r
# First panel: posterior distributions, sorted by the median estimate
p1 <- results %>%
  as_tibble %>%
  mutate(id = as.character(1:nrow(.))) %>%
  arrange(estimate) %>%
  mutate(specifications = factor(as.character(1:nrow(.)), levels = c(1:24))) %>%
  left_join(posteriors) %>%
  ggplot(aes(x = value, y = specifications)) +
  stat_density_ridges(quantile_lines = TRUE,
                      quantiles = c(0.025, 0.975),
                      alpha = .5,
                      fill = "lightblue",
                      color = "black") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Posterior", y = "") +
  theme(strip.text = element_blank(),
        axis.line = element_line("black", size = .5),
        legend.position = "none",
        panel.spacing = unit(.75, "lines"),
        axis.text = element_text(colour = "black"))

# Second panel: the analytical choices behind each specification
p2 <- results %>%
  as_tibble %>%
  arrange(estimate) %>%
  mutate(specifications = factor(as.character(1:nrow(.)), levels = c(1:24))) %>%
  gather(key, value, c("x", "y", "controls")) %>%
  mutate(key = factor(key, levels = c("x", "y", "controls"))) %>%
  ggplot() +
  geom_point(aes(x = specifications, y = value), shape = 124, size = 3.35) +
  theme_minimal() +
  facet_grid(.data$key ~ 1, scales = "free_y", space = "free_y") +
  theme(axis.line = element_line("black", size = .5),
        legend.position = "none",
        panel.spacing = unit(.75, "lines"),
        axis.text = element_text(colour = "black"),
        strip.text.x = element_blank()) +
  labs(x = "", y = "")

# Combine both panels (plot_grid() comes from the cowplot package)
plot_grid(p1, p2,
          ncol = 1,
          align = "hv",
          axis = "tblr",
          rel_heights = c(3, 2))
```

*(Plot: specification curve made of posterior distributions, with the analytical choices shown in the panel below.)*
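Finally, because the full `brmsfit` objects are stored, convergence can be checked across all specifications at once. A minimal sketch using `brms::rhat()` and the `models` list created above:

```r
# Largest Rhat per specification; values close to 1 suggest converged chains
map_dbl(models, ~ max(rhat(.x), na.rm = TRUE))
```

This is worth checking before interpreting the curve, as a single non-converged specification could distort the overall picture.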