## tl;dr

When your MCMC chains look a mess, you might have to manually set your initial values. If you’re a fancy pants, you can use a custom function.

## Context

A collaborator asked me to help model some reaction-time data. One of the first steps was to decide on a reasonable likelihood function. You can see a productive Twitter thread on that process here. Although I’ve settled on the shifted-lognormal function, I also considered the exponentially modified Gaussian function (a.k.a. exGaussian). As it turns out, the exGaussian can be fussy to work with! After several frustrating attempts, I solved the problem by fiddling with my initial values. The purpose of this post is to highlight the issue and give you some options.

### I make assumptions.

• This post is for Bayesians. For thorough introductions to contemporary Bayesian regression, I recommend either edition of McElreath’s text (2020, 2015); Kruschke’s (2015) text; or Gelman, Hill, and Vehtari’s (2020) text.
• Though not necessary, it will help if you’re familiar with multilevel regression. The texts by McElreath and Kruschke, from above, can both help with that.
• All code is in R , with an emphasis on the brms package . We will also make good use of the tidyverse , the patchwork package , and ggmcmc . We will also use the lisa package to select the color palette for our figures.

Load the primary R packages and adjust the global plotting theme defaults.

# load
library(tidyverse)
library(brms)
library(patchwork)
library(ggmcmc)
library(lisa)

# define the color palette
fk <- lisa_palette("FridaKahlo", n = 31, type = "continuous")

# adjust the global plotting theme
theme_set(
theme_gray(base_size = 13) +
theme(
text = element_text(family = "Times", color = fk),
axis.text = element_text(family = "Times", color = fk),
axis.ticks = element_line(color = fk),
legend.background = element_blank(),
legend.box.background = element_blank(),
legend.key = element_blank(),
panel.background = element_rect(fill = alpha(fk, 1/4), color = "transparent"),
panel.grid = element_blank(),
plot.background = element_rect(fill = alpha(fk, 1/4), color = "transparent")
)
)

The color palette in this post is inspired by Frida Kahlo’s Self-Portrait with Thorn Necklace and Hummingbird.

## We need data

I’m not at liberty to share the original data. However, I have simulated a new data set that has the essential features of the original and I have saved the file on GitHub. You can load it like this.

load(url("https://github.com/ASKurz/blogdown/raw/main/content/post/2021-06-05-don-t-forget-your-inits/data/dat.rda?raw=true"))

# what is this?
glimpse(dat)
## Rows: 29,281
## Columns: 2
## $id <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a… ##$ rt <dbl> 689.0489, 552.8998, 901.0891, 992.2104, 1218.2256, 1356.5888, 679.0385, 663.7340, 771.3938, 996.2…

Our primary variable of interest is rt, which is simulated reaction times in milliseconds. The reaction times are nested within 26 participants, who are denoted by the id column. The data are not balanced.

dat %>%
count(id, name = "trials") %>%
count(trials)
## # A tibble: 7 x 2
##   trials     n
##    <int> <int>
## 1    320     2
## 2    640     4
## 3    960     3
## 4   1121     1
## 5   1280    14
## 6   1600     1
## 7   2560     1

Whereas most participants have 1,280 trials, their numbers range from 320 to 2,560, which means we’ll want a multilevel model.

## We can describe the data with the exGaussian function

To start getting a sense of the rt data, we’ll make a density plot of the overall distribution.

dat %>%
ggplot(aes(x = rt)) +
geom_density(fill = fk, color = fk) As is typical of reaction times, the data are continuous, non-negative, and strongly skewed to the right. There are any number of likelihood functions one can use to model data of this kind. One popular choice is the exGaussian. The exGaussian distribution has three parameters: $$\mu$$, $$\sigma$$, and $$\beta$$1. The $$\mu$$ and $$\sigma$$ parameters govern the mean and standard deviation for the central Gaussian portion of the distribution. The $$\beta$$ parameter governs the rate of the exponential distribution, which is tacked on to the right-hand side of the distribution. Within R, you can compute the density of various exGaussian distributions using the brms::dexgaussian() function. If you fool around with the parameter settings, a bit, you can make an exGaussian curve that fits pretty closely to the shape of our rt data. For example, here’s what it looks like when we set mu = 1300, sigma = 150, and beta = 520.

tibble(rt = seq(from = 0, to = 5500, length.out = 300),
d = dexgaussian(rt, mu = 1300, sigma = 150, beta = 520)) %>%

ggplot(aes(x = rt)) +
geom_density(data = dat,
fill = fk, color = fk) +
geom_line(aes(y = d),
color = fk, size = 5/4) +
# zoom in on the bulk of the values
coord_cartesian(xlim = c(0, 5000)) The fit isn’t perfect, but it gives a sense of where things are headed. It’s time to talk about modeling.

## Models

In this post, we will explore three options for modeling the reaction-time data. The first will use default options. The second option will employ manually-set starting points. For the third option, we will use pseudorandom number generators to define the starting points, all within a custom function.

### Model 1: Use the exGaussian with default settings.

When using brms, you can fit an exGaussian model by setting family = exgaussian(). Here we’ll allow the $$\mu$$ parameters to vary by participant, but keep the $$\sigma$$ and $$\beta$$ parameters fixed.

fit1 <- brm(
data = dat,
family = exgaussian(),
formula = rt ~ 1 + (1 | id),
cores = 4, seed = 1
)

I’m not going to show them all, here, for the sake of space, but this model returned warnings about 604 transitions, 1 chain for which the estimated Bayesian Fraction of Missing Information was low, a large R-hat value of 2.85, and low bulk and tail effective sample sizes. In other words, this was a disaster. To help bring these all into focus, we’ll want to take a look at the chains in a trace plot. Since we’ll be doing this a few times, let’s go ahead and make a custom trace plot geom to suit our purposes. We’ll call it geom_trace().

geom_trace <- function(subtitle = NULL,
xlab = "iteration",
xbreaks = 0:4 * 500) {

list(
annotate(geom = "rect",
xmin = 0, xmax = 1000, ymin = -Inf, ymax = Inf,
fill = fk, alpha = 1/2, size = 0),
geom_line(size = 1/3),
scale_color_manual(values = fk[c(3, 8, 27, 31)]),
scale_x_continuous(xlab, breaks = xbreaks, expand = c(0, 0)),
labs(subtitle = subtitle),
theme(panel.grid = element_blank())
)

}

For your real-world models, it’s good to look at the tract plots for all major model parameters. Here we’ll just focus on the $$\mu$$ intercept.

p1 <- ggs(fit1, burnin = TRUE) %>%
filter(Parameter == "b_Intercept") %>%
mutate(chain = factor(Chain),
intercept = value) %>%

ggplot(aes(x = Iteration, y = intercept, color = chain)) +
geom_trace(subtitle = "fit1 (default settings)") +
scale_y_continuous(breaks = c(0, 650, 1300), limits = c(NA, 1430))

p1 Since we pulled the chains using the ggmcmc::ggs() function, we were able to plot the warmup iterations (darker beige background on the left) along with the post-warmup iterations (lighter beige background on the right). Although one of our chains eventually made its way to the posterior, three out of the four stagnated near their starting values. This brings us to a major point in this post: Starting points can be a big deal.

### Starting points can be a big deal.

I’m not going to go into the theory underlying Markov chain Monte Carlo (MCMC) methods in any detail. For that, check out some of the latter chapters in Gill (2015) or . In brief, if you run a Markov chain for an infinite number of iterations, it will converge on the correct posterior distribution. The problem is we can’t run our chains for that long, which means we have to be careful about whether our finite-length chains have converged properly. Starting points are one of the factors that can influence this process.

One of the ways to help make sure your MCMC chains are sampling well is to run multiple chains for a while and check to see whether they have all converged around the same parameter space. Ideally, each chain will start from a different initial value. In practice, the first several iterations following the starting values are typically discarded. With older methods, like the Gibbs sampler, this was called the “burn-in” period. With Hamiltonian Monte Carlo (HMC), which is what brms uses, we have a similar period called “warmup.” When everything goes well, the MCMC chains will all have traversed from their starting values to sampling probabilistically from the posterior distribution once they have emerged from the warmup phase. However, this isn’t always the case. Sometimes the chains get stuck around their stating values and continue to linger there, even after you have terminated the warmup period. When this happens, you’ll end up with samples that are still tainted by their starting values and are not yet representative of the posterior distribution.

In our example, above, we used the brms default settings of four chains, each of which ran for 1,000 warmup iterations and then 1,000 post-warmup iterations. We also used the brms default for the starting values. These defaults are based on the Stan defaults, which is to randomly select the starting points from a uniform distribution ranging from -2 to 2. For details, see the Random initial values section of the Stan Reference Manual .

In my experience, the brms defaults are usually pretty good. My models often quickly traverse from their starting values to concentrate in the posterior, just like our second chain did, above. When things go wrong, sometimes adding stronger priors can work. Other times it makes sense to rescale or reparameterize the model, somehow. In this case, I have reasons to want to (a) use default priors and to (b) stick to the default parameterization applied to the transformed data. Happily, we have another trick at out disposal: We can adjust the starting points.

Within brms::brm(), we can control the starting values with the inits argument. The default is inits = "random", which follows the Stan convention of sampling from $$(-2, 2)$$, as discussed above. Another option is to fix all starting values to zero by setting inits = "0". This often works surprisingly well, but it wasn’t the solution in this case. If you look at the trace plot, above, you’ll see that all the starting values are a long ways from the target range, which is somewhere around 1,300. So why not just put the starting values near there?

### Model 2: Fit the model with initial values set by hand.

When you specify start values for the parameters in your Stan models, you need to do so with a list of lists. Each MCMC chain will need its own list. In our case, that means we’ll need four separate lists, each of which will be nested within a single higher-order list. For example, here we’ll define a single list called inits, which will have starting values defined for our primary three population-level parameters.

inits <- list(
Intercept = 1300,
sigma     = 150,
beta      = 520
)

# what is this?
inits
## $Intercept ##  1300 ## ##$sigma
##  150
##
## $beta ##  520 Notice that we didn’t bother setting a starting value for the standard-deviation parameter for the random intercepts. That parameter, then, will just get the brms default. The others will the the start values, as assigned. Now, since we have four chains to assign start values to, a quick and dirty method is to just use the same ones for all four chains. Here’s how to do that. list_of_inits <- list(inits, inits, inits, inits) Our list_of_inits object is a list into which we have saved four copies of our inits list. Here’s how to use those values within brms::brm(). Just plug them into the inits argument. fit2 <- brm( data = dat, family = exgaussian(), formula = rt ~ 1 + (1 | id), cores = 4, seed = 1, inits = list_of_inits ) The effective sample sizes are still a little low, but the major pathologies are now gone. Compare the updated traceplot for the intercept to the first one. # adjust fit1 p1 <- p1 + geom_trace(subtitle = "fit1 (default settings)", xlab = NULL, xbreaks = NULL) # fit2 p2 <- ggs(fit2) %>% filter(Parameter == "b_Intercept") %>% mutate(chain = factor(Chain), intercept = value) %>% ggplot(aes(x = Iteration, y = intercept, color = chain)) + geom_trace(subtitle = "fit2 (manual copy/paste inits settings)") + coord_cartesian(ylim = c(1200, 1400)) # combine p1 / p2 Man that looks better! See how all four of our chains started out at 1,300? That’s because of how we copy/pasted inits four times within our list_of_inits object. This is kinda okay, but we can do better. ### Model 3: Set the initial values with a custom function. Returning back to MCMC theory, a bit, it’s generally a better idea to assign each chain its own starting value. Then, if all chains converge into the same part in the parameter space, that provides more convincing evidence they’re all properly exploring the posterior. To be clear, this isn’t rigorous evidence. It’s just better evidence than if we started them all in the same spot. One way to give each chain its own starting value would be to manually set them. Here’s what that would look like if we were only working with two chains. # set the values for the first chain inits1 <- list( Intercept = 1250, sigma = 140, beta = 500 ) # set new values for the second chain inits2 <- list( Intercept = 1350, sigma = 160, beta = 540 ) # combine the two lists into a single list list_of_inits <- list(inits1, inits2) This approach will work fine, but it’s tedious, especially if you’d like to apply it to a large number of parameters. A more programmatic approach would be to use a pseudorandom number-generating function to randomly set the starting values. Since the intercept is an unbounded parameter, the posterior for which will often look Gaussian, the rnorm() function can be a great choice for selecting its starting values. Since both $$\sigma$$ and $$\beta$$ parameters need to be non-negative, a better choice might be the runif() or rgamma() functions. Here we’ll just use runif() for each. Since we’re talking about using the pseudorandom number generators to pick our values, it would be nice if the results were reproducible. We can do that by working in the set.seed() function. Finally, it would be really sweet if we had a way to wrap set.seed() and the various number-generating functions into a single higher-order function. Here’s one way to make such a function, which I’m calling set_inits(). set_inits <- function(seed = 1) { set.seed(seed) list( Intercept = rnorm(n = 1, mean = 1300, sd = 100), sigma = runif(n = 1, min = 100, max = 200), beta = runif(n = 1, min = 450, max = 550) ) } # try it out set_inits(seed = 0) ##$Intercept
##  1426.295
##
## $sigma ##  137.2124 ## ##$beta
##  507.2853

Notice how we set the parameters within the rnorm() and runif() functions to values that seemed reasonable given our model. These values aren’t magic and you could adjust them to your own needs. Now, here’s how to use our handy set_inits() function to choose similar, but distinct, starting values for each of our four chains. We save the results in a higher-order list called my_second_list_of_inits.

my_second_list_of_inits <- list(
# different seed values will return different results
set_inits(seed = 1),
set_inits(seed = 2),
set_inits(seed = 3),
set_inits(seed = 4)
)

# what have we done?
str(my_second_list_of_inits)
## List of 4
##  $:List of 3 ## ..$ Intercept: num 1237
##   ..$sigma : num 157 ## ..$ beta     : num 541
##  $:List of 3 ## ..$ Intercept: num 1210
##   ..$sigma : num 157 ## ..$ beta     : num 467
##  $:List of 3 ## ..$ Intercept: num 1204
##   ..$sigma : num 138 ## ..$ beta     : num 483
##  $:List of 3 ## ..$ Intercept: num 1322
##   ..$sigma : num 129 ## ..$ beta     : num 478

Now just plug my_second_list_of_inits into the inits argument and fit the model.

fit3 <- brm(
data = dat,
family = exgaussian(),
formula = rt ~ 1 + (1 | id),
cores = 4, seed = 1,
inits = my_second_list_of_inits
)

As with fit2, our fit3 came out okay. Let’s inspect the intercept parameter with a final trace plot.

# adjust fit2
p2 <- p2 +
geom_trace(subtitle = "fit2 (manual copy/paste inits settings)",
xlab = NULL, xbreaks = NULL)

# fit3
p3 <- ggs(fit3) %>%
filter(Parameter == "b_Intercept") %>%
mutate(chain = factor(Chain),
intercept = value) %>%

ggplot(aes(x = Iteration, y = intercept, color = chain)) +
geom_trace(subtitle = "fit3 (inits by a custom function)") +
coord_cartesian(ylim = c(1200, 1400))

# combine
p1 / p2 / p3 Now we have visual evidence that even though all four chains started at different places in the parameter space, they all converged into the same area. This still isn’t fully rigorous evidence our chains are performing properly, but it’s a major improvement from fit1 and a minor improvement from fit2. They aren’t shown here, but the same point holds for the $$\sigma$$ and $$\beta$$ parameters.

Okay, just for kicks and giggles, let’s see how well our last model did by way of a posterior predictive check.

bayesplot::color_scheme_set(fk[c(31, 31, 31, 3, 3, 3)])

pp_check(fit3, nsamples = 100) +
# we don't need to see the whole right tail
coord_cartesian(xlim = c(0, 5000)) The model could be better, but it’s moving in the right direction and there don’t appear to be any major pathologies, like what we saw with fit1.

## Recap

• If your try to fit a model with MCMC, you may sometimes end up with pathologies, such as divergent transitions, large numbers of transitions, high R-hat values, and/or very low effective sample size estimates.
• Sometimes these pathologies arise when the starting values for your chains are far away from the centers of their posterior densities.
• When using brms, you can solve this problem by setting the starting values with the inits argument.
• One approach is to manually set the starting values, saving them in a list of lists.
• Another approach is to use the pseudorandom number generators, such as rnorm() and runif(), to assign starting values within user-defined ranges.

## Session info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
##  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
##  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##   lisa_0.1.2      ggmcmc_1.5.1.1  patchwork_1.1.1 brms_2.15.0     Rcpp_1.0.6      forcats_0.5.1
##   stringr_1.4.0   dplyr_1.0.6     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.2
##  ggplot2_3.3.3   tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
##    readxl_1.3.1         backports_1.2.1      plyr_1.8.6           igraph_1.2.6         splines_4.0.4
##    crosstalk_1.1.0.1    TH.data_1.0-10       rstantools_2.1.1     inline_0.3.17        digest_0.6.27
##   htmltools_0.5.1.1    rsconnect_0.8.16     fansi_0.4.2          magrittr_2.0.1       modelr_0.1.8
##   RcppParallel_5.0.2   matrixStats_0.57.0   xts_0.12.1           sandwich_3.0-0       prettyunits_1.1.1
##   colorspace_2.0-0     rvest_0.3.6          haven_2.3.1          xfun_0.23            callr_3.7.0
##   crayon_1.4.1         jsonlite_1.7.2       lme4_1.1-25          survival_3.2-10      zoo_1.8-8
##   glue_1.4.2           gtable_0.3.0         emmeans_1.5.2-1      V8_3.4.0             pkgbuild_1.2.0
##   rstan_2.21.2         abind_1.4-5          scales_1.1.1         mvtnorm_1.1-1        GGally_2.1.1
##   DBI_1.1.0            miniUI_0.1.1.1       xtable_1.8-4         stats4_4.0.4         StanHeaders_2.21.0-7
##   DT_0.16              htmlwidgets_1.5.2    httr_1.4.2           threejs_0.3.3        RColorBrewer_1.1-2
##   ellipsis_0.3.2       farver_2.1.0         reshape_0.8.8        pkgconfig_2.0.3      loo_2.4.1
##   sass_0.3.1           dbplyr_2.0.0         utf8_1.2.1           labeling_0.4.2       tidyselect_1.1.1
##   rlang_0.4.11         reshape2_1.4.4       later_1.2.0          munsell_0.5.0        cellranger_1.1.0
##   tools_4.0.4          cli_2.5.0            generics_0.1.0       broom_0.7.6          ggridges_0.5.3
##   evaluate_0.14        fastmap_1.1.0        yaml_2.2.1           processx_3.5.2       knitr_1.33
##   fs_1.5.0             nlme_3.1-152         mime_0.10            projpred_2.0.2       xml2_1.3.2
##   rstudioapi_0.13      compiler_4.0.4       bayesplot_1.8.0      shinythemes_1.1.2    curl_4.3
##   gamm4_0.2-6          reprex_0.3.0         statmod_1.4.35       bslib_0.2.4          stringi_1.6.2
##   highr_0.9            ps_1.6.0             blogdown_1.3         Brobdingnag_1.2-6    lattice_0.20-41
##   Matrix_1.3-2         nloptr_1.2.2.2       markdown_1.1         shinyjs_2.0.0        vctrs_0.3.8
##  pillar_1.6.1         lifecycle_1.0.0      jquerylib_0.1.4      bridgesampling_1.0-0 estimability_1.3
##  httpuv_1.6.0         R6_2.5.0             bookdown_0.22        promises_1.2.0.1     gridExtra_2.3
##  codetools_0.2-18     boot_1.3-26          colourpicker_1.1.0   MASS_7.3-53          gtools_3.8.2
##  assertthat_0.2.1     withr_2.4.2          shinystan_2.5.0      multcomp_1.4-16      mgcv_1.8-33
##  parallel_4.0.4       hms_0.5.3            grid_4.0.4           coda_0.19-4          minqa_1.2.4
##  rmarkdown_2.8        shiny_1.6.0          lubridate_1.7.9.2    base64enc_0.1-3      dygraphs_1.1.1.6

1. There are different ways to parameterize the exGaussian distribution and these differences may involve different ways to express what we’re calling $$\beta$$. Since our parameterization is based on Paul Bürkner’s work, you might check out the Response time models section in his (2021) document, Parameterization of response distributions in brms.↩︎