# Robust Linear Regression with Student’s $t$-Distribution

[edited Nov 30, 2020]

The purpose of this post is to demonstrate the advantages of the Student’s \(t\)-distribution for regression with outliers, particularly within a Bayesian framework.

## I make assumptions

I’m presuming you are familiar with linear regression, familiar with the basic differences between frequentist and Bayesian approaches to fitting regression models, and have a sense that the issue of outlier values is a pickle worth contending with. All code in is **R**, with a heavy use of the **tidyverse** (Wickham, 2019; Wickham et al., 2019), about which you might learn a lot more from Grolemund & Wickham (2017), especially chapter 5. The Bayesian models are fit with Paul Bürkner’s (2017, 2018, 2020) **brms** package.

## The problem

Simple regression models typically use the Gaussian likelihood. Say you have some criterion variable \(y\), which you can reasonably describe with a mean \(\mu\) and standard deviation \(\sigma\). Further, you’d like to describe \(y\) with a predictor \(x\). Using the Gaussian likelihood, we can describe the model as

\[ \begin{align*} y_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \beta_0 + \beta_1 x_i. \end{align*} \]

With this formulation, we use \(x\) to model the mean of \(y\). The \(\beta_0\) parameter is the intercept of the regression model and \(\beta_1\) is its slope with respect to \(x\). After accounting for \(y\)’s relation with \(x\), the leftover variability in \(y\) is described by \(\sigma\), often called error or residual variance. The reason we describe the model in terms of \(\mu\) and \(\sigma\) is because those are the two parameters by which we define the Normal distribution, the Gaussian likelihood.

The Gaussian is a sensible default choice for many data types. You might say it works unreasonably well. Unfortunately, the normal (i.e., Gaussian) distribution is sensitive to outliers.

The normal distribution is a special case of Student’s \(t\)-distribution with the \(\nu\) parameter (i.e., the degree of freedom) set to infinity. However, when \(\nu\) is small, Student’s \(t\)-distribution is more robust to multivariate outliers. See Gelman & Hill (2006, Chapter 6), Kruschke (2015, Chapter 16), or McElreath (2020, Chapter 7) for textbook treatments on the topic.

In this post, we demonstrate how vulnerable the Gaussian likelihood is to outliers and then compare it to different ways of using Student’s \(t\)-likelihood for the same data.

First, we’ll get a sense of the distributions with a plot.

```
library(tidyverse)
tibble(x = seq(from = -6, to = 6, by = .01)) %>%
expand(x, nu = c(1, 2.5, 5, 10, Inf)) %>%
mutate(density = dt(x = x, df = nu),
nu = factor(nu, levels = c("Inf", "10", "5", "2.5", "1"))) %>%
ggplot(aes(x = x, y = density, group = nu, color = nu)) +
geom_line() +
scale_color_viridis_d(expression(nu),
direction = 1, option = "C", end = .85) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(-5, 5)) +
xlab(NULL) +
theme(panel.grid = element_blank())
```

So the difference is that a Student’s \(t\)-distribution with a low \(\nu\) will have notably heavier tails than the conventional Gaussian distribution. It’s easiest to see the difference when \(\nu\) approaches 1. Even then, the difference can be subtle when looking at a plot. Another way is to compare how probable relatively extreme values are in a Student’s \(t\)-distribution relative to the Gaussian. For the sake of demonstration, here we’ll compare Gauss with Student’s \(t\) with a \(\nu\) of 5. In the plot above, they are clearly different, but not shockingly so. However, that difference is very notable in the tails.

Let’s look more closely with a table. Below, we compare the probability of a given \(z\)-score or lower within the Gaussian and a \(\nu = 5\) Student’s \(t\). In the rightmost column, we compare the probabilities in a ratio.

```
# here we pic our nu
nu <- 5
tibble(z_score = 0:-5,
p_Gauss = pnorm(z_score, mean = 0, sd = 1),
p_Student_t = pt(z_score, df = nu),
`Student/Gauss ratio` = p_Student_t/p_Gauss) %>%
mutate_if(is.double, round, digits = 5) %>%
knitr::kable()
```

z_score | p_Gauss | p_Student_t | Student/Gauss ratio |
---|---|---|---|

0 | 0.50000 | 0.50000 | 1.00000 |

-1 | 0.15866 | 0.18161 | 1.14468 |

-2 | 0.02275 | 0.05097 | 2.24042 |

-3 | 0.00135 | 0.01505 | 11.14871 |

-4 | 0.00003 | 0.00516 | 162.97775 |

-5 | 0.00000 | 0.00205 | 7159.76534 |

Note how low \(z\)-scores are more probable in this Student’s \(t\) than in the Gaussian. This is most apparent in the `Student/Gauss ratio`

column on the right. A consequence of this is that extreme scores are less influential to your solutions when you use a small-\(\nu\) Student’s \(t\)-distribution in place of the Gaussian. That is, the small-\(\nu\) Student’s \(t\) is more robust than the Gaussian to unusual and otherwise influential observations.

In order to demonstrate, let’s simulate our own. We’ll start by creating multivariate normal data.

## Let’s create our initial tibble of well-behaved data, `d`

First, we’ll need to define our variance/covariance matrix.

```
s <- matrix(c(1, .6,
.6, 1),
nrow = 2, ncol = 2)
```

By the two `.6`

s on the off-diagonal positions, we indicated we’d like our two variables to have a correlation of .6.

Second, our variables also need means, which we’ll define with a mean vector.

`m <- c(0, 0)`

With means of `0`

and variances of `1`

, our data are in a standardized metric.

Third, we’ll use the `mvrnorm()`

function from the **MASS** package (Ripley, 2019) to simulate our data.

```
set.seed(3)
d <- MASS::mvrnorm(n = 100, mu = m, Sigma = s) %>%
as_tibble() %>%
rename(y = V1, x = V2)
```

The first few rows look like so:

`head(d)`

```
## # A tibble: 6 x 2
## y x
## <dbl> <dbl>
## 1 -1.14 -0.584
## 2 -0.0805 -0.443
## 3 -0.239 0.702
## 4 -1.30 -0.761
## 5 -0.280 0.630
## 6 -0.245 0.299
```

As an aside, check out this nice r-bloggers post for more information on simulating data with this method.

Anyway, this line reorders our data by `x`

, placing the smallest values on top.

```
d <-
d %>%
arrange(x)
head(d)
```

```
## # A tibble: 6 x 2
## y x
## <dbl> <dbl>
## 1 -2.21 -1.84
## 2 -1.27 -1.71
## 3 -0.168 -1.60
## 4 -0.292 -1.46
## 5 -0.785 -1.40
## 6 -0.157 -1.37
```

## Let’s create our outlier tibble, `o`

Here we’ll make two outlying and unduly influential values.

```
o <- d
o[c(1:2), 1] <- c(5, 4.5)
head(o)
```

```
## # A tibble: 6 x 2
## y x
## <dbl> <dbl>
## 1 5 -1.84
## 2 4.5 -1.71
## 3 -0.168 -1.60
## 4 -0.292 -1.46
## 5 -0.785 -1.40
## 6 -0.157 -1.37
```

With the code, above, we replaced the first two values of our first variable, `y`

. They both started out quite negative. Now they are positive values of a large magnitude within the standardized metric.

## Frequentist OLS models

To get a quick sense of what we’ve done, we’ll first fit two models with OLS regression via the `lm()`

function. The first model, `ols0`

, is of the multivariate normal data, `d`

. The second model, `ols1`

, is on the otherwise identical data with the two odd and influential values, `o`

. Here is our model code.

```
ols0 <- lm(data = d, y ~ 1 + x)
ols1 <- lm(data = o, y ~ 1 + x)
```

We’ll use the **broom** package (Robinson & Hayes, 2020) to assist with model summaries and other things. Here are the parameter estimates for the first model.

```
library(broom)
tidy(ols0) %>% mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.01 0.09 -0.08 0.94
## 2 x 0.45 0.1 4.55 0
```

And now the parameters for the second model, the one based on the `o`

outlier data.

`tidy(ols1) %>% mutate_if(is.double, round, digits = 2)`

```
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.13 0.11 1.15 0.25
## 2 x 0.14 0.13 1.1 0.27
```

Just two odd and influential values dramatically changed the model parameters, particularly the slope. Let’s plot the data and the models to get a visual sense of what happened.

```
# the well-behaved data
p1 <-
ggplot(data = d, aes(x = x, y = y)) +
stat_smooth(method = "lm", color = "grey92", fill = "grey67", alpha = 1, fullrange = T) +
geom_point(size = 1, alpha = 3/4) +
scale_x_continuous(limits = c(-4, 4)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-3, 5)) +
labs(title = "No Outliers") +
theme(panel.grid = element_blank())
# the data with two outliers
p2 <-
ggplot(data = o, aes(x = x, y = y, color = y > 3)) +
stat_smooth(method = "lm", color = "grey92", fill = "grey67", alpha = 1, fullrange = T) +
geom_point(size = 1, alpha = 3/4) +
scale_color_viridis_d(option = "A", end = 4/7) +
scale_x_continuous(limits = c(-4, 4)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-3, 5)) +
labs(title = "Two Outliers") +
theme(panel.grid = element_blank(),
legend.position = "none")
# combine the ggplots with patchwork syntax
library(patchwork)
p1 + p2
```

The two outliers were quite influential on the slope. It went from a nice clear diagonal to almost horizontal. You’ll also note how the 95% intervals (i.e., the bowtie shapes) were a bit wider when based on the `o`

data.

One of the popular ways to quantify outlier status is with Mahalanobis’ distance. However, the Mahalanobis distance is primarily valid for multivariate normal data. Though the data in this example are indeed multivariate normal–or at least they were before we injected two outlying values into them–I am going to resist relying on Mahalanobis’ distance. There are other more general approaches that will be of greater use when you need to explore other variants of the generalized linear model. The `broom::augment()`

function will give us access to one.

```
aug0 <- augment(ols0)
aug1 <- augment(ols1)
glimpse(aug1)
```

```
## Rows: 100
## Columns: 8
## $ y <dbl> 5.00000000, 4.50000000, -0.16783167, -0.29164105, -0.784918…
## $ x <dbl> -1.8439208, -1.7071418, -1.5996509, -1.4601550, -1.3954395,…
## $ .fitted <dbl> -0.129991947, -0.110805943, -0.095728191, -0.076161086, -0.…
## $ .resid <dbl> 5.12999195, 4.61080594, -0.07210348, -0.21547996, -0.717835…
## $ .hat <dbl> 0.05521164, 0.04881414, 0.04412882, 0.03849763, 0.03605748,…
## $ .sigma <dbl> 0.9887858, 1.0170749, 1.1246348, 1.1244384, 1.1222070, 1.12…
## $ .cooksd <dbl> 6.500952e-01, 4.580898e-01, 1.002809e-04, 7.721988e-04, 7.9…
## $ .std.resid <dbl> 4.71688666, 4.22522826, -0.06591171, -0.19639831, -0.653439…
```

Here we can compare the observations with Cook’s distance, \(D_i\) (i.e., `.cooksd`

). Cook’s \(D_i\) is a measure of the influence of a given observation on the model. To compute \(D_i\), the model is fit once for each \(n\) case, after first dropping that case. Then the difference in the model with all observations and the model with all observations but the \(i\)th observation, as defined by the Euclidean distance between the estimators. Fahrmeir et al (2013, p. 166) suggest that within the OLS framework “as a rule of thumb, observations with \(D_i > 0.5\) are worthy of attention, and observations with \(D_i > 1\) should always be examined.” Here we plot \(D_i\) against our observation index, \(i\), for both models.

```
bind_rows(
aug0 %>% mutate(i = 1:n()), # the well-behaved data
aug1 %>% mutate(i = 1:n()) # the data with two outliers
) %>%
mutate(fit = rep(c("fit b0", "fit b1"), each = n()/2)) %>%
ggplot(aes(x = i, y = .cooksd)) +
geom_hline(yintercept = .5, color = "white") +
geom_point(alpha = .5) +
geom_text(data = tibble(i = 46,
.cooksd = .53,
fit = "fit b0"),
label = "Fahrmeir et al said we might worry around here",
color = "grey50") +
coord_cartesian(ylim = c(0, .7)) +
theme(panel.grid = element_blank(),
axis.title.x = element_text(face = "italic", family = "Times")) +
facet_wrap(~ fit)
```

For the model of the well-behaved data, `ols0`

, we have \(D_i\) values all hovering near zero. However, the plot for `ols1`

shows one \(D_i\) value well above the 0.5 level and another not quite that high but deviant relative to the rest. Our two outlier values look quite influential for the results of `ols1`

.

## Switch to a Bayesian framework

It’s time to fire up **brms**, the package with which we’ll be fitting our Bayesian models. As with all Bayesian models, we’ll need to us use priors. To keep things simple, we’ll use weakly-regularizing priors of the sort discussed by the Stan team. For more thoughts on how to set priors, check out Kruschke’s (2015) text or either edition of McElreath’s text (2020, 2015).

`library(brms)`

### Stick with Gauss.

For our first two Bayesian models, `b0`

and `b1`

, we’ll use the conventional Gaussian likelihood (i.e., `family = gaussian`

in the `brm()`

function). Like with `ols0`

, above, the first model is based on the nice `d`

data. The second, `b1`

, is based on the more-difficult `o`

data.

```
b0 <-
brm(data = d,
family = gaussian,
y ~ 1 + x,
prior = c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(cauchy(0, 1), class = sigma)),
seed = 1)
b1 <-
update(b0,
newdata = o,
seed = 1)
```

Here are the model summaries.

`posterior_summary(b0)[1:3, ] %>% round(digits = 2)`

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -0.01 0.09 -0.18 0.16
## b_x 0.44 0.10 0.25 0.64
## sigma 0.86 0.06 0.75 0.99
```

`posterior_summary(b1)[1:3, ] %>% round(digits = 2)`

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 0.13 0.11 -0.09 0.35
## b_x 0.14 0.13 -0.11 0.40
## sigma 1.13 0.08 0.98 1.29
```

We summarized our model parameters with `brms::posterior_summary()`

rather than `broom::tid()`

. Otherwise, these should look familiar. They’re very much like the results from the OLS models. Hopefully this isn’t surprising. Our priors were quite weak, so there’s no reason to suspect the results would differ much.

#### The LOO and other goodies help with diagnostics.

With the `loo()`

function, we’ll extract loo objects, which contain some handy output.

```
loo_b0 <- loo(b0)
loo_b1 <- loo(b1)
```

We’ll use `str()`

to get a sense of what’s all in there, using `loo_b1`

as an example.

`str(loo_b1)`

```
## List of 10
## $ estimates : num [1:3, 1:2] -157.41 6.65 314.81 15.76 3.75 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "elpd_loo" "p_loo" "looic"
## .. ..$ : chr [1:2] "Estimate" "SE"
## $ pointwise : num [1:100, 1:5] -13.47 -10.79 -1.06 -1.08 -1.27 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:5] "elpd_loo" "mcse_elpd_loo" "p_loo" "looic" ...
## $ diagnostics:List of 2
## ..$ pareto_k: num [1:100] 0.8171 0.6003 0.0139 -0.0705 -0.0817 ...
## ..$ n_eff : num [1:100] 71.1 186.8 2553.2 2795.7 3845.7 ...
## $ psis_object: NULL
## $ elpd_loo : num -157
## $ p_loo : num 6.65
## $ looic : num 315
## $ se_elpd_loo: num 15.8
## $ se_p_loo : num 3.75
## $ se_looic : num 31.5
## - attr(*, "dims")= int [1:2] 4000 100
## - attr(*, "class")= chr [1:3] "psis_loo" "importance_sampling_loo" "loo"
## - attr(*, "yhash")= chr "b52ef230de67f0bebc3480da360987ee2c0f4de8"
## - attr(*, "model_name")= chr "b1"
```

For a detailed explanation of all those elements, see the **loo** reference manual (Vehtari et al., 2020). For our purposes, we’ll focus on the `pareto_k`

. Here’s a glimpse of what it contains for the `b1`

model.

`loo_b1$diagnostics$pareto_k %>% as_tibble()`

```
## # A tibble: 100 x 1
## value
## <dbl>
## 1 0.817
## 2 0.600
## 3 0.0139
## 4 -0.0705
## 5 -0.0817
## 6 -0.00611
## 7 0.0431
## 8 0.00514
## 9 0.111
## 10 0.0629
## # … with 90 more rows
```

We’ve got us a numeric vector of as many values as our data had observations–100 in this case. The `pareto_k`

values can be used to examine overly-influential cases. See, for example this discussion on stackoverflow.com in which several members of the Stan team weighed in. The issue is also discussed in Vehtari et al. (2017), in the **loo** reference manual, and in this presentation by Aki Vehtari, himself. If we explicitly open the **loo** package (Vehtari et al., 2019), we can use a few convenience functions to leverage `pareto_k`

for diagnostic purposes. The `pareto_k_table()`

function will categorize the `pareto_k`

values and give us a sense of how many values are in problematic ranges.

```
library(loo)
pareto_k_table(loo_b1)
```

```
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 98 98.0% 2439
## (0.5, 0.7] (ok) 1 1.0% 187
## (0.7, 1] (bad) 1 1.0% 71
## (1, Inf) (very bad) 0 0.0% <NA>
```

Happily, most of our cases were in the “good” range. One pesky case was in the “bad” range [can you guess which one?] and another case was only “ok” [and can you guess that one, too?]. The `pareto_k_ids()`

function will tell exactly us which cases we’ll want to look at.

`pareto_k_ids(loo_b1)`

`## [1] 1 2`

Those numbers correspond to the row numbers in the data, `o`

. These are exactly the cases that plagued our second OLS model, `fit1`

, and are also the ones we hand coded to be outliers. With the simple `plot()`

function, we can get a diagnostic plot for the `pareto_k`

values.

`plot(loo_b1)`

There they are, cases 1 and 2, lurking in the “bad” and “[just] ok” ranges. We can also make a similar plot with **ggplot2**. Though it takes a little more work, **ggplot2** makes it easy to compare `pareto_k`

plots across models with a little faceting.

```
# for the annotation
text <-
tibble(i = 1,
k = c(.45, .65, .95),
label = c("good", "[just] ok", "bad"),
fit = "fit b0")
# extract the diagnostics
tibble(k = c(loo_b0$diagnostics$pareto_k, loo_b1$diagnostics$pareto_k),
i = rep(1:100, times = 2),
fit = rep(str_c("fit b", 0:1), each = 100)) %>%
# plot!
ggplot(aes(x = i, y = k)) +
geom_hline(yintercept = c(.5, .7, 1), color = "white") +
geom_point(alpha = .5) +
geom_text(data = text,
aes(label = label),
color = "grey50", hjust = 0) +
scale_y_continuous(expression(Pareto~italic(k)), breaks = c(0, .5, .7, 1)) +
theme(panel.grid = element_blank(),
axis.title.x = element_text(face = "italic", family = "Times")) +
facet_wrap(~ fit)
```

So with `b0`

–the model based on the well-behaved multivariate normal data, `d`

–, all the `pareto_k`

values hovered around zero in the “good” range. Things got concerning with model `b1`

. But we know all that. Let’s move forward.

#### What do we do with those overly-influential outlying values?

A typical way to handle outlying values is to delete them based on some criterion, such as the Mahalanobis distance, Cook’s \(D_i\), or our new friend the `pareto_k`

. In our next two models, we’ll do that. In our `data`

arguments, we can use the `slice()`

function to omit cases. In model `b1.1`

, we simply omit the first and most influential case. In model `b1.2`

, we omitted both unduly-influential cases, the values from rows 1 and 2.

```
b1.1 <-
update(b1,
newdata = o %>% slice(2:100),
seed = 1)
b1.2 <-
update(b1,
newdata = o %>% slice(3:100),
seed = 1)
```

Here are the summaries for our models based on the `slice[d]`

data.

`posterior_summary(b1.1)[1:3, ] %>% round(digits = 2)`

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 0.07 0.10 -0.12 0.27
## b_x 0.27 0.12 0.04 0.50
## sigma 1.00 0.07 0.87 1.15
```

`posterior_summary(b1.2)[1:3, ] %>% round(digits = 2)`

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 0.01 0.09 -0.16 0.19
## b_x 0.39 0.10 0.19 0.59
## sigma 0.86 0.06 0.75 0.99
```

They are closer to the true data generating model (i.e., the code we used to make `d`

), especially `b1.2`

. However, there are other ways to handle the influential cases without dropping them. Finally, we’re ready to switch to Student’s \(t\)!

### Time to leave Gauss for the more general Student’s \(t\)

Recall that the normal distribution is equivalent to a Student’s \(t\) with the degrees of freedom parameter, \(\nu\), set to infinity. That is, \(\nu\) is fixed. Here we’ll relax that assumption and estimate \(\nu\) from the data just like we estimate \(\mu\) with the linear model and \(\sigma\) as the residual spread. Since \(\nu\)’s now a parameter, we’ll have to give it a prior. For our first Student’s \(t\) model, we’ll estimate \(\nu\) with the **brms** default `gamma(2, 0.1)`

prior.

```
b2 <-
brm(data = o, family = student,
y ~ 1 + x,
prior = c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(gamma(2, 0.1), class = nu),
prior(cauchy(0, 1), class = sigma)),
seed = 1)
```

For the next model, we’ll switch out that weak `gamma(2, 0.1)`

for a stronger `gamma(4, 1)`

. In some disciplines, the gamma distribution is something of an exotic bird. So before fitting the model, it might be useful to take a peek at what these gamma priors looks like. In the plot, below, the orange density in the background is the default `gamma(2, 0.1)`

and the purple density in the foreground is the stronger `gamma(4, 1)`

.

```
# data
tibble(x = seq(from = 0, to = 60, by = .1)) %>%
expand(x, nesting(alpha = c(2, 4),
beta = c(0.1, 1))) %>%
mutate(density = dgamma(x, alpha, beta),
group = rep(letters[1:2], times = n() / 2)) %>%
# plot
ggplot(aes(x = x, ymin = 0, ymax = density,
group = group, fill = group)) +
geom_ribbon(size = 0, alpha = 3/4) +
scale_fill_viridis_d(option = "B", direction = -1,
begin = 1/3, end = 2/3) +
scale_x_continuous(expand = expansion(mult = c(0, 0.05))) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 50)) +
theme(panel.grid = element_blank(),
legend.position = "none")
```

So the default prior is centered around values in the 2 to 30 range, but has a long gentle-sloping tail, allowing the model to yield much larger values for \(\nu\), as needed. The prior we use below is almost entirely concentrated in the single-digit range. In this case, that will preference Student’s \(t\) likelihoods with very small \(\nu\) parameters and correspondingly thick tails–easily allowing for extreme values.

```
b3 <-
update(b2,
prior = c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(gamma(4, 1), class = nu),
prior(cauchy(0, 1), class = sigma)),
seed = 1)
```

For our final model, we’ll fix the \(\nu\) parameter in a `bf()`

statement.

```
b4 <-
brm(data = o, family = student,
bf(y ~ 1 + x, nu = 4),
prior = c(prior(normal(0, 100), class = Intercept),
prior(normal(0, 10), class = b),
prior(cauchy(0, 1), class = sigma)),
seed = 1)
```

Now we’ve got all those models, we can gather their results into a single tibble.

```
b_estimates <-
tibble(model = c("b0", "b1", "b1.1", "b1.2", "b2", "b3", "b4")) %>%
mutate(fit = map(model, get)) %>%
mutate(posterior_summary = map(fit, ~posterior_summary(.) %>%
data.frame() %>%
rownames_to_column("term"))) %>%
unnest(posterior_summary) %>%
select(-fit) %>%
filter(term %in% c("b_Intercept", "b_x")) %>%
arrange(term)
```

To get a sense of what we’ve done, let’s take a peek at our models tibble.

```
b_estimates %>%
mutate_if(is.double, round, digits = 2) # this is just to round the numbers
```

```
## # A tibble: 14 x 6
## model term Estimate Est.Error Q2.5 Q97.5
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 b0 b_Intercept -0.01 0.09 -0.18 0.16
## 2 b1 b_Intercept 0.13 0.11 -0.09 0.35
## 3 b1.1 b_Intercept 0.07 0.1 -0.12 0.27
## 4 b1.2 b_Intercept 0.01 0.09 -0.16 0.19
## 5 b2 b_Intercept 0.04 0.09 -0.14 0.23
## 6 b3 b_Intercept 0.04 0.09 -0.14 0.22
## 7 b4 b_Intercept 0.04 0.09 -0.14 0.22
## 8 b0 b_x 0.44 0.1 0.25 0.64
## 9 b1 b_x 0.14 0.13 -0.11 0.4
## 10 b1.1 b_x 0.27 0.12 0.04 0.5
## 11 b1.2 b_x 0.39 0.1 0.19 0.59
## 12 b2 b_x 0.36 0.11 0.15 0.56
## 13 b3 b_x 0.36 0.1 0.16 0.56
## 14 b4 b_x 0.37 0.11 0.16 0.580
```

The models differ by their intercepts, slopes, sigmas, and \(\nu\)s. For the sake of this post, we’ll focus on the slopes. Here we compare the different Bayesian models’ slopes by their posterior means and 95% intervals in a coefficient plot.

```
b_estimates %>%
filter(term == "b_x") %>% # b_Intercept b_x
ggplot(aes(x = model)) +
geom_pointrange(aes(y = Estimate,
ymin = Q2.5,
ymax = Q97.5),
shape = 20) +
coord_flip(ylim = c(-.2, 1)) +
labs(title = "The x slope, varying by model",
subtitle = "The dots are the posterior means and the lines the percentile-based 95% intervals.",
x = NULL,
y = NULL) +
theme(panel.grid = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_text(hjust = 0))
```

You might think of the `b0`

slope as the “true” slope. That’s the one estimated from the well-behaved multivariate normal data, `d`

. That estimate’s just where we’d want it to be. The `b1`

slope is a disaster–way lower than the others. The slopes for `b1.1`

and `b1.2`

get better, but at the expense of deleting data. All three of our Student’s \(t\) models produced slopes that were pretty close to the `b0`

slope. They weren’t perfect, but, all in all, Student’s \(t\)-distribution did pretty okay.

### We need more LOO and more `pareto_k`

.

We already have loo objects for our first two models, `b0`

and `b1`

. Let’s get some for models `b2`

through `b4`

.

```
loo_b2 <- loo(b2)
loo_b3 <- loo(b3)
loo_b4 <- loo(b4)
```

With a little data wrangling, we can compare our models by how they look in our custom `pareto_k`

diagnostic plots.

```
# make a custom function to work with the loo objects in bulk
get_pareto_k <- function(l) {
l$diagnostics$pareto_k %>%
as_tibble() %>%
mutate(i = 1:n()) %>%
rename(pareto_k = value)
}
# wrangle
tibble(name = str_c("loo_b", 1:4)) %>%
mutate(loo_object = map(name, get)) %>%
mutate(pareto_k = map(loo_object, get_pareto_k)) %>%
unnest(pareto_k) %>%
mutate(fit = rep(c("fit b1", "fit b2", "fit b3", "fit b4"), each = n() / 4)) %>%
# plot
ggplot(aes(x = i, y = pareto_k)) +
geom_hline(yintercept = c(.5, .7),
color = "white") +
geom_point(alpha = .5) +
scale_y_continuous(expression(Pareto~italic(k)), breaks = c(0, .5, .7)) +
theme(panel.grid = element_blank(),
axis.title.x = element_text(face = "italic", family = "Times")) +
facet_wrap(~ fit)
```

Oh man, those Student’s \(t\) models worked sweet! In a succession from `b2`

through `b4`

, each model looked better by `pareto_k`

. All were way better than the typical Gaussian model, `b1`

. While we’re at it, we might compare those by their LOO values.

`loo_compare(loo_b1, loo_b2, loo_b3, loo_b4) %>% print(simplify = F)`

```
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## b4 0.0 0.0 -143.5 10.3 2.7 0.3 287.1 20.7
## b3 -0.8 0.4 -144.4 10.7 3.6 0.8 288.8 21.4
## b2 -1.9 1.8 -145.5 11.7 4.6 1.5 291.0 23.4
## b1 -13.9 7.6 -157.4 15.8 6.7 3.7 314.8 31.5
```

In terms of the LOO, `b2`

through `b4`

were about the same, but all looked better than `b1`

. In fairness, though, the standard errors for the difference scores were a bit on the wide side.
If you’re new to using information criteria to compare models, you might sit down and soak in one of McElreath’s lectures on the topic or the (2020) vignette by Vehtari and Gabry, *Using the loo package (version >= 2.0.0)*. For a more technical introduction, you might check out the references in the **loo** package’s reference manual.

For one final LOO-related comparison, we can use the `brms::model_weights()`

function to see how much relative weight we might put on each of those four models if we were to use a model averaging approach. Here we use the default method, which is model averaging via posterior predictive stacking.

`model_weights(b1, b2, b3, b4)`

```
## b1 b2 b3 b4
## 3.310561e-07 8.617446e-07 1.808979e-06 9.999970e-01
```

If you’re not a fan of scientific notation, just tack on `round(digits = 2)`

. The stacking method suggests that we should place virtually all the weight on `b4`

, the model in which we fixed our Student-\(t\) \(\nu\) parameter at 4. To learn more about model stacking, check out Yao, Vehtari, Simpson, and Gelman’s (2018) paper, *Using stacking to average Bayesian predictive distributions*.

### Let’s compare a few Bayesian models.

That’s enough with coefficients, `pareto_k`

, and the LOO. Let’s get a sense of the implications of the models by comparing a few in plots. Here we use convenience functions from Matthew Kay’s (2020) **tidybayes** package to streamline the data wrangling and plotting. The method came from a kind twitter suggesion from Kay.

```
library(tidybayes)
# these are the values of x we'd like model-implied summaries for
nd <- tibble(x = seq(from = -4, to = 4, length.out = 50))
# here's another way to arrange the models
list(b0 = b0, b1 = b1, b3 = b3) %>%
# with help from `tidybayes::add_fitted_draws()`, here we use `fitted()` in bulk
map_dfr(add_fitted_draws, newdata = nd, .id = "model") %>%
# plot
ggplot(aes(x = x)) +
stat_lineribbon(aes(y = .value),
.width = .95,
color = "grey92", fill = "grey67") +
geom_point(data = d %>%
bind_rows(o, o) %>%
mutate(model = rep(c("b0", "b1", "b3"), each = 100)),
aes(y = y, color = y > 3),
size = 1, alpha = 3/4) +
scale_color_viridis_d(option = "A", end = 4/7) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-3, 5)) +
ylab(NULL) +
theme(panel.grid = element_blank(),
legend.position = "none") +
facet_wrap(~ model)
```

For each subplot, the gray band is the 95% interval band and the overlapping light gray line is the posterior mean. Model `b0`

, recall, is our baseline comparison model. This is of the well-behaved no-outlier data, `d`

, using the good old Gaussian likelihood. Model `b1`

is of the outlier data, `o`

, but still using the non-robust Gaussian likelihood. Model `b3`

uses a robust Student’s \(t\) likelihood with \(\nu\) estimated with the fairly narrow `gamma(4, 1)`

prior. For my money, `b3`

did a pretty good job.

`sessionInfo()`

```
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidybayes_2.3.1 loo_2.4.1 brms_2.15.0 Rcpp_1.0.6
## [5] patchwork_1.1.1 broom_0.7.5 forcats_0.5.1 stringr_1.4.0
## [9] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [13] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 plyr_1.8.6
## [4] igraph_1.2.6 svUnit_1.0.3 splines_4.0.4
## [7] crosstalk_1.1.0.1 TH.data_1.0-10 rstantools_2.1.1
## [10] inline_0.3.17 digest_0.6.27 htmltools_0.5.1.1
## [13] rsconnect_0.8.16 fansi_0.4.2 checkmate_2.0.0
## [16] magrittr_2.0.1 modelr_0.1.8 RcppParallel_5.0.2
## [19] matrixStats_0.57.0 xts_0.12.1 sandwich_3.0-0
## [22] prettyunits_1.1.1 colorspace_2.0-0 rvest_0.3.6
## [25] ggdist_2.4.0.9000 haven_2.3.1 xfun_0.22
## [28] callr_3.5.1 crayon_1.4.1 jsonlite_1.7.2
## [31] lme4_1.1-25 survival_3.2-10 zoo_1.8-8
## [34] glue_1.4.2 gtable_0.3.0 emmeans_1.5.2-1
## [37] V8_3.4.0 distributional_0.2.2 pkgbuild_1.2.0
## [40] rstan_2.21.2 abind_1.4-5 scales_1.1.1
## [43] mvtnorm_1.1-1 DBI_1.1.0 miniUI_0.1.1.1
## [46] viridisLite_0.3.0 xtable_1.8-4 stats4_4.0.4
## [49] StanHeaders_2.21.0-7 DT_0.16 htmlwidgets_1.5.2
## [52] httr_1.4.2 threejs_0.3.3 arrayhelpers_1.1-0
## [55] ellipsis_0.3.1 pkgconfig_2.0.3 farver_2.0.3
## [58] dbplyr_2.0.0 utf8_1.1.4 tidyselect_1.1.0
## [61] labeling_0.4.2 rlang_0.4.10 reshape2_1.4.4
## [64] later_1.1.0.1 munsell_0.5.0 cellranger_1.1.0
## [67] tools_4.0.4 cli_2.3.1 generics_0.1.0
## [70] ggridges_0.5.2 evaluate_0.14 fastmap_1.0.1
## [73] yaml_2.2.1 processx_3.4.5 knitr_1.31
## [76] fs_1.5.0 nlme_3.1-152 mime_0.10
## [79] projpred_2.0.2 xml2_1.3.2 compiler_4.0.4
## [82] bayesplot_1.8.0 shinythemes_1.1.2 rstudioapi_0.13
## [85] gamm4_0.2-6 curl_4.3 reprex_0.3.0
## [88] statmod_1.4.35 stringi_1.5.3 highr_0.8
## [91] ps_1.6.0 blogdown_1.3 Brobdingnag_1.2-6
## [94] lattice_0.20-41 Matrix_1.3-2 nloptr_1.2.2.2
## [97] markdown_1.1 shinyjs_2.0.0 vctrs_0.3.6
## [100] pillar_1.5.1 lifecycle_1.0.0 bridgesampling_1.0-0
## [103] estimability_1.3 httpuv_1.5.4 R6_2.5.0
## [106] bookdown_0.21 promises_1.1.1 gridExtra_2.3
## [109] codetools_0.2-18 boot_1.3-26 colourpicker_1.1.0
## [112] MASS_7.3-53 gtools_3.8.2 assertthat_0.2.1
## [115] withr_2.4.1 shinystan_2.5.0 multcomp_1.4-16
## [118] mgcv_1.8-33 parallel_4.0.4 hms_0.5.3
## [121] grid_4.0.4 coda_0.19-4 minqa_1.2.4
## [124] rmarkdown_2.7 shiny_1.5.0 lubridate_1.7.9.2
## [127] base64enc_0.1-3 dygraphs_1.1.1.6
```

## References

*Journal of Statistical Software*,

*80*(1), 1–28. https://doi.org/10.18637/jss.v080.i01

*The R Journal*,

*10*(1), 395–411. https://doi.org/10.32614/RJ-2018-017

*brms: Bayesian regression models using ’Stan’*. https://CRAN.R-project.org/package=brms

*Regression: Models, methods and applications*. Springer-Verlag. https://doi.org/10.1007/978-3-642-34333-9

*Data analysis using regression and multilevel/hierarchical models*. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942

*tidybayes: Tidy data and ’geoms’ for Bayesian models*. https://mjskay.github.io/tidybayes/

*Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan*. Academic Press. https://sites.google.com/site/doingbayesiandataanalysis/

*Statistical rethinking: A Bayesian course with examples in R and Stan*(Second Edition). CRC Press. https://xcelab.net/rm/statistical-rethinking/

*Statistical rethinking: A Bayesian course with examples in R and Stan*. CRC press. https://xcelab.net/rm/statistical-rethinking/

*MASS: Support functions and datasets for venables and ripley’s MASS*. https://CRAN.R-project.org/package=MASS

*broom: Convert statistical analysis objects into tidy tibbles*[Manual]. https://CRAN.R-project.org/package=broom

*Using the loo package (version \(>\)= 2.0.0)*. https://CRAN.R-project.org/package=loo/vignettes/loo2-example.html

*loo reference manual, Version 2.3.1*. https://CRAN.R-project.org/package=loo/loo.pdf

*loo: Efficient leave-one-out cross-validation and WAIC for bayesian models*. https://CRAN.R-project.org/package=loo/

*Statistics and Computing*,

*27*(5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

*tidyverse: Easily install and load the ’tidyverse’*. https://CRAN.R-project.org/package=tidyverse

*Journal of Open Source Software*,

*4*(43), 1686. https://doi.org/10.21105/joss.01686

*Bayesian Analysis*,

*13*(3), 917–1007. https://doi.org/10.1214/17-BA1091