Set your sigma prior when you know very little about your sumscore data
By A. Solomon Kurz
December 1, 2022
What?
We psychologists analyze a lot of sumscore data. Even though it’s not the best, we usually use the Gaussian likelihood^{1} for these analyses. I was recently in a situation were I wanted to model sumscores from a new questionnaire and there was no good prior research on the distribution of the sumscores. Like, there wasn’t a single published paper reporting the sample statistics. Crazy, I know… Anyway, I spent some time trying to reason through how I would set a justifiable prior for my \(\sigma\)
parameter, and in this blog post we’ll cover how I came up with a solution.
I make assumptions.

For this post, I’m presuming you are familiar with Bayesian regression. For an introduction, I recommend either edition of McElreath’s text ( 2020, 2015); Kruschke’s ( 2015) text; or Gelman, Hill, and Vehtari’s ( 2020) text.

All code is in R. Data wrangling and plotting were done with help from the tidyverse ( Wickham et al., 2019; Wickham, 2022), tidybayes ( Kay, 2022), and patchwork ( Pedersen, 2022). The Bayesian model will be fit with brms ( Bürkner, 2017, 2018, 2022). We will also use the MetBrewer package ( Mills, 2022) to select the color palette for our figures.
Here we load the packages and adjust the global plotting theme.
# load
library(tidyverse)
library(tidybayes)
library(patchwork)
library(brms)
library(MetBrewer)
# save a color vector
h < met.brewer(colorblind_palettes[8]) # "Hiroshige"
# adjust the global plotting theme
theme_set(
theme_linedraw(base_size = 13) +
theme(panel.background = element_rect(fill = alpha(h[5], 0.33)),
panel.border = element_rect(linewidth = 0.33),
panel.grid = element_blank(),
plot.background = element_rect(fill = alpha(h[5], 0.2)),
strip.background = element_rect(fill = alpha(h[5], 0.67), linewidth = 0),
strip.text = element_text(color = "black"))
)
The color palette in this post is inspired by Utagawa Hiroshige’s Sailing Boats Returning to Yabase, Lake Biwa, about which you can learn more from The Met’s website.
Overview
I came up with three broad strategies for settling in on a prior for \(\sigma\)
when in a state of relative ignorance:
 Compute the maximum value with Popoviciu’s inequality.
 Compute an approximate standard deviation presuming a uniform distribution.
 Compute standard deviations based on plausible betabinomial distributions.
We’ll explore each in turn.
Popoviciu’s inequality.
If you have continuous data with clear lower and upper boundaries, you can use Popoviciu’s (
1935)
inequality on variances to compute the maximum variance of the distribution. If we let \(U\)
be the upper bound and \(L\)
be the lower bound, Popoviciu’s inequality states
$$ \sigma^2 \leq \frac{1}{4} (U  L)^2. $$
For example, let’s say you’re using a questionnaire like the PHQ9 ( Spitzer et al., 1999), which is a widelyused depression questionnaire^{2} composed of 9 Likerttype items ranging from 0 to 3. The PHQ9 sum score has a possible range of 0 to 27, with higher scores indicating deeper depression. Using Popoviciu’s inequality, we can compute the maximum variance for the PHQ9 sum score like so.
u < 27 # upper limit
l < 0 # lower limit
(u  l)^2 / 4 # maximum variance
## [1] 182.25
Since variances are the same as standard deviations squared, we can take the square root of that value to compute the maximum standard deviation possible for the PHQ9.
sqrt((u  l)^2 / 4) # maximum standard deviation
## [1] 13.5
We might check this out with a little simulation. If we had a data set composed of \(N = 10{,}000\)
PHQ9 sumscore values which were evenly split between all zero’s and all 27’s, here’s what the sample standard deviation would be.
n < 10000
tibble(y = rep(c(0, 27), each = n / 2)) %>%
summarise(var = var(y),
sd = sd(y))
## # A tibble: 1 × 2
## var sd
## <dbl> <dbl>
## 1 182. 13.5
The displayed output values are rounded, but the sample statistics are just a hair above the population values from Popoviciu’s inequality^{3}.
When in a state of complete ignorance, you know your \(\sigma\)
prior should be centered somewhere between zero and the value returned by Popoviciu’s inequality. Though I wouldn’t recommend it, one way to use this information would be to set a uniform prior ranging from zero to the upperlimit based on Popoviciu’s inequality.
Take the standard deviation from the uniform distribution.
Though Popoviciu’s inequality gives us a sense of the range of possible values, it doesn’t do a great job helping us decide which parts of that range are more likely than not. As a next stab, you might assume the sumscore data were uniformly distributed. This would be something of a null hypothesis stating: All values are equally likely.
If you have a uniform distribution with a lower limit of \(L\)
and an upper limit of \(U\)
, you can compute the variance with the equation
$$ \sigma^2 = \frac{1}{12} (U  L)^2, $$
with the standard deviation being the square root of that value. Continuing on with our PHQ9 data as an example, here are the variance and standard deviation presuming they were uniformly distributed.
u < 27 # upper limit
l < 0 # lower limit
(u  l)^2 / 12 # variance
## [1] 60.75
sqrt((u  l)^2 / 12) # sd
## [1] 7.794229
Again, let’s run a little simulation to see how this shakes out.
n_option < 1000
tibble(y = rep(0:27, each = n_option)) %>%
summarise(var = var(y),
sd = sd(y))
## # A tibble: 1 × 2
## var sd
## <dbl> <dbl>
## 1 65.3 8.08
In this case, the sample statistics from our simulation are a little bit off from the statistics we’d expect based on the formula from the uniform distribution. Here, though, keep in mind that the uniform distribution presumes truly continuous data, not integer values like those you’d get from a sum score. Thus the uniform distribution is only an approximation, not a true analogue. However, this isn’t a bad approach if you just want to get a sense of where to start.
Thus Popoviciu’s inequality gave us a possible range for \(\sigma\)
, and the uniform distribution gave us an approximate value given the null assumption all sumscores are equally plausible.
But do you really want to use the null assumption that all sumscores are equally plausible? If not, we’ll want to work with a new distribution.
Sum scores might look betabinomial.
So even though we tend to use the Gaussian likelihood to model sumscore data for reasons^{4}, it’s not the only distribution in town. If you search around in the backwater of applied statistics, you’ll find papers on the betabinomial likelihood for sumscoretype data. For example, Lord discussed the betabinomial distribution for psychological test data in ( 1962), Wilcox later reviewed it for the same purpose in ( 1981), and Carlin and Rubin kept the conversation rolling in ( 1991). Though these authors didn’t focus on Likerttype data per se, Greenleaf used the betabinomial to describe a distribution of sums of Likerttype items in ( 1992). I’d like to extrapolate further and suggest the betabinomial distribution can help us understand sumscore data in general, and the standard deviations of sumscore data in particular.
Given some count variable \(y\)
with a clear upper bound \(n\)
, we can describe the betabinomial likelihood as
$$ f(y  n, \alpha, \beta) = \binom{n}{y} \frac{\operatorname B (y + \alpha, n  y + \beta)}{\operatorname B(\alpha, \beta)}, $$
where \(\alpha\)
and \(\beta\)
are the two parameters from the canonical version of the beta likelihood, \(\operatorname B(\cdot)\)
is the
beta function, and \(\tbinom{n}{y}\)
is a shorthand factorial notation. The \(\alpha\)
and \(\beta\)
parameters aren’t the most intuitive, but for our purposes the main thing to understand is when the mean of the distribution is in the middle of the range, \(\alpha = \beta\)
. I propose that when you are in a state of ignorance about a new sumscore, your goto assumption should be the mean is in the middle. Why? Well, you have to start somewhere and the middle is at least as good a place as any other. But also keep in mind that most questionnaire developers have had at least some^{5} training in psychometrics, and they know that it’s generally a good idea to develop measures with approximately bellshaped and symmetric distributions. Thus even of you’ve never seen prior examples of a given sum score, a good initial guess is the sample mean is somewhere in the middle of the possible range.
In the case of our PHQ9 example where the lower limit is zero and the upper limit is 27, we could presume the data were uniformly distributed with \(\operatorname{BetaBinomial}(27, 1, 1)\)
. We can simulate data of that kind with the rbetabinom.ab()
function from the VGAM package (
Yee, 2022). Here’s a sample of \(N = 10{,}000\)
.
set.seed(9)
d < tibble(phq = VGAM::rbetabinom.ab(n = 1e4, size = 27, shape1 = 1, shape2 = 1))
# what?
head(d)
## # A tibble: 6 × 1
## phq
## <dbl>
## 1 20
## 2 23
## 3 16
## 4 17
## 5 10
## 6 22
Here’s what our simulated phq
values look like in a plot.
d %>%
ggplot(aes(x = phq)) +
geom_bar(fill = h[7]) +
scale_x_continuous(breaks = 0:3 * 9) +
ggtitle("10,000 draws from BetaBinomial(27, 1, 1)")
Here are the sample statistics.
d %>%
summarise(m = mean(phq),
v = var(phq),
s = sd(phq))
## # A tibble: 1 × 3
## m v s
## <dbl> <dbl> <dbl>
## 1 13.4 64.6 8.04
Now we’ve shifted from the exponential distribution to the betabinomial, we might point out the variance of the betabinomial is defined as
$$ \sigma^2 = \frac{n \alpha \beta (\alpha + \beta + n)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}, $$
and the standard deviation of the betabinomial is just the square root of that value. Here’s what that looks like in the case of \(\operatorname{BetaBinomial}(27, 1, 1)\)
.
n < 27
a < 1
b < a
# population variance
((n * a * b) * (a + b + n)) / ((a + b)^2 * (a + b + 1))
## [1] 65.25
# population sd
sqrt(((n * a * b) * (a + b + n)) / ((a + b)^2 * (a + b + 1)))
## [1] 8.077747
So if we’re willing to describe the sum scores of our Likerttype items as integers, the flat betabinomial may give us a better first stab at the standard deviation than the exponential distribution approach, above.
But we can go further. The betabinomial \(\alpha\)
and \(\beta\)
parameters can take on any positive real values. For simplicity, consider what the betabinomial distribution looks like if we hold \(n = 27\)
and serially increase both \(\alpha\)
and \(\beta\)
from 1 to 9.
tibble(a = 1:9) %>%
mutate(b = a) %>%
expand(nesting(a, b),
phq = 0:27) %>%
mutate(d = VGAM::dbetabinom.ab(x = phq, size = 27, shape1 = a, shape2 = b)) %>%
mutate(label = str_c("BetaBinomial(27, ", a, ", ", b, ")")) %>%
ggplot(aes(x = phq, y = d)) +
geom_col(fill = h[7]) +
scale_x_continuous(breaks = 0:3 * 9) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ label)
As the \(\alpha\)
and \(\beta\)
parameters increase, the distribution becomes more concentrated around the mean. I don’t know about you, but my experience has been that most roughlysymmetric sumscore data tend to look about like the distributions in \(\operatorname{BetaBinomial}(27, 3, 3)\)
through \(\operatorname{BetaBinomial}(27, 7, 7)\)
.
Using the formula from above, here are the expected standard deviations from each of the 9 distributions.
tibble(n = 27,
a = 1:9) %>%
mutate(b = a) %>%
mutate(variance = (n * a * b * (a + b + n)) / ((a + b)^2 * (a + b + 1))) %>%
mutate(standard_deviation = sqrt(variance))
## # A tibble: 9 × 5
## n a b variance standard_deviation
## <dbl> <int> <int> <dbl> <dbl>
## 1 27 1 1 65.2 8.08
## 2 27 2 2 41.8 6.47
## 3 27 3 3 31.8 5.64
## 4 27 4 4 26.2 5.12
## 5 27 5 5 22.7 4.76
## 6 27 6 6 20.2 4.5
## 7 27 7 7 18.4 4.30
## 8 27 8 8 17.1 4.13
## 9 27 9 9 16.0 4.00
If you’re in a total state of ignorance, it’s hard to know which one’s best. But I think we’re in the right ballpark, here.
Recap.
 Popoviciu’s inequality gave us a possible range for
\(\sigma\)
, which was between 0 and 13.5 for our PHQ9 example.  When we used the uniform distribution as a rough null default, the expected standard deviation was about 7.8.
 When we used plausible candidate betabinomial distributions, we got
\(\sigma\)
values ranging from 6 to 4, depending on which seems like the best default.
Thus, if I were to try modeling actual PHQ9 data for the first time, and I had no access to all the previous studies using the PHQ9, I’d set something like
$$ \sigma \sim \operatorname{Exponential}(1 / 5.641049), $$
based of the expected standard deviation from \(\operatorname{BetaBinomial}(27, 3, 3)\)
. Here’s what that prior looks like:
# 1 / 5.641049 is about 0.177272
prior(exponential(0.177272)) %>%
parse_dist() %>%
ggplot(aes(y = 0, dist = .dist, args = .args)) +
stat_halfeye(point_interval = mean_qi, .width = c(.5, .95),
fill = h[6]) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Exponential(1 / 5.641049)",
x = NULL)
Such a distribution puts the inner 50% of the prior mass between 1.6 and 7.8, which is centered where I want it. The exponential prior, by default, holds the standard deviation constant with the mean. If you wanted a stronger prior with a smaller standard deviation, you could switch to a 2parameter distribution like the gamma or lognormal.
# to compute the 50% and 95% ranges, execute this code
qexp(p = c(0.25, 0.75), rate = 1 / 5.641049) # 50% interval
qexp(p = c(0.025, 0.975), rate = 1 / 5.641049) # 95% interval
If you’re not familiar with using the exponential distribution for \(\sigma\)
priors, check out the second edition of McElreath’s
textbook. In chapter 4, he made the case the exponential distribution can be a good option when you have a sense of where the mean of the prior should be, but you’re unsure how large you want its spread. I’ve found it pretty handy over the past couple years.
Applied example
Okay, that’s enough theory. Let’s see how this works in practice.
We need data.
In response to a
call on twitter, neuroscience graduate student
Sam Zorowitz kindly shared some deidentified PHQ9 data collected by the good folks in the Niv Lab. I’ve saved the file on GitHub. Here we’ll load the data and save them as phq9
.
# load
phq9 < read_csv("data/phq9.csv")
# what is this?
glimpse(phq9)
## Rows: 4,500
## Columns: 4
## $ subject <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3…
## $ item <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3…
## $ order <dbl> 1, 6, 10, 3, 9, 8, 4, 5, 7, 3, 2, 8, 10, 4, 5, 1, 9, 6, 3, 5,…
## $ response <dbl> 1, 0, 2, 2, 0, 0, 1, 0, 0, 0, 1, 1, 2, 0, 1, 0, 0, 0, 2, 3, 3…
These data are a subset of \(n = 500\)
cases, which Zorowitz randomly drew from a larger pool of \(N \approx 10{,}000\)
cases. In a personal communication (11302022), Zorowitz reported his team plans to make the full parent data set publicly available in the near future. In the meantime, this \(n = 500\)
subset is more than enough for our purposes. In this file:
subject
: anonymized participant IDitem
: the unique item on the PHQ9order
: the position of the item on the webpage as presented to that participantresponse
: the participant’s response
As we’d expect, the response
values range from 0
to 3
.
phq9 %>%
count(response)
## # A tibble: 4 × 2
## response n
## <dbl> <int>
## 1 0 1939
## 2 1 1392
## 3 2 726
## 4 3 443
Happily, there are no missing values to contend with. Since our focus is on the sum scores, rather than the items, we’ll need to wrangle the data a bit.
phq9 < phq9 %>%
group_by(subject) %>%
summarise(phq = sum(response)) %>%
ungroup()
# what?
glimpse(phq9)
## Rows: 500
## Columns: 2
## $ subject <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ phq <dbl> 6, 5, 16, 3, 8, 9, 13, 13, 7, 4, 2, 4, 3, 13, 3, 7, 8, 0, 10, …
Now the sum scores are in the phq
column. Though we don’t technically need to, it will help me make my point if we subset the data further to \(n = 200\)
. We can take a random subset with the slice_sample()
function.
set.seed(9)
phq9 < phq9 %>%
slice_sample(n = 200)
# confirm the sample size
nrow(phq9)
## [1] 200
Here’s what the data look like.
phq9 %>%
ggplot(aes(x = phq)) +
geom_bar(fill = h[3]) +
scale_x_continuous(breaks = 0:3 * 9, limits = c(0.5, 27.5))
Based on my experience, this is a pretty typical distribution for the PHQ9.
We need a model.
For the sake of the blog, let’s keep things simple and fit an interceptonly model of the phq
scores. In the real world, I actually know a lot about the PHQ9 and there’s a mountain of prior studies reporting sample statistics for its sum score. But if this weren’t the case and we were trying to fit a model in a state of nearcomplete ignorance, I think a pretty great place to start would be
$$
\begin{align*} \text{PHQ9}_i & \sim \operatorname{Normal}(\mu, \sigma) \\ \mu & = \beta_0 \\ \beta_0 & \sim \operatorname{Normal}(13.5, 5.641049) \\ \sigma & \sim \operatorname{Exponential}(1 / 5.641049), \end{align*}
$$
where the top line indicates the PHQ9 scores vary across the \(i\)
cases, and the scores are described as following the normal distribution. The second line of the equation indicates this is an interceptonly model, with \(\beta_0\)
defining the intercept. Skipping the third line for a moment, the fourth line shows we’ve chosen an exponential prior for \(\sigma\)
for which the rate is \(1 / 5.641049\)
and thus the mean is \(5.641049\)
^{6}. Where did that value come from, again? After exploring some nice candidate betabinomial distributions, we said \(\operatorname{BetaBinomial}(27, 3, 3)\)
might be a good default option, and the expected standard deviation for that distribution is 5.641049. Here those are, again:
# define
n < 27
a < 3
b < a
tibble(phq = 0:27,
a = a,
b = b) %>%
mutate(d = VGAM::dbetabinom.ab(x = phq, size = n, shape1 = a, shape2 = b)) %>%
# plot
ggplot(aes(x = phq, y = d)) +
geom_col(fill = h[7]) +
scale_x_continuous(breaks = 0:3 * 9) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("BetaBinomial(27, 3, 3)")
# compute the population sd
sqrt(((n * a * b) * (a + b + n)) / ((a + b)^2 * (a + b + 1)))
## [1] 5.641049
Now back to the third line in the model equation. As per usual, we have used a Gaussian prior for \(\beta_0\)
. We’ve centered the prior on the middle of the possible range of the sum score, which is 13.5. Since we’ve decided 5.641049 is a pretty good value for \(\sigma\)
, it’s not a bad idea to use that value for the scale of our \(\beta_0\)
prior. Such a prior would have an inner 95% range of about 2.4 to 22.6, which covers most of the possible sumscore range.
qnorm(p = c(0.025, 0.975), mean = 13.5, sd = 5.641049)
## [1] 2.443747 24.556253
Otherwise put, the full range of the PHQ9 sum score is 13.5 plus or minus 2.4 standard deviations, as defined by our \(\sigma\)
prior.
13.5 / 5.641049
## [1] 2.393172
F* around and find out.^{7}
Here’s how to fit the model with brm()
.
fit1 < brm(
data = phq9,
family = gaussian,
phq ~ 1,
prior(normal(13.5, 5.641049), class = Intercept) +
prior(exponential(1 / 5.641049), class = sigma),
cores = 4, seed = 1
)
Check the model summary.
summary(fit1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: phq ~ 1
## Data: phq9 (Number of observations: 200)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total postwarmup draws = 4000
##
## PopulationLevel Effects:
## Estimate Est.Error l95% CI u95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 9.04 0.41 8.24 9.83 1.00 3009 2271
##
## Family Specific Parameters:
## Estimate Est.Error l95% CI u95% CI Rhat Bulk_ESS Tail_ESS
## sigma 5.81 0.29 5.27 6.43 1.00 3468 2246
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
At a glance, everything looks fine. It might be more instructive to look the results in a couple plots where we compare the priors with the posteriors.
# intercept
p1 < as_draws_df(fit1) %>%
ggplot(aes(x = b_Intercept, y = after_stat(density))) +
geom_area(data = tibble(b_Intercept = seq(5, 30, by = 0.01),
density = dnorm(b_Intercept, 13.5, 5.641049)),
aes(y = density),
fill = h[10]) +
geom_histogram(fill = alpha(h[2], 2/3), boundary = 0, binwidth = 0.2) +
scale_x_continuous(expression(beta[0]), breaks = 0:3 * 9) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 27))
# sigma
p2 < as_draws_df(fit1) %>%
ggplot(aes(x = sigma, y = after_stat(density))) +
geom_area(data = tibble(sigma = seq(0, 20, by = 0.01),
density = dexp(sigma, 1 / 5.641049)),
aes(y = density),
fill = h[10]) +
geom_histogram(fill = alpha(h[2], 2/3), boundary = 0, binwidth = 0.1) +
geom_vline(xintercept = 13.5, linetype = 2, color = h[1]) +
annotate(geom = "text",
x = 13.3, y = 1.37,
label = "Upper limit via\nPopoviciu's inequality",
hjust = 1, size = 3) +
scale_x_continuous(expression(sigma), breaks = 0:3 * 5) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 15))
# combine
(p1  p2) &
plot_annotation(title = "Careful priors can express our substantive ignorance, but still\naccount for our methodological knowledge.",
subtitle = "The prior densities lurk in the background. The posteriors are the\nsemitransparent orange histograms in the foreground.")
It looks like both priors did their jobs well.
Objections
But the data were skewed!
Some readers might not like how we centered the prior for \(\beta_0\)
or how we used a symmetric betabinomial distribution to determine the prior for \(\sigma\)
given how the actual PHQ9 data were skewed to the right, and given how PHQ9 data are very often skewed to the right in other samples. Keep in mind that our focus is a strategy which will work when we don’t have information like that going in. When you’re completely ignorant of how your sumscore data tend to look, I recommend assuming they’ll be roughly symmetric.
However, it’s possible some of y’all work in fields where your sumscores are routinely skewed in one direction or the other. If so, great! That means you have useful domain knowledge upon which you can base your priors. One of the nice things about the betabinomial distribution is it can take on a variety of nonsymmetric shapes. For example, take a look at the distribution of \(\operatorname{BetaBinomial}(27, 1.5, 3.25)\)
.
# define
n < 27
a < 1.5
b < 3.25
tibble(phq = 0:27,
a = a,
b = b) %>%
mutate(d = VGAM::dbetabinom.ab(x = phq, size = n, shape1 = a, shape2 = b)) %>%
# plot
ggplot(aes(x = phq, y = d)) +
geom_col(fill = h[7]) +
scale_x_continuous(breaks = 0:3 * 9) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("BetaBinomial(27, 1.5, 3.25)")
This looks a lot like the empirical distribution in our phq9
data, doesn’t it? We could then use this skewed version of the betabinomial to compute the expected standard deviation.
# compute the population sd
sqrt(((n * a * b) * (a + b + n)) / ((a + b)^2 * (a + b + 1)))
## [1] 5.675623
So if you were working with questionnaire data in a domain where you expected the sumscore to have this kind of skewed distribution, you might assume \(\sigma \sim \operatorname{Exponential}(1 / 5.675623)\)
. It’s worth spending some time plotting betabinomial distributions with different values for its \(n\)
, \(\alpha\)
, and \(\beta\)
parameters. Scientists working in different research domains might prefer different default versions of the distribution.
Just standardize.
If you wanted to take a different approach, you could just standardize the sumscore data before the analysis and then fit the model to the standardized values. This is a fine practice and it would greatly simplify how you choose your priors. I do this on occasion, myself. However, sometimes I want to fit the model to the untransformed data and I’d like a workflow that can work on those occasions. This is such a workflow.
Plus, like, do you really care so little about your data and your research questions that you don’t want to put a little thought into what you expect the data will look like? If not, maybe science isn’t a good fit for you.^{8}
You sure about that Gauss?
Some readers might wonder whether it’s such a great idea to model sumscore data with the conventional Gaussian likelihood. The Gauss expects truly continuous data, and it doesn’t have a natural way to handle data with welldefined minimum and maximum values. The Gaussian likelihood might not be the best when dealing with markedlyskewed data, either.
One alternative would be using skewnormal or skewStudent likelihood. For details, see the ( 2017) preprint by Martin and Williams. This approach, however, will not fully solve the problem with the lower and upper boundaries. A more sophisticated approach would be to model the itemlevel data with a multilevelordinal IRTtype model, such as discussed by Bürkner ( 2020). This approach is excellent for respecting the ordinal nature of the questionnaire items and for expressing the datagenerating process, but it comes at the cost of a complex, highlyparameterized model which may be difficult to fit and explain in a manuscript.
If you’ve been following along, another option is to model the sum scores as betabinomial. This approach would account for skew, upper and lower boundaries, and for the integer values. The betabinomial would not faithfully reproduce the itemlevel datagenerating process, but you might think of it as a pragmatic and simpler alternative to the rigorous itemlevel multilevelordinal IRT approach. This approach is possible with brm()
by setting family = beta_binomial
. Though it’s beyond the scope of this blog post, I’ve been chipping away at a blog series which will explore the betabinomial distribution for sumscore data. As a preview, here’s how to fit such a model to these data.
fit2 < brm(
data = phq9,
family = beta_binomial,
phq  trials(27) ~ 1,
prior(normal(0, 1), class = Intercept) +
prior(gamma(2.25, 0.375), class = phi),
cores = 4, seed = 1
)
Whichever method you use in your research, I hope this post gave you some helpful ideas. Happy modeling, friends!
Session info
sessionInfo()
## R version 4.2.0 (20220422)
## Platform: x86_64appledarwin17.0 (64bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF8/en_US.UTF8/en_US.UTF8/C/en_US.UTF8/en_US.UTF8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MetBrewer_0.2.0 brms_2.18.0 Rcpp_1.0.9 patchwork_1.1.2
## [5] tidybayes_3.0.2 forcats_0.5.1 stringr_1.4.1 dplyr_1.0.10
## [9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.1 tibble_3.1.8
## [13] ggplot2_3.4.0 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.1 backports_1.4.1 VGAM_1.17
## [4] plyr_1.8.7 igraph_1.3.4 svUnit_1.0.6
## [7] splines_4.2.0 crosstalk_1.2.0 TH.data_1.11
## [10] rstantools_2.2.0 inline_0.3.19 digest_0.6.30
## [13] htmltools_0.5.3 fansi_1.0.3 magrittr_2.0.3
## [16] checkmate_2.1.0 googlesheets4_1.0.1 tzdb_0.3.0
## [19] modelr_0.1.8 RcppParallel_5.1.5 matrixStats_0.62.0
## [22] vroom_1.5.7 xts_0.12.1 sandwich_3.02
## [25] prettyunits_1.1.1 colorspace_2.03 rvest_1.0.2
## [28] ggdist_3.2.0 haven_2.5.1 xfun_0.35
## [31] callr_3.7.3 crayon_1.5.2 jsonlite_1.8.3
## [34] lme4_1.131 survival_3.40 zoo_1.810
## [37] glue_1.6.2 gtable_0.3.1 gargle_1.2.0
## [40] emmeans_1.8.0 distributional_0.3.1 pkgbuild_1.3.1
## [43] rstan_2.21.7 abind_1.45 scales_1.2.1
## [46] mvtnorm_1.13 DBI_1.1.3 miniUI_0.1.1.1
## [49] xtable_1.84 bit_4.0.4 stats4_4.2.0
## [52] StanHeaders_2.21.07 DT_0.24 htmlwidgets_1.5.4
## [55] httr_1.4.4 threejs_0.3.3 arrayhelpers_1.10
## [58] posterior_1.3.1 ellipsis_0.3.2 pkgconfig_2.0.3
## [61] loo_2.5.1 farver_2.1.1 sass_0.4.2
## [64] dbplyr_2.2.1 utf8_1.2.2 labeling_0.4.2
## [67] tidyselect_1.1.2 rlang_1.0.6 reshape2_1.4.4
## [70] later_1.3.0 munsell_0.5.0 cellranger_1.1.0
## [73] tools_4.2.0 cachem_1.0.6 cli_3.4.1
## [76] generics_0.1.3 broom_1.0.1 ggridges_0.5.3
## [79] evaluate_0.18 fastmap_1.1.0 yaml_2.3.5
## [82] bit64_4.0.5 processx_3.8.0 knitr_1.40
## [85] fs_1.5.2 nlme_3.1159 mime_0.12
## [88] projpred_2.2.1 xml2_1.3.3 compiler_4.2.0
## [91] bayesplot_1.9.0 shinythemes_1.2.0 rstudioapi_0.13
## [94] gamm4_0.26 reprex_2.0.2 bslib_0.4.0
## [97] stringi_1.7.8 highr_0.9 ps_1.7.2
## [100] blogdown_1.15 Brobdingnag_1.28 lattice_0.2045
## [103] Matrix_1.41 nloptr_2.0.3 markdown_1.1
## [106] shinyjs_2.1.0 tensorA_0.36.2 vctrs_0.5.0
## [109] pillar_1.8.1 lifecycle_1.0.3 jquerylib_0.1.4
## [112] bridgesampling_1.12 estimability_1.4.1 httpuv_1.6.5
## [115] R6_2.5.1 bookdown_0.28 promises_1.2.0.1
## [118] gridExtra_2.3 codetools_0.218 boot_1.328
## [121] colourpicker_1.1.1 MASS_7.358.1 gtools_3.9.3
## [124] assertthat_0.2.1 withr_2.5.0 shinystan_2.6.0
## [127] multcomp_1.420 mgcv_1.840 parallel_4.2.0
## [130] hms_1.1.1 grid_4.2.0 coda_0.194
## [133] minqa_1.2.5 rmarkdown_2.16 googledrive_2.0.0
## [136] numDeriv_2016.81.1 shiny_1.7.2 lubridate_1.8.0
## [139] base64enc_0.13 dygraphs_1.1.1.6
References
Bürkner, P.C. (2020). Bayesian item response modeling in R with brms and Stan. http://arxiv.org/abs/1905.09501
Bürkner, P.C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01
Bürkner, P.C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ2018017
Bürkner, P.C. (2022). brms: Bayesian regression models using ’Stan’. https://CRAN.Rproject.org/package=brms
Carlin, J. B., & Rubin, D. B. (1991). Summarizing multiplechoice tests using three informative statistics. Psychological Bulletin, 110(2), 338–349. https://doi.org/10.1037/00332909.110.2.338
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press. https://doi.org/10.1017/9781139161879
Greenleaf, E. A. (1992). Measuring extreme response style. Public Opinion Quarterly, 56(3), 328–351. https://doi.org/10.1086/269326
Kay, M. (2022). tidybayes: Tidy data and ’geoms’ for Bayesian models. https://CRAN.Rproject.org/package=tidybayes
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press. https://sites.google.com/site/doingbayesiandataanalysis/
Lord, F. M. (1962). Estimating true measurements from fallible measurements (binomial case)expansion in a series of beta distributions. ETS Research Bulletin Series, 1962(2), i–26. https://doi.org/10.1002/j.23338504.1962.tb00301.x
Martin, S. R., & Williams, D. R. (2017). Outgrowing the Procrustean bed of normality: The utility of Bayesian modeling for asymmetrical data analysis. https://doi.org/10.31234/osf.io/26m49
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (Second Edition). CRC Press. https://xcelab.net/rm/statisticalrethinking/
McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press. https://xcelab.net/rm/statisticalrethinking/
Mills, B. R. (2022). MetBrewer: Color palettes inspired by works at the Metropolitan Museum of Art. https://CRAN.Rproject.org/package=MetBrewer
Pedersen, T. L. (2022). patchwork: The composer of plots. https://CRAN.Rproject.org/package=patchwork
Popoviciu, T. (1935). Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica Cluj, 9(129145).
Spitzer, R. L., Kroenke, K., Williams, J. B., Group, P. H. Q. P. C. S., Group, P. H. Q. P. C. S., et al. (1999). Validation and utility of a selfreport version of PRIMEMD: The PHQ primary care study. JAMA, 282(18), 1737–1744. https://doi.org/10.1001/jama.282.18.1737
Wickham, H. (2022). tidyverse: Easily install and load the ’tidyverse’. https://CRAN.Rproject.org/package=tidyverse
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilcox, R. R. (1981). A review of the betabinomial model and its extensions. Journal of Educational Statistics, 6(1), 3–32. https://doi.org/10.2307/1165046
Yee, T. (2022). VGAM: Vector generalized linear and additive models. https://www.stat.auckland.ac.nz/~yee/VGAM/

Or if you’re thinking in terms of OLS, presume the residuals are normally distributed. ↩︎

We use the PHQ9 all the time in the VA (i.e., the US Department of Veterans Affairs) both in clinical services and in research settings. It’s reasonably brief and it does a fair job assessing the primary diagnostic criteria for major depressive disorder. ↩︎

If you increase the
n
of the sample, you’ll see the sample statistics quickly approach the values from Popoviciu’s inequality. ↩︎ 
There’s the whole ‘Fisher liked the normal distribution, so you should too’ bit. Plus we do need defaults and the Gaussian isn’t a bad candidate, all things considered. But yeah, sigh ↩︎

Yes, I know; some is doing a lot of work, here. ↩︎

If you didn’t know, the rate of the exponential distribution is the reciprocal of its mean. ↩︎

In case you’re not in on the joke, behold: https://youtu.be/d6sbPCIEMyI ↩︎

Okay, I admit that was a little snarky. But what are we really doing here? Are we just churning out
\(p\)
values for pubs or does the science matter at all? ↩︎