
Bayesian Model Comparison
Overview
In this tutorial, we will compare two True and Error model variants using the Bayes factor. One model variant imposes no restrictions on the error probability parameters, whereas the other constrains the error probabilities to be equal. Computing the Bayes factor is challenging because it requires integrating over a potentially high-dimensional parameter space. To compute Bayes factors, we will use a robust method called non-reversible parallel tempering (Syed et al., 2022), implemented in the Julia package Pigeons.jl.
Bayes Factor
Before proceeding to the code, we provide a brief overview of the Bayes factor. Readers who are familiar with Bayes factors can skip this section. In Bayesian model comparison, the Bayes factor allows one to compare the probability of the data under two different models while taking into account model flexibility stemming from all sources, including the number of parameters, functional form, and prior distribution. Thus, it provides a way to balance model fit and model flexibility in a single index. One important fact to keep in mind is that Bayes factors can be sensitive to the choice of prior distributions over parameters. Sensitivity to prior distributions over parameters might be desirable depending on one's goals and knowledge of the models under consideration.
The Bayes factor is the ratio of the likelihood of the data $\mathbf{Y} = \left[y_1,y_2, \dots, y_n\right]$ under model $\mathcal{M}_i$ vs. model $\mathcal{M}_j$. The relationship between the Bayes factor and the posterior odds of $\mathcal{M}_i$ vs. $\mathcal{M}_j$ can be stated as:
$\frac{\pi(\mathcal{M}_i \mid \mathbf{Y})}{\pi(\mathcal{M}_j \mid \mathbf{Y})} = \frac{\pi(\mathcal{M}_i)}{\pi(\mathcal{M}_j)} \mathrm{BF}_{i,j}.$
The term on the left-hand side is the posterior odds of $\mathcal{M}_i$ vs. $\mathcal{M}_j$, $\pi$ denotes the posterior probability, the first term on the right-hand side is the prior odds of $\mathcal{M}_i$ vs. $\mathcal{M}_j$, and $\mathrm{BF}_{i,j}$ is the Bayes factor for $\mathcal{M}_i$ vs. $\mathcal{M}_j$. In the equation above, $\mathrm{BF}_{i,j}$ functions as a conversion factor between prior odds and posterior odds. Thus, the Bayes factor can be interpreted as the factor by which the prior odds must be updated in light of the data. For example, a researcher with prior odds of 1 and another with prior odds of 4 in favor of $\mathcal{M}_i$ who observe $\mathrm{BF}_{i,j} = 5$ arrive at posterior odds of 5 and 20, respectively. This interpretation is important because it demonstrates that the prior odds should be updated by the same factor even if there is disagreement over the prior odds. The Bayes factor can also be written as the ratio of marginal likelihoods as follows:
$\mathrm{BF}_{i,j} = \frac{f(\mathbf{Y} \mid \mathcal{M}_i)}{f(\mathbf{Y} \mid \mathcal{M}_j)}$,
where $f(\mathbf{Y} \mid \boldsymbol{\theta}, \mathcal{M}_i)$ is the likelihood function of $\mathcal{M}_i$, and the marginal likelihood of $\mathcal{M}_i$ is given by:
$f(\mathbf{Y} \mid \mathcal{M}_i) = \int_{\boldsymbol{\theta}\in \boldsymbol{\Theta}_i} f(\mathbf{Y} \mid \boldsymbol{\theta}, \mathcal{M}_i) \pi(\boldsymbol{\theta} \mid \mathcal{M}_i) d \boldsymbol{\theta}$.
In the equation above, $\boldsymbol{\Theta}_i$ is the parameter space for $\mathcal{M}_i$ and $\boldsymbol{\theta} \in \boldsymbol{\Theta}_i$ is a vector of parameters. Under this interpretation, the marginal likelihood represents the average prior predictive ability of $\mathcal{M}_i$. One benefit of the Bayes factor is that the marginal likelihood accounts for model flexibility because the density of the prior distribution must be "rationed" across the parameter space (i.e., it must integrate to 1). Consequently, the predictions of a model with a diffuse prior distribution in a high-dimensional parameter space will be penalized due to its low prior density.
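To make the idea of averaging over the prior concrete, the short sketch below estimates a marginal likelihood by simple Monte Carlo. The toy model (a binomial likelihood with a uniform prior on the success probability) is unrelated to True and Error models and is used only because its exact marginal likelihood, $1/(n+1)$, is known.
using Distributions
using Random
using Statistics

Random.seed!(1)
n, y = 20, 14                            # toy data: 14 successes out of 20 trials
θ_draws = rand(Uniform(0, 1), 10_000)    # draws from the prior
# marginal likelihood ≈ average of the likelihood over draws from the prior
mll_toy = mean(pdf(Binomial(n, θ), y) for θ in θ_draws)
# exact value: 1 / (n + 1) ≈ 0.0476
For realistic models, this kind of crude averaging is far too noisy, which is why we use parallel tempering below.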
Load Packages
Before proceeding, we will load the required packages.
using MCMCChains
using Pigeons
using Random
using TrueAndErrorModels
using Turing
Define Models
The following code blocks define the models along with their prior distributions using Turing.jl. Notice that the models are identical except for the constraints imposed on the error probabilities $\epsilon_i$.
TET4 Model
The TET4 model is described in detail on the model overview page. The 4 in TET4 refers to the number of error probability parameters, which are described below. The TET4 model has four true preference state parameters, which collectively form the joint probability distribution over the preference states $\mathrm{R}_1\mathrm{R}_2$, $\mathrm{R}_1\mathrm{S}_2$, $\mathrm{S}_1\mathrm{R}_2$, and $\mathrm{S}_1\mathrm{S}_2$, where $\mathrm{R}$ represents a true preference for the risky option, $\mathrm{S}$ represents a true preference for the safe option, and the subscript corresponds to the choice set. The joint preference state parameters are:
- $p_{\mathrm{R_1R_2}}$: the probability of preferring the risky option in both choice sets
- $p_{\mathrm{R_1S_2}}$: the probability of preferring the risky option in the first choice set and the safe option in the second choice set
- $p_{\mathrm{S_1R_2}}$: the probability of preferring the safe option in the first choice set and the risky option in the second choice set
- $p_{\mathrm{S_1S_2}}$: the probability of preferring the safe option in both choice sets
subject to the constraint that $p_{\mathrm{R_1R_2}} + p_{\mathrm{R_1S_2}} + p_{\mathrm{S_1R_2}} + p_{\mathrm{S_1S_2}} = 1$ and $p_i \geq 0, \forall i$. As its name implies, the TET4 model also has four error probability parameters:
- $\epsilon_{\mathrm{S}_1}$: the error probability of selecting $\mathcal{S}_1$ given that $\mathcal{R}_1$ is preferred.
- $\epsilon_{\mathrm{S}_2}$: the error probability of selecting $\mathcal{S}_2$ given that $\mathcal{R}_2$ is preferred.
- $\epsilon_{\mathrm{R}_1}$: the error probability of selecting $\mathcal{R}_1$ given that $\mathcal{S}_1$ is preferred.
- $\epsilon_{\mathrm{R}_2}$: the error probability of selecting $\mathcal{R}_2$ given that $\mathcal{S}_2$ is preferred.
The only constraint is that $\epsilon_i \in [0, .50], \forall i$. The TET4 model is automatically loaded when Turing is loaded into your Julia session. The tet4_model function accepts a vector of response frequencies. The prior distributions are as follows:
$\mathbf{p} \sim \mathrm{Dirichlet}([1,1,1,1])$
$\boldsymbol{\epsilon} \sim \mathrm{Uniform}(0, .5)$
where $\mathbf{p}$ is a vector of four preference state parameters, and $\boldsymbol{\epsilon}$ is a vector of error probabilities.
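To build intuition for these priors, samples can be drawn from them directly using Distributions.jl. The sketch below is purely illustrative; the model functions specify these priors internally.
using Distributions

p_draw = rand(Dirichlet(ones(4)))    # one draw of the four preference state probabilities
ϵ_draw = rand(Uniform(0, 0.5), 4)    # one draw of the four error probabilities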
TET1 Model
As the name implies, the TET1 model constrains all error probability parameters to be equal:
$\epsilon = \epsilon_{\mathrm{S}_1} = \epsilon_{\mathrm{S}_2} = \epsilon_{\mathrm{R}_1} =\epsilon_{\mathrm{R}_2}$
Otherwise, TET1 and TET4 are identical. The TET1 model is also automatically loaded when Turing is loaded into your Julia session. The tet1_model function accepts a vector of response frequencies and uses the following prior distributions over the parameters:
$\mathbf{p} \sim \mathrm{Dirichlet}([1,1,1,1])$
$\epsilon \sim \mathrm{Uniform}(0, .5),$
where $\mathbf{p}$ is a vector of four preference state parameters, and the error probability $\epsilon$ is a scalar.
Data-Generating Model
In our demonstration, we will use TET1 as the data-generating model. In the code block below, we will create a model object and generate 2 simulated responses from each of 100 simulated subjects, for a total of 200 responses. In this model, we assume that the probability of the true preference state $\mathrm{R}_1\mathrm{R}_2$ is relatively high, and that the probability of the other preference states decreases as they become more different from $\mathrm{R}_1\mathrm{R}_2$:
- $p_{\mathrm{R_1R_2}} = .65$
- $p_{\mathrm{R_1S_2}} = .15$
- $p_{\mathrm{S_1R_2}} = .15$
- $p_{\mathrm{S_1S_2}} = .05$
In addition, our model assumes the error probabilities are constrained to be equal:
$\epsilon_{\mathrm{S}_1} = \epsilon_{\mathrm{S}_2} = \epsilon_{\mathrm{R}_1} = \epsilon_{\mathrm{R}_2} = .10$
Random.seed!(258)
# data-generating TET1 model: equal error probabilities of .10
dist = TrueErrorModel(; p = [0.65, .15, .15, .05], ϵ = fill(.10, 4))
# simulate 200 responses, returned as a vector of response frequencies
data = rand(dist, 200)
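As a quick check, the simulated response frequencies should sum to the total number of simulated responses. This assumes data is a vector of response-pattern counts, which is the format expected by the model functions below.
sum(data)    # expected to equal 200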
Estimate Marginal Log Likelihood
The next step is to run the pigeons function to estimate the marginal log likelihood for each model.
TET4
The code block below estimates the marginal log likelihood of the TET4 model. This involves passing tet4_model, applied to the vector of response frequencies data, to the pigeons function as a TuringLogPotential target.
pt_tet4 = pigeons(target=TuringLogPotential(tet4_model(data)), record=[traces])
────────────────────────────────────────────────────────────────────────────
scans Λ log(Z₁/Z₀) min(α) mean(α) min(αₑ) mean(αₑ)
────────── ────────── ────────── ────────── ────────── ────────── ──────────
2 3.22 -47.6 0.000923 0.643 1 1
4 1.86 -39.9 0.265 0.793 1 1
8 3.6 -38.2 0.255 0.6 1 1
16 3.2 -39.2 0.403 0.645 1 1
32 3.51 -38.8 0.36 0.61 1 1
64 3.56 -39.6 0.441 0.605 1 1
128 3.78 -40.1 0.488 0.58 1 1
256 3.63 -39.4 0.482 0.596 1 1
512 3.61 -39.5 0.556 0.599 1 1
1.02e+03 3.56 -39.2 0.577 0.604 1 1
────────────────────────────────────────────────────────────────────────────
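The estimate of log(Z₁/Z₀) in the final rows of the output should be reasonably stable. If it is not, precision can be improved by increasing the number of tuning rounds and parallel chains via keyword arguments, as in the sketch below (larger values increase run time):
pt_tet4 = pigeons(
    target = TuringLogPotential(tet4_model(data)),
    n_rounds = 12,     # more tuning rounds for a more precise estimate
    n_chains = 10,     # number of interpolating chains (tempering levels)
    record = [traces]
)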
Below, we will change the numerical indices to more descriptive names for ease of interpretation. The following code converts the output to a Chains object and renames the parameters.
name_map = Dict(
"p[1]" => "pᵣᵣ",
"p[2]" => "pᵣₛ",
"p[3]" => "pₛᵣ",
"p[4]" => "pₛₛ",
"ϵ[1]" => "ϵₛ₁",
"ϵ[2]" => "ϵₛ₂",
"ϵ[3]" => "ϵᵣ₁",
"ϵ[4]" => "ϵᵣ₂",
)
chain_te4 = Chains(pt_tet4)
chain_te4 = replacenames(chain_te4, name_map)
A summary of the MCMC chain is provided below.
Chains MCMC chain (1024×9×1 Array{Float64, 3}):
Iterations = 1:1:1024
Number of chains = 1
Samples per chain = 1024
parameters = pᵣᵣ, pᵣₛ, pₛᵣ, pₛₛ, ϵₛ₁, ϵₛ₂, ϵᵣ₁, ϵᵣ₂
internals = log_density
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
pᵣᵣ 0.5768 0.0824 0.0039 449.2267 570.8385 1.0027 missing
pᵣₛ 0.1820 0.0662 0.0033 391.3331 750.0271 1.0000 missing
pₛᵣ 0.1787 0.0579 0.0026 523.1174 740.9073 1.0002 missing
pₛₛ 0.0625 0.0297 0.0012 618.7097 755.8075 0.9995 missing
ϵₛ₁ 0.0517 0.0314 0.0014 529.0317 866.9172 0.9995 missing
ϵₛ₂ 0.0571 0.0342 0.0017 418.1905 657.3602 1.0010 missing
ϵᵣ₁ 0.1995 0.1077 0.0048 514.3591 868.8844 1.0006 missing
ϵᵣ₂ 0.2706 0.1235 0.0065 372.7194 763.7186 1.0005 missing
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
pᵣᵣ 0.4293 0.5165 0.5762 0.6383 0.7335
pᵣₛ 0.0712 0.1317 0.1765 0.2275 0.3180
pₛᵣ 0.0820 0.1335 0.1760 0.2202 0.2987
pₛₛ 0.0179 0.0411 0.0576 0.0808 0.1294
ϵₛ₁ 0.0033 0.0245 0.0511 0.0754 0.1104
ϵₛ₂ 0.0033 0.0282 0.0566 0.0840 0.1238
ϵᵣ₁ 0.0162 0.1078 0.2056 0.2830 0.3887
ϵᵣ₂ 0.0232 0.1705 0.2894 0.3699 0.4639
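The chain can also be inspected visually. Assuming StatsPlots is installed, the standard MCMCChains plot recipe shows trace plots and marginal posterior densities:
using StatsPlots

plot(chain_te4)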
TET1
As we did above, we will estimate the marginal log likelihood by passing the output of tet1_model to the pigeons function.
pt_tet1 = pigeons(target=TuringLogPotential(tet1_model(data)), record=[traces])
────────────────────────────────────────────────────────────────────────────
scans Λ log(Z₁/Z₀) min(α) mean(α) min(αₑ) mean(αₑ)
────────── ────────── ────────── ────────── ────────── ────────── ──────────
2 3.18 -69.3 1.04e-16 0.647 1 1
4 2.11 -41.9 0.00298 0.766 1 1
8 3.41 -39.2 0.226 0.621 1 1
16 2.96 -38.6 0.364 0.671 1 1
32 3.71 -37.6 0.459 0.588 1 1
64 3.55 -38.3 0.505 0.605 1 1
128 3.42 -38 0.487 0.62 1 1
256 3.48 -38.1 0.556 0.613 1 1
512 3.28 -37.7 0.593 0.635 1 1
1.02e+03 3.41 -38 0.578 0.621 1 1
────────────────────────────────────────────────────────────────────────────
In the code block below, we will rename the parameters and convert the output to a Chains object.
name_map = Dict(
"p[1]" => "pᵣᵣ",
"p[2]" => "pᵣₛ",
"p[3]" => "pₛᵣ",
"p[4]" => "pₛₛ",
)
chain_te1 = Chains(pt_tet1)
chain_te1 = replacenames(chain_te1, name_map)
The output below provides a summary of the chain.
Chains MCMC chain (1024×6×1 Array{Float64, 3}):
Iterations = 1:1:1024
Number of chains = 1
Samples per chain = 1024
parameters = pᵣᵣ, pᵣₛ, pₛᵣ, pₛₛ, ϵ
internals = log_density
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
pᵣᵣ 0.7077 0.0351 0.0011 1106.2442 1055.5828 1.0000 missing
pᵣₛ 0.1178 0.0261 0.0009 908.2753 965.3276 1.0009 missing
pₛᵣ 0.1408 0.0268 0.0008 1036.7903 867.0191 0.9992 missing
pₛₛ 0.0338 0.0140 0.0005 891.6153 1059.1575 1.0031 missing
ϵ 0.0874 0.0115 0.0004 952.5670 803.6060 0.9995 missing
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
pᵣᵣ 0.6364 0.6843 0.7065 0.7329 0.7712
pᵣₛ 0.0717 0.0993 0.1165 0.1344 0.1746
pₛᵣ 0.0923 0.1213 0.1402 0.1589 0.1964
pₛₛ 0.0120 0.0235 0.0318 0.0422 0.0662
ϵ 0.0668 0.0795 0.0867 0.0947 0.1114
Extract Marginal Log Likelihood
In the following code block, the stepping_stone function extracts the marginal log likelihood for each model. Because Pigeons uses the prior as the reference distribution for Turing targets, the log normalization ratio log(Z₁/Z₀) reported above corresponds to the marginal log likelihood:
mll_tet1 = stepping_stone(pt_tet1)
mll_tet4 = stepping_stone(pt_tet4)
Compute the Bayes Factor
The Bayes factor is obtained by exponentiating the difference between the marginal log likelihoods. Recall that TET1 was the data-generating model. As expected, the value of 3.39 indicates that the data are 3.39 times more likely under the data-generating model, TET1, than under TET4.
bf = exp(mll_tet1 - mll_tet4)
3.3948019100884617
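If both models are considered equally plausible a priori (prior odds of 1), the Bayes factor can be converted into a posterior model probability, as sketched below.
# posterior probability of TET1 under equal prior odds
prob_tet1 = bf / (1 + bf)    # ≈ 0.77 given bf ≈ 3.39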
References
Birnbaum, M. H., & Quispe-Torreblanca, E. G. (2018). TEMAP2.R: True and error model analysis program in R. Judgment and Decision Making, 13(5), 428-440.
Lee, M. D. (2018). Bayesian methods for analyzing true-and-error models. Judgment and Decision Making, 13(6), 622-635.
Syed, S., Bouchard-Côté, A., Deligiannidis, G., & Doucet, A. (2022). Non-reversible parallel tempering: a scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2), 321-350.