class: center, middle, inverse, title-slide # MCMC diagnostics ### Math 315: Bayesian Statistics --- # Tuning MCMC .large[ Three main decisions: - Selecting the initial values - Determining if/when the chain(s) has converged - Selecting the number of samples needed to approximate ] --- # Initial values .large[ - The algorithm will eventually converge no matter what initial values you select - Choosing good starting values will speed up convergence - It is important to try a few initial values to verify they all give the same result - Usually 3-5 separate chains is sufficient .content-box-yellow[ **Common strategies**: 1. Select good initial values using method of moments or MLE 2. Purposely pick bad but different initial values for each chain to check convergence ] ] --- # Convergence diagnostics .large[ .content-box-yellow[ Don't assume that your Markov chain converged to the desired stationary distribution just because "it worked." ] A **"good" Markov chain** should be: - **Stationary**: draws should within the posterior (stable path around the center of gravity) - **Well mixed**: + successive draws should not be highly correlated + each chain should target the same stationary distribution ] --- # Convergence diagnostics .large[ The **coda** R package has dozens of diagnostics ```r library(coda) ``` Did my chains converge? - Geweke - Gelman-Rubin Did I run the sampler long enough after convergence? - Effective sample size - Standard errors for the posterior mean estimate ] --- class: inverse # Your turn .large[.bold[With a neighbor, decide whether each chain is well mixed]] <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-2-1.svg" width="100%" /> --- class: inverse # Your turn .large[.bold[With a neighbor, decide whether each chain converged]] <img src="19-mcmc-diagnostics_files/figure-html/yourturn-chain1-1.svg" width="100%" /> --- # Geweke diagnostic .large[ - **Idea**: a two-sample test comparing the mean of a chain between batches at the beginning versus the end - By default, JAGS compares the first 10% with the last 50% - Test statistic is a Z-score with the standard errors adjusted for autocorrelation ] <img src="19-mcmc-diagnostics_files/figure-html/yourturn-chain1-1.svg" width="95%" /> --- # Geweke diagnostic .pull-left[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-3-1.svg" width="85%" /> ```r geweke.diag(bad_samples[[1]]) ``` ``` ## ## Fraction in 1st window = 0.1 ## Fraction in 2nd window = 0.5 ## ## mu[1] mu[2] ## 1.230 -1.194 ``` ] .pull-right[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-5-1.svg" width="85%" /> ```r geweke.diag(good_samples[[1]]) ``` ``` ## ## Fraction in 1st window = 0.1 ## Fraction in 2nd window = 0.5 ## ## mu[1] mu[2] ## -0.5572 2.5130 ``` ] --- # Gelman-Rubin diagnostic .large[ - Effective convergence of Markov chain simulation has been reached when inferences for quantities of interest do not depend on the starting point of the simulations. - If we run multiple chains, we hope that all chains give same result - Gelman and Rubin (1992) proposed (essentially) an ANOVA test of whether the chains have the same mean - `\(R_j\)` is scaled and approaches 1 from above + `\(R_j = 1 \Rightarrow\)` perfect convergence + `\(R_j \ge 1.1 \Rightarrow\)` red flag ] --- # Gelman-Rubin diagnostic .pull-left[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-7-1.svg" width="85%" /> ```r gelman.diag(mu1_bad) ``` ``` ## Potential scale reduction factors: ## ## Point est. Upper C.I. ## [1,] 1.14 1.43 ``` ] .pull-right[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-9-1.svg" width="85%" /> ```r gelman.diag(mu1_good) ``` ``` ## Potential scale reduction factors: ## ## Point est. Upper C.I. ## [1,] 1 1 ``` ] --- # Autocorrelation .large[ **Definition:** the correlation between samples `\(h\)` iterations apart `$$\rho(h) = {\rm Corr}(\theta_j^{(s)}, \theta_j^{(s-h)})$$` **So what?**<br> A chain with high correlation is said to exhibit **poor mixing** ] --- # Autocorrelation .pull-left[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-11-1.svg" width="80%" /> ] .pull-right[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-12-1.svg" width="80%" /> ] --- # Autocorrelation .large[ - Lower values are better, but if the chains are long enough even large values can be OK - **Thinning** the Markov chain means keeping only every kth draw, where k is chosen so that the autocorrelation is small - `thin` argument to `coda.samples()` implements this - This is always less efficient than using all samples, but can save memory ] --- # Effective sample size .large[ - Highly correlated samples have less information than independent samples - `\(S=\)` \# MCMC samples after burn in - **Effective samples size (ESS)** `$$ESS_k = S \Big/ \left(1 + 2 \displaystyle \sum_{k=1}^\infty \rho(k)\right)$$` - ESS = "equivalent number of independent observations" - Should be at least a few thousand for all parameters ] --- # Effective sample size .pull-left[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-13-1.svg" width="90%" /> ```r # ESS for single chain of mu1 # n.iter = 5000 effectiveSize(mu1_bad[[1]]) ``` ``` ## var1 ## 5.590956 ``` ] .pull-right[ <img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-15-1.svg" width="90%" /> ```r # ESS for single chain of mu1 # n.iter = 5000 effectiveSize(mu1_good[[1]]) ``` ``` ## var1 ## 2314.2 ``` ] --- # Standard errors of posterior mean estimates .large[ - Assuming independence the standard error is `$$\text{Naive SE} = \dfrac{s}{\sqrt{S}}$$` where `\(s =\)` sample SD - A more realistic standard error is `$$\text{Time-series SE} = \dfrac{s}{\sqrt{ESS}}$$` - If the SE is too large, rerun the MCMC algorithm for a larger number of samples ] --- # Standard errors of posterior mean estimates ```r summary(bad_samples) ``` ``` ## ## Iterations = 2001:7000 ## Thinning interval = 1 ## Number of chains = 3 ## Sample size per chain = 5000 ## ## 1. Empirical mean and standard deviation for each variable, ## plus standard error of the mean: ## ## Mean SD Naive SE Time-series SE ## mu[1] -4.277 24.9 0.2033 6.092 ## mu[2] 3.710 24.9 0.2033 6.011 ## ## 2. Quantiles for each variable: ## ## 2.5% 25% 50% 75% 97.5% ## mu[1] -54.28 -21.28 -4.701 13.19 42.02 ## mu[2] -42.53 -13.77 4.013 20.86 53.84 ``` --- # Standard errors of posterior mean estimates ```r summary(good_samples) ``` ``` ## ## Iterations = 2001:7000 ## Thinning interval = 1 ## Number of chains = 3 ## Sample size per chain = 5000 ## ## 1. Empirical mean and standard deviation for each variable, ## plus standard error of the mean: ## ## Mean SD Naive SE Time-series SE ## mu[1] -0.5664 1.2521 0.010224 0.015244 ## mu[2] 2.2494 0.3239 0.002644 0.003461 ## ## 2. Quantiles for each variable: ## ## 2.5% 25% 50% 75% 97.5% ## mu[1] -3.539 -1.234 -0.3642 0.320 1.292 ## mu[2] 1.572 2.040 2.2682 2.473 2.839 ``` --- # What to do if the chains haven't converged? .large[ - Increase the number of iterations - Tune the M-H candidate distribution - Use better initial values - Use a more advanced algorithm - Simplify/reparameterize the model - Use (more) informative priors - Run a simulation study ] --- # What to do for massive data sets? .large[ - MAP estimation - Bayesian CLT - Variational Bayesian computing: Approximates the posterior by assuming the posterior is independent across parameters (fast, but questionable statistical properties) - Parallel computing: MCMC is inherently sequential, but often some steps can be done in parallel, e.g., onerous likelihood computations - Divide and Conquer: Split the data into batches and analysis them in parallel, and then carefully combine the result of the batch analyses ]