MCMC diagnostics

class: center, middle, inverse, title-slide

# MCMC diagnostics
### Math 315: Bayesian Statistics

---

# Tuning MCMC

.large[
Three main decisions:

- Selecting the initial values

- Determining if/when the chain(s) has converged

- Selecting the number of samples needed to approximate
]

---

# Initial values

.large[
- The algorithm will eventually converge no matter what
initial values you select

- Choosing good starting values will speed up convergence

- It is important to try a few initial values to verify they all give
the same result

- Usually 3-5 separate chains is sufficient

.content-box-yellow[
**Common strategies**:

1. Select good initial values using method of
moments or MLE

2. Purposely pick bad but different initial values for
each chain to check convergence
]
]

---

# Convergence diagnostics

.large[

.content-box-yellow[
Don't assume that your Markov chain converged to the desired stationary distribution just because "it worked."
]

A **"good" Markov chain** should be:

- **Stationary**: draws should within the posterior (stable path around the center of gravity)

- **Well mixed**:

+ successive draws should not be highly correlated
    
    + each chain should target the same stationary distribution

]

---

# Convergence diagnostics
.large[
The **coda** R package has dozens of diagnostics

```r
library(coda)
```

Did my chains converge?

- Geweke

- Gelman-Rubin

Did I run the sampler long enough after convergence?

- Effective sample size

- Standard errors for the posterior mean estimate
]

---
class: inverse

# Your turn

.large[.bold[With a neighbor, decide whether each chain is well mixed]]

---
class: inverse

# Your turn

.large[.bold[With a neighbor, decide whether each chain converged]]

---

# Geweke diagnostic

.large[
- **Idea**: a two-sample test comparing the mean of a chain between batches at the beginning versus the end

- By default, JAGS compares the first 10% with the last 50%

- Test statistic is a Z-score with the standard errors
adjusted for autocorrelation
]

---

# Geweke diagnostic

.pull-left[

```r
geweke.diag(bad_samples[[1]])
```

```
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##  mu[1]  mu[2] 
##  1.230 -1.194
```

]

.pull-right[

```r
geweke.diag(good_samples[[1]])
```

```
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##   mu[1]   mu[2] 
## -0.5572  2.5130
```

]

---

# Gelman-Rubin diagnostic
  
.large[
- Effective convergence of Markov chain simulation has
been reached when inferences for quantities of interest do not depend
on the starting point of the simulations.

- If we run multiple chains, we hope that all chains give same result

- Gelman and Rubin (1992) proposed (essentially) an ANOVA test of whether the chains have the same mean

- `$R_j$` is scaled and approaches 1 from above

+ `$R_j = 1 \Rightarrow$` perfect convergence

+ `$R_j \ge 1.1 \Rightarrow$` red flag
]

---

# Gelman-Rubin diagnostic

.pull-left[
<img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-7-1.svg" width="85%" />

```r
gelman.diag(mu1_bad)
```

```
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## [1,]       1.14       1.43
```

]

.pull-right[

```r
gelman.diag(mu1_good)
```

```
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## [1,]          1          1
```

]

---

# Autocorrelation

.large[
**Definition:**  the correlation between samples `$h$` iterations apart

`$$\rho(h) = {\rm Corr}(\theta_j^{(s)}, \theta_j^{(s-h)})$$`

**So what?**<br> A chain with high correlation is said to exhibit **poor mixing**

]

---

# Autocorrelation

.pull-left[
<img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-11-1.svg" width="80%" />

]

.pull-right[
<img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-12-1.svg" width="80%" />
]

---

# Autocorrelation

.large[
- Lower values are better, but if the chains are long enough
even large values can be OK

- **Thinning** the Markov chain means keeping only every kth draw, where k is chosen so that the autocorrelation is small

- `thin` argument to `coda.samples()` implements this

- This is always less efficient than using all samples, but can
save memory
]

---

# Effective sample size

.large[
- Highly correlated samples have less information than
independent samples

- `$S=$` \# MCMC samples after burn in

- **Effective samples size (ESS)**

`$$ESS_k = S \Big/ \left(1 + 2 \displaystyle \sum_{k=1}^\infty \rho(k)\right)$$`

- ESS = "equivalent number of independent observations"

- Should be at least a few thousand for all parameters
]

---

# Effective sample size

.pull-left[
<img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-13-1.svg" width="90%" />

```r
# ESS for single chain of mu1
# n.iter = 5000
effectiveSize(mu1_bad[[1]])  
```

```
##     var1 
## 5.590956
```

]

.pull-right[
<img src="19-mcmc-diagnostics_files/figure-html/unnamed-chunk-15-1.svg" width="90%" />

```r
# ESS for single chain of mu1
# n.iter = 5000
effectiveSize(mu1_good[[1]])  
```

```
##   var1 
## 2314.2
```
]

---

# Standard errors of posterior mean estimates

.large[
- Assuming independence the standard error is

`$$\text{Naive SE} = \dfrac{s}{\sqrt{S}}$$`

where `$s =$` sample SD
    
- A more realistic standard error is

`$$\text{Time-series SE} = \dfrac{s}{\sqrt{ESS}}$$`

-  If the SE is too large, rerun the MCMC algorithm for a larger number of samples
]

---

# Standard errors of posterior mean estimates

```r
summary(bad_samples)  
```

```
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean   SD Naive SE Time-series SE
## mu[1] -4.277 24.9   0.2033          6.092
## mu[2]  3.710 24.9   0.2033          6.011
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%    50%   75% 97.5%
## mu[1] -54.28 -21.28 -4.701 13.19 42.02
## mu[2] -42.53 -13.77  4.013 20.86 53.84
```

---

# Standard errors of posterior mean estimates

```r
summary(good_samples)    
```

```
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean     SD Naive SE Time-series SE
## mu[1] -0.5664 1.2521 0.010224       0.015244
## mu[2]  2.2494 0.3239 0.002644       0.003461
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%     50%   75% 97.5%
## mu[1] -3.539 -1.234 -0.3642 0.320 1.292
## mu[2]  1.572  2.040  2.2682 2.473 2.839
```

---

# What to do if the chains haven't converged?

.large[
- Increase the number of iterations

- Tune the M-H candidate distribution

- Use better initial values

- Use a more advanced algorithm

- Simplify/reparameterize the model

- Use (more) informative priors

- Run a simulation study

]

---

# What to do for massive data sets?

.large[
- MAP estimation

- Bayesian CLT

- Variational Bayesian computing: Approximates the posterior by assuming the posterior is independent across parameters (fast, but questionable statistical properties)

-  Parallel computing: MCMC is inherently sequential, but often some steps can be done in parallel, e.g., onerous likelihood computations

- Divide and Conquer: Split the data into batches and
analysis them in parallel, and then carefully combine the
result of the batch analyses
]