class: center, middle, inverse, title-slide # Comparing models ### Math 315: Bayesian Statistics --- # Example .large[ **Goal:** investigate the association between smoking and lung capacity using data from 345 adolescents between the ages of 10 and 19. **Wrinkle:** Lung function is expected to increase during adolescence, but smoking may slow it's progression Variable | Description :---------|:--------------- `FEV` | forced expiratory volume (in liters per second) `Age` | age in years `Smoke` | `Smoker` or `Nonsmoker` ] .footnote[ Source: Rosner (2006) ] --- class: clear, middle <img src="22-model-comparison_files/figure-html/unnamed-chunk-1-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Candidate models .large[ Model | Mean function :------:|:---------------- 1 | `\(\mu_i = \beta_0 + \beta_1 {\tt age}_i\)` 2 | `\(\mu_i = \beta_0 + \beta_1 {\tt smoke}_i\)` 3 | `\(\mu_i = \beta_0 + \beta_1 {\tt age}_i + \beta_2 {\tt age}_i\)` 4 | `\(\mu_i = \beta_0 + \beta_1 {\tt age}_i + \beta_2 {\tt age}_i + \beta_2 {\tt age}_i \times {\tt smoke}_i\)` ] --- # k-fold cross validation .large[ 1. split the data into `\(k\)` equally-sized groups 2. set aside group `\(k\)` as a test set; <br> fit model to the other `\(k-1\)` groups 3. make predictions for group `\(k\)` using the fitted model 4. repeat steps 2-3 `\(k\)` times (i.e. `\(k\)` "folds"), each time holding out a different group 5. calculate performance metrics from each validation set 6. average each metric over the `\(k\)` folds to come up with a single estimate of that metric ] --- # k-fold cross validation <img src="figs/k-folds.png" width="1613" /> .footnote[Image courtesy of Dennis Sun] --- # k-fold cross validation .large[ - Usually `\(k=5\)` or 10 - `\(k=n\)` is leave-one-out cross validation - You can use either the posterior mean or median for the predicted value, `\(\widehat{Y}_i\)` - Predictive performance metrics: + Bias + Mean squared error (MSE) + Mean absolute deviation (MAD) + Coverage of prediction intervals (PIs) + Average width of PIs - Overall fit: + Log score ] --- # k-fold cross validation .large[ | | m1| m2| m3| m4| |:-----|-----:|------:|-----:|-----:| |bias | 0.000| -1.950| 0.001| 0.006| |mse | 0.468| 4.415| 0.463| 0.458| |mad | 0.538| 1.950| 0.542| 0.535| |cov | 0.229| 0.212| 0.930| 0.930| |width | 2.645| 2.645| 2.635| 2.594| <br> .content-box-yellow[ Assumption: the "true" model likely produces better out-of-sample predictions than competing models ] ] --- # k-fold cross validation .large[ **Advantages:** Simple, intuitive, and broadly applicable **Disadvantages:** Slow because it requires several model fits and it is hard to say a difference is statistically significant ] --- # Information criteria .large[ - Several information criteria have been proposed that do not require fitting the model several times - All criteria attempt to approximate the out-of-sample predictive accuracy in some way - Many are functions of the deviance + Bayesian information criteria (BIC) + Deviance information criteria (DIC) + Widely applicable information criteria (WAIC) ] --- # Kullback-Leibler Divergence .large[ Target distribution: `\(p(x)\)` Approximation: `\(q(x)\)` Let `\(p(x)\)` and `\(q(x)\)` be two PMFs, then `$$D_{KL}(p, q) = \sum_i p_i \left[ \log p_i - \log q_i \right].$$` If `\(p(x)\)` and `\(q(x)\)` are PDFs, then `$$D_{KL}(p, q) = \int_{- \infty}^\infty p_i \left[ \log p_i - \log q_i \right]dx.$$` .content-box-yellow[Measure of how much information is lost due to the approx] ] --- # Kullback-Leibler Divergence .large[ But we don't know `\(p(x)\)`, the true distribution! This isn't necessary for model comparisons Let `\(q(x)\)` and `\(r(x)\)` denote two competing models (distributions). ] `\begin{align*} &D_{KL}(p, q) - D_{KL}(p, r)\\ &=\sum_i p(x_i) \left[ \log p(x_i) - \log q(x_i) \right] - \sum_i p(x_i) \left[ \log p(x_i) - \log r(x_i) \right]\\ &= {\rm E}\left[ \log q(x_i) \right] - {\rm E}\left[ \log r(x_i) \right] \end{align*}` .large[ Measures the difference in average log probabilities ] --- # Deviance .large[ `$$D(q) = -2 \sum_i \log(q_i) = -2 \times \text{log likelihood}$$` The relative K-L divergence can be expressed in terms of deviance: `\begin{align*} D_{KL}(p, q) - D_{KL}(p, r) &= \frac{1}{n} \sum_i \log(q_i) - \frac{1}{n}\sum_i \log(r_i)\\ &= - \frac{1}{2n} \left[ D(q) - D(r) \right] \end{align*}` Thus, we can compare deviances rather than K-L divergences ] --- # BIC .large[ `$${\rm BIC} = \underset{\text{within sample deviance}}{\underbrace{-2 \log \left[ f(Y | \widehat{\theta}_{ML}) \right]}} + \underset{\text{penalty term}}{\underbrace{p\log(n)}}$$` <br> Only reliable when - Using flat priors or priors are overwhelmed by the likelihood - Posterior distribution is approximately multivariate normal - `\(n >> p\)` Can't account for informative priors or more complex models ] --- # DIC .large[ `\({\rm DIC} = \underset{\text{expected deviance}}{\underbrace{ \mathrm{E} \left[ -2 \log\left( f(Y| \theta) \right) \right]}} + p_D\)` - `\(p_D = {\rm E} \left[ -2 \log\left( f(Y| \theta) \right) \right] - \left[ -2 \log\left( f(Y| \widehat{\theta}) \right) \right]\)`, `\(\widehat{\theta} ={\rm E}(\theta | Y)\)` - `\(p_D =\)` effective number of parameters Computational formula: - `\({\rm DIC} = \overline{D} + (\overline{D} - \widehat{D}) = 2\overline{D} - \widehat{D}\)` - deviance has a posterior distribution, it's a function of `\(\theta\)` ] --- # DIC .large[ - Models with small `\(\overline{D}\)` fit the data well - Models with small `\(p_D\)` are simple - We prefer models that are simple and fit well `\(\Rightarrow\)` smallest DIC - Effective number of parameters + If you have uninformative priors, `\(p_D \approx p\)` + With strong priors `\(p_D << p\)` ] --- # DIC .large[ - Rule of thumb: a difference of DIC of less than 5 is not definitive and a difference greater than 10 is substantial - `dic.samples()` function in **coda** R package - DIC can only be used to compare models *with the same likelihood* - Still requires + Posterior distribution is approximately MVN + `\(n >> p\)` ] --- # WAIC .large[ - Widely application IC (or Watanabe-Akaike IC) - BIC and DIC condition on point estimates of `\(\theta\)` - A better idea is be fully Bayesian and average over the posterior! - Let `\(m_i\)` be the posterior mean of `\(f(Y_i | \theta)\)` - Let `\(v_i\)` be the posterior variance of `\(\log[f(Y_i | \theta)]\)` - The effective model size is `\(p_W = \sum v_i\)` `$${\rm WAIC} =-2 \sum_{i=1}^n m_i + 2p_W$$` ] --- # WAIC .large[ - Smaller WAIC is preferred - DIC and WAIC often give similar results, but WAIC is arguably more theoretically justified - Motivated as an approximation to leave-one-out CV - Not implemented in **coda**, but reasonable to compute "manually" ] --- ```r m4 <- "model{ for(i in 1:n){ Y[i] ~ dnorm(mu[i], tau) mu[i] <- inprod(X[i,], beta[]) * like[i] <- dnorm(Y[i], mu[i], tau) } for(j in 1:4) { beta[j] ~ dnorm(0,0.001) } tau ~ dgamma(0.1, 0.1) }" d4 <- list(Y = Y, X = X, n = nrow(X)) model4 <- jags.model(textConnection(m4), data = d4, quiet = TRUE) update(model4, 1000, progress.bar="none") draws4 <- coda.samples(model4, variable.names = "like", n.iter = 20000, progress.bar = "none") # Compute WAIC *like <- draws4[[1]] *fbar <- colMeans(like) *pw <- sum(apply(log(like), 2, var)) *waic <- -2 * sum(log(fbar)) + 2 * pw ``` --- # Example .large[ Model | Variables | WAIC ------|-----------|------- 1 | `age` | 712.8 2 | `smoke` |713 3 | `age`, `smoke` |712.5 4 | `age`, `smoke`, `age:smoke` |703.4 <br> .content-box-yellow[Which model is preferred?] ]