Bayesian regression: Choosing priors

class: center, middle, inverse, title-slide

# Bayesian regression: Choosing priors
### Math 315: Bayesian Statistics

---

class: middle, center, inverse

# Informative priors

---

# Example: Modeling lung capacity

.large[

**Goal:** investigate the association between smoking and lung capacity using data from 345 adolescents between the ages of 10 and 19.

**Wrinkle:** Lung function is expected to increase during adolescence, but smoking may slow it's progression

Variable | Description
:---------|:---------------
`FEV`    | forced expiratory volume (in liters per second)
`Age`    | age in years
`Smoke`  | `Smoker` or `Nonsmoker`

]

.footnote[
Source: Rosner (2006) 
]

---

# Example: Modeling lung capacity

.large[
Based on domain expertise, the following regression model was proposed:

`\begin{align*}
Y_i | X_i &\sim {\rm MVN}(\mu_i, {\bf I}_n 1/\tau)\\
\mu_i &= \beta_0 + \beta_1 {\tt Age} + \beta_2 {\tt Smoke} + \beta_3 {\tt Age \times Smoke}
\end{align*}`

.content-box-yellow[
How can we set informative priors on `$\beta_j$` and `$\tau$`?
]
]

---

# Elicitation for regression coefficients

.large[
Let `$p$` the number of regression coefficients.

Note: `$\widehat{\mu_i} =\tilde{X}\beta$` and `$\tilde{X}\beta \sim {\rm MVN}(\tilde{Y}, D(\tilde{w}))$`, where `$D(\tilde{w})$` is a diagonal matrix.

1. Pick `$p$` combinations of your predictors that your domain expert can assess independently.
    - Ask about average response for each setting
    - Ask to assess probability that the mean is less than some value for each group

1. Setting `$\tilde{Y}$`, `$\tilde{X}$`, `$D(\tilde{w}))$` we can induce a prior of `$\beta$`:

`$$\beta \sim {\rm MVN} (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w})\tilde{X})$$`
]

---

# Prior for regression coefficients

.large[

`$\mu_i = \beta_0 + \beta_1 {\tt Age} + \beta_2 {\tt Smoke} + \beta_3 {\tt Age \times Smoke}$`

We have need priors for `$\beta_0, \beta_1, \beta_2, \beta_3$`

We need elicit information about four combinations of `Age` and `Smoke`

- 11-year-old smokers

- 13-year-old smokers

- 16-year-old nonsmokers

- 18-year-old smokers

.content-box-yellow[
How can we set up the matrix `$\tilde{X}$`?
]

]

---

# Prior for regression coefficients

.large[

`$\mu_i = \beta_0 + \beta_1 {\tt Age} + \beta_2 {\tt Smoke} + \beta_3 {\tt Age \times Smoke}$`

.pull-left[
- 11-year-old smokers

- 13-year-old smokers

- 16-year-old nonsmokers

- 18-year-old smokers

]

.pull-right[

`$$\tilde{\boldsymbol{X}} = \begin{bmatrix} 1& 11 & 0& 0\\
 1& 13& 1& 13\\
 1& 16& 0& 0\\
 1& 18& 1& 18
\end{bmatrix}$$`

]
]

---

# Prior for regression coefficients

.large[
The domain expert...

- expects average FEV among all 18-year-old smokers to be 3.3

- is 99% sure that the mean FEV is less than 4.0 in this group

Linking elicitation to `$\tilde{\mu}_i$` and `$\tilde{w}_i$`

- `$\tilde{\mu}_i = 3.3$`

- Assuming a normal prior, solve `$4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_i}$` 
`$\Rightarrow \tilde{w}_i = (4-3.3)^2 / 2.33^2 = 0.09$`
]

---

# Prior for regression coefficients

.large[
Similar steps are used to elicit information about the other three groups.

.pull-left[
So we have elicited...
]

.pull-right[
`\begin{align*}
\tilde{\mu}_1 &\sim \mathcal{N}(2.8, 0.04)\\
\tilde{\mu}_2 &\sim \mathcal{N}(3.0, 0.04)\\
\tilde{\mu}_3 &\sim \mathcal{N}(4.0, 0.04)\\
\tilde{\mu}_4 &\sim \mathcal{N}(3.3, 0.09)
\end{align*}`
]

Which implies...
]

`$$\tilde{\boldsymbol{X}} \boldsymbol{\beta} =  \begin{bmatrix} \tilde{\mu}_1\\
  \tilde{\mu}_2\\
  \tilde{\mu}_3\\
  \tilde{\mu}_4
  \end{bmatrix}
  \sim {\rm MVN} \left( 
  \begin{bmatrix} 2.8\\
    3.0\\
    4.0\\
    3.3
    \end{bmatrix}, 
    \begin{bmatrix} 0.04 & 0 & 0 & 0\\
    0 & 0.04 & 0 & 0\\
    0 & 0 & 0.04 & 0\\
    0 & 0 & 0 & 0.09
    \end{bmatrix}
  \right)$$`

---

# Prior for regression coefficients

.large[Which implies...]

`$$\boldsymbol{\beta} 
  \sim {\rm MVN} \left( 
  \tilde{\boldsymbol{X}}^{-1} \begin{bmatrix} 2.8\\
    3.0\\
    4.0\\
    3.3
    \end{bmatrix}, 
    \tilde{\boldsymbol{X}}^{-1} \begin{bmatrix} 0.04 & 0 & 0 & 0\\
    0 & 0.04 & 0 & 0\\
    0 & 0 & 0.04 & 0\\
    0 & 0 & 0 & 0.09
    \end{bmatrix} \left(\tilde{\boldsymbol{X}}^{-1} \right)^\prime
  \right)$$`

---

```r
mu <- c(2.8, 3, 4, 3.3)
X.tilde <- matrix(c(1, 11, 0, 0,
 1, 13, 1, 13,
 1, 16, 0, 0,
 1, 18, 1, 18) ,ncol = 4, byrow = TRUE)
beta.mean <- solve(X.tilde) %*% mu
beta.var <- solve(X.tilde) %*% diag(c(0.04, 0.04, 0.04, 0.09)) %*% 
 t(solve(X.tilde))

beta.mean # prior mean vector
```

```
##       [,1]
## [1,]  0.16
## [2,]  0.24
## [3,]  2.06
## [4,] -0.18
```

```r
beta.var  # prior precision matrix
```

```
##         [,1]    [,2]    [,3]    [,4]
## [1,]  0.6032 -0.0432 -0.6032  0.0432
## [2,] -0.0432  0.0032  0.0432 -0.0032
## [3,] -0.6032  0.0432  1.7300 -0.1188
## [4,]  0.0432 -0.0032 -0.1188  0.0084
```

.footnotesize[
JAGS note: Use `dmnorm` to specify the prior with a mean vector and precision matrix.
]

---

# Prior for precision

.large[
To elicit information on `$\tau$`, the researchers asked the domain expert about 18-year-old smokers

1. What do they expect the average FEV to be? [mean]

1. What's the largest observation they would expect? [95th percentile]

1. How certain are they that the largest observation is above 4? [uncertainty for 95th percentile]

Results:

1. `$\tilde{\mu}_0 = 3.3$`

1. Largest observation is 5

1. 95% certain
]

---

# Prior for precision

.large[
Notice that the elicited information is about the response variable, which is assumed to be normal:

- `$\tilde{\mu} = 3.3$`

- `$5 = 3.3 + 1.645 \sigma_0$` `$\Longrightarrow \sigma_0 = 1.7/1.645$` `$\Longrightarrow \tau_0 = (1.645 / 1.7)^2 = 0.94$`

Then, let `$\tau_0$` be the mode of the prior distribution:

$$
0.94 = \frac{a-1}{b} \Longrightarrow a = 0.94b + 1
$$
]

.footnotesize[
You could also let `$\tau_0$` be the mean or median of the prior distribution.
]

---

# Prior for precision
.large[
Domain expert is 95% certain that the 95th percentile is greater than 4:

`\begin{align*}
0.95 &= P(\mu + 1.645 \sigma > 4 | \mu = 3.3)\\
&= P(\mu + 1.645 \sigma > 4 | \mu = 3.3)\\
&= P( 1.645 \sigma > 0.7)\\
&= P( \sigma > 0.7/ 1.645)\\
&= P( \tau < (1.645/0.7)^2)\\
&= P(\tau < 5.52)
\end{align*}`

So `pgamma(5.52, a, b)` should be 0.95

Now, solve the two-equation, two-unknown problem

]

---

# Dealing with partial prior information

.large[
Suppose the model was actually

`$\mu_i = \beta_0 + \beta_1 {\tt Age} + \beta_2 {\tt Smoke} + \beta_3 {\tt Age \times Smoke} + \beta_4 {\tt height} + \beta_5 {\tt sex}$`

but we couldn't elicit anymore information from the expert

- standardize `height`

- elicit information as before, for subjects in each group of average height and in the baseline `sex` category

- place reference priors on `$\beta_4$` and `$\beta_5$`

`$$\beta_4, \beta_5 \sim \mathcal{N}(0, 10^6)$$`
]

---
class: middle, center, inverse

# Regularizing priors

---

## Example: Diabetes progression

.large[
**Goal:** Model disease progression one year after baseline based on data from 422 patients
]

Variable | Description
---------|---------------------------
`AGE`    | patient's age
`SEX`    | patient's sex
`BMI`    | patient's body mass index
`S1`     | blood serum measurement 1
`S2`     | blood serum measurement 2
`S3`     | blood serum measurement 3
`S4`     | blood serum measurement 4
`S5`     | blood serum measurement 5
`S6`     | blood serum measurement 6
`Y`      | disease progression one year after baseline

.footnotesize[
*Source:* Efron et al (2004). Least angle regression. *Annals of Statistics*: 32(2)
]

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

---
class: inverse

# Your turn

.Large[
Proposed model has all predictors in it:

`\begin{align*}
\mu_i =& \beta_0 + \beta_1 {\tt AGE} + \beta_2 {\tt SEX} + \beta_3 {\tt BMI} +\\ 
&\beta_4 {\tt S1} + \beta_5 {\tt S2} + \beta_6 {\tt S3} + \beta_7 {\tt S4} + \beta_8 {\tt S5}+ \beta_9 {\tt S6}
\end{align*}`

Turn to a neighbor and review the following ideas:

1. What would it mean for this model to be overfit?

1. What would it mean for this model to be underfit?
]

---

# Regularization

.large[
- If domain expertise isn't guiding your model building, then .bold[you must be skeptical of the sample!]

- Use informative, conservative priors to reduce overfitting 
`$\Rightarrow$` model learns less from the sample

- If your prior is too skeptical, the model will learn too little

- Such priors are called .bold[regularizing]
]

---

# Regularization

.pull-left[
.large[
`\begin{align*}
Y_i &\sim \mathcal{N}(x_i \boldsymbol{\beta}, \sigma^2)\\
\beta_0 &\sim \mathcal{N}(0, 1/10^6)\\
\beta_j &\sim \mathcal{N}(0, \sigma^2_0)\\
\tau &\sim {\rm InvGamma}(0.001, 0.001)\\
\end{align*}`

.content-box-yellow[
Only regularize slope terms
]
]
]

.pull-right[
<img src="21-blr-priors_files/figure-html/unnamed-chunk-3-1.svg" width="100%" />
]

---

.left-column[
## N(0, 1)
]

---

.left-column[
## N(0, 0.25)
]

---

.left-column[
## N(0, 0.04)
]

---

# Univariate Gaussian priors

.large[
In a fully Bayesian analysis, we can allow the data to determine how much to shrink the coefficients towards zero
`\begin{align*}
\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2 &\sim {\rm MVN}\left(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2  \boldsymbol{I}_n \right)\\
\beta_0 & \sim \mathcal{N}(0, 10^6)\\
\beta_j | \sigma^2 &\overset{\rm iid}{\sim} \mathcal{N}(0, \sigma^2 \tau^2)\\
\sigma^2 & \sim{\rm InverseGamma}(c, c)\\
\tau^2 & \sim {\rm InverseGamma}(c, c)
\end{align*}`

where `$c$` is a small, say `$c=0.1$` or `$c=0.01$`
]

---

# Univariate Gaussian priors

.large[
- The posterior mode is given by

`$$\underset{\boldsymbol{\beta}}{\rm argmin} \sum_{i=1}^n \left( Y_i - \sum_{j=1}^p X_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2$$`

where `$\lambda = 1/\tau^2$`
    
- In classical statistics, this is known as the ridge regression
solution and is used to stabilize the least squares solution
]

---

# BLASSO

.large[
- An increasingly-popular prior is the double exponential or
Bayesian LASSO prior

- `$\beta_j \sim {\rm DE}(\tau)$` which has PDF

`$$f(\beta) \propto \exp \left( - \frac{|\beta|}{\tau} \right)$$`

- The square in the Gaussian prior is replaced with an
absolute value

- The shape of the PDF is thus more peaked at zero (next
slide)

- The BLASSO prior favors settings where there are many `$\beta_j$`
near zero and a few large `$\beta_j$`

- That is, `$p$` is large but most of the covariates are noise
]

---

# BLASSO

.large[
In a fully Bayesian analysis, we can allow the data to determine how much to shrink the coefficients towards zero
`\begin{align*}
\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2 &\sim {\rm MVN}\left(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2  \boldsymbol{I}_n \right)\\
\beta_0 & \sim \mathcal{N}(0, 10^6)\\
\beta_j | \sigma^2 &\overset{\rm iid}{\sim} {\rm DE}(\sigma^2 \tau^2)\\
\sigma^2 & \sim{\rm InverseGamma}(c, c)\\
\tau^2 & \sim {\rm InverseGamma}(c, c)
\end{align*}`

where `$c$` is a small, say `$c=0.1$` or `$c=0.01$`
]

---

# BLASSO

---

# BLASSO

.large[
- The posterior mode is given by

`$$\underset{\boldsymbol{\beta}}{\rm argmin} \sum_{i=1}^n \left( Y_i - \sum_{j=1}^p X_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p |\beta_j|$$`

where `$\lambda = 1/\tau^2$`
    
- In classical statistics, this is known as the LASSO solution

- It is popular because it adds stability by shrinking estimates
towards zero, and also sets some coefficients to zero

- Covariates with coefficients set to zero can be removed

- Therefore, LASSO performs variables selection and
estimation simultaneously
]

---

.left-column[
## Ridge
]

---

.left-column[
## BLASSO
]