Specifying conjugate priors

# Specifying conjugate priors
### Math 315: Bayesian Statistics

---

## 1. Poisson-gamma example

## 2. Prior predictive distribution

## 3. Mixture priors

## 4. Natural conjugate priors

---

# Poisson-gamma model for a rate

- Number of virus attacks per day on a computer network

- Number of Ebola cases per day

- Number of diseased trees per square mile in a forest
]

--
.large[
**Target:** estimate of the true rate, `$\lambda$`

* Make observations over time period of length *N* OR make observations over region of area *N*

* Observe `$Y = 0, 1, 2, \ldots$` events

* Obtain the posterior, calculate a 95% interval, test a claim about `$\lambda$` 
]

---

# Poisson-gamma model for a rate

## Assumption

## Likelihood

## Prior

## Posterior

]

`$Y|\lambda \sim \text{Poisson}(N\lambda)$`

`$\lambda > 0$`, so Gamma(*a*, *b*) is a good choice

`$\lambda |Y \sim \text{Gamma}(a + Y, b + N)$`
]

]

---
class: inverse
# Your turn

.Large[
`$$\widehat{\lambda} = E(\lambda | Y) = \dfrac{a + Y}{b+N} = (1-w) \dfrac{a}{b} + w \dfrac{Y}{n}$$`

1. Find an expression for `$w$`.

2. Describe `$Y/N$` and `$a/b$` in words.

3. If we view Gamma(*a*, *b*) as a data augmentation prior, what do *a* and *b* represent?

4. When (in terms of *N*, *a*, and *b*) is  `$\widehat{\lambda}$` close to `$Y/N$`? How does this help you set a "weak" prior?

5. When (in terms of *N*, *a*, and *b*) is  `$\widehat{\lambda}$` shrunk towards `$a/b$`?
]

---
class: inverse
# Your turn

.Large[
Suppose that causes of death in 2018 are reviewed in detail for a city with a population of 200,000. It is found that 3 people died of asthma, giving a crude estimated asthma mortality rate in the city of 1.5 cases per 100,000 people per year.

You propose the Poisson model `$Y|\lambda \sim \text{Poisson}(2 \lambda)$`, where `$\lambda$` denotes the true underlying long-term asthma mortality rate in the city, measured in cases per 100,000 people per year.

Reviews of asthma mortality rates around the world suggest that mortality rates about 1.5 per 100,000 people are rare in Western countries, with typical asthma mortality rates around 0.6 per 100,000.

]

---

# Prior predictive distribution

```r
# Be sure to set S, a, and b
prior_lambdas <- rgamma(S, a, b) # draw lambdas
prior_ys <- rpois(S, 2 * prior_lambdas) # draw ys
```

???

A predictive distribution that integrates over unknown parameters but is not conditional on observed data is called a prior predictive distribution. **Such a distribution can be useful in evaluating if a prior distribution for `$\theta$` actually translates into reasonable prior beliefs for observable data.**

---

# Sensitivity analysis
.large[
.content-box-blue[
Any time informative priors are used you should conduct a
**sensitivity analysis**
]

- Compare the posterior for several priors

- Describe how the posterior changes

- Example: [https:
//www.ncbi.nlm.nih.gov/pubmed/19010642](https:
//www.ncbi.nlm.nih.gov/pubmed/19010642)
]

---

Source: [https://www.ncbi.nlm.nih.gov/pubmed/19010642](https:
//www.ncbi.nlm.nih.gov/pubmed/19010642)
]
]

---

# Combining prior knowledge

- Prior information can come from a variety of sources

- Say that source `$j$` (e.g., journal article, expert, pilot study)
suggests prior `$\pi_j(\theta)$`

- A mixture of experts prior combines the `$j$` sources into a
single prior distribution
]

---

# Mixture prior

Suppose we have two experts consulted for the asthma mortality example

- `$E(\lambda) = a/b  = 0.6$` 
- `$\displaystyle\int_0^{1.5} \pi(\lambda | a, b) d\lambda = 0.975$`

`$\Rightarrow\lambda \sim \text{Gamma}(a = 2.7, b = 4.5)$`
]

- `${\rm mode} (\lambda) = (a-1)/b =  0.8$` 
- `$\displaystyle\int_0^{2} \pi(\lambda | a, b) d\lambda = 0.95$`

`$\Rightarrow\lambda \sim \text{Gamma}(a = 4.2, b = 4)$`

]

---

# Computing with a mixture prior

```r
lambda <- seq(0, 10, by = 0.001)
prior <- 
 0.6 * dgamma(lambda, 2.7, 4.5) + 0.4 * dgamma(lambda, 4.2, 4)
```
]

---

# Natural conjugate priors

.content-box-yellow[
.large[
Is there a natural way of constructing a class of conjugate priors given a specific sampling density, `$f(y|\theta)$`?

*Answer:* Yes, for most of the distributions we have seen in stats up to this point (i.e. *the exponential family of distributions*)

]

---

# Exponential family of distributions

`$$f(y | \theta) = h(x) c(\theta) \exp \left( \sum_{i=1}^k w_i(\theta) t_i(x) \right)$$`

Note: `$k=1$` for a single-parameter model
]

Notice that

`$$f(y | \theta) \propto c(\theta) \exp \left( \sum_{i=1}^k w_i(\theta) t_i(x) \right)$$`

`$t_i(x)$` is said to be a *sufficient statistic* for `$\theta$`

---
class:inverse, middle, center

#Nonconjugate prior elicitation

---

# Histogram approach

Sometimes it's easier to get a researcher/expert to choose a set of sub-intervals and assign probabilities or relative likelihoods to those intervals.

1. Bin the parameter space

1. Decide on the amount of probability (or the relative "likelihood") in each bin

1. Normalize if required
]

---

```r
# Create a histogram prior using LearnBayes::histprior()
library(LearnBayes)
theta <- seq(0, 1, by = 0.001)
mid_points <- seq(0.05, 0.95, by = .1)
prior_likelihood <- c(1, 1, 1, 1, 2, 4, 8, 8, 4, 2)
prior_prob <- prior_likelihood / sum(prior_likelihood)

# Calc. prior probability for every point on the theta grid 
prior <- histprior(theta, mid_points, prior_prob)
```

---

# Relative likelihood approach

Instead of choosing intervals, you can have guide an expert to specify key points of a frequency polygon that you can then "fill in".

1. Identify the "most likely" and "least likely" points in the parameter space

1. Determine the relative "likelihood" for several more points

1. Interpolate (usually linearly)

1. Normalize
]

???

This takes the histogram prior approach a step farther. Instead of choosing intervals, you're essentially trying to construct a key points of a frequency polygon that you can then "fill in".

---

```r
# set up your prior beliefs
key_thetas <- seq(0, 1, by = 0.1)
prior_likelihood <- c(1, 1, 1, 1, 1, 2, 4, 8, 8, 4, 2)
prior_prob <- prior_likelihood / sum(prior_likelihood)

# interpolate to get the prior via approxfun()
rel_likelihood <- approxfun(x = key_thetas, y = prior_prob, method = "linear")

# add the prior probs on the grid
theta <- seq(0, 1, by = 0.001)
prior <- rel_likelihood(theta)
```