Bayesian model selection: Reversible jump MCMC

THERE WILL BE NO MEASURE THEORY IN THIS BLOG.

I will be using terms like 'density' and 'distribution' interchangeably and I'm not sorry.

Hello and welcome to another episode of max loses his sanity over the derivation that reviewer 2 rejected. This episode, it's a headlong dive into the concrete slab (and spike?) of sampling for Bayesian model order selection.

It often seems to be the case in modelling that we are interested not only in the parameterisation of a model but also in which model we should use in the first place.

In a world free from any uncertainty (sign me up), we could just apply cross-validation for each model and simply select the one that best describes the data. Fortunately for us (I still have a job), uncertainty is as much a part of life as paper rejections and existential dread.

Model selection is made more complex by the presence of uncertainty. There are several options that we might consider:

Option the first: Bayes factors

Something something marginal likelihoods, something $\mathcal{M}$-closed. Just go to Wikipedia, the world doesn't need another explanation from me.

Option the second: Extended posterior

Rather than evaluate the Bayes factors pairwise for all model pairs, it might be argued that a more natural Bayesian treatment is to target the discrete marginal posterior over the model choice directly,

$$p(\mathcal{M} = \mathcal{M}_k|y) = p(k|y)$$

where $k$ is an indicator variable denoting that the data were generated by $\mathcal{M}_k$. Notice that in the above we have implicitly marginalised over the parameterisation $\theta$ of each model.

Applying Bayes' rule to the joint posterior over the model index and its parameters,

$$p(k,\theta|y)=\frac{p(y|k,\theta)p(k)p(\theta|k)}{p(y)}$$

This all seems fairly standard, but a flagrant abuse of notation is concealing a big problem: the parameterisations of the different $\mathcal{M}_k$ are not necessarily the same, so there is not just a single $\theta$ in our posterior but one for each model! A convenient factorisation that makes this clear is,

$$p(k, [\theta^{(1)}, ...,\theta^{(n)}]|y) = p(k|y)p(\theta^{(k)}|k,y)$$

This at least looks a bit more sensible, but closed-form inference is not going to be possible and we are going to have to resort to a sampling-based approach. However, we won't just be able to employ our favourite gradient-based sampling scheme and ride off into the Bayesian sunset. As we shall see, the structure of our statistical model requires special treatment (and not just a kick in the NUTS)...

Sorry, why can't I just adopt a hierarchical sampling approach?

If you are anything like me, you might be staring at the above and thinking, why can't I just adopt a standard Metropolis-Hastings sampling scheme that goes something like:

  1. initialise the chain at ($k, \theta$)
  2. sample $k'\sim q(k'|k)$
  3. sample $\theta' \sim q(\theta'|k', \theta)$
  4. compute $p(y|k', \theta')$
  5. move the chain to ($k', \theta'$) with probability $\alpha$

$$\alpha = \min\left(1, \frac{p(y|k', \theta')\,p(k')\,p(\theta'|k')\,q(k|k')\,q(\theta|k, \theta')}{p(y|k, \theta)\,p(k)\,p(\theta|k)\,q(k'|k)\,q(\theta'|k', \theta)}\right)$$

  1. repeat
  2. ???
  3. profit

Where $q$ is an appropriately chosen proposal density, then plug it into STAN and put the kettle on.
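
To make the recipe concrete, here is a minimal sketch of that naive scheme in Python/numpy. The callables (`log_likelihood`, `log_prior_k`, `propose_theta`, and friends) are hypothetical placeholders rather than any particular library's API, and the sketch deliberately glosses over the dimension problem discussed next:

```python
import numpy as np

def naive_transdimensional_mh(y, k0, theta0, n_steps,
                              log_likelihood, log_prior_k, log_prior_theta,
                              propose_k, propose_theta,
                              log_q_k, log_q_theta, seed=0):
    """Naive Metropolis-Hastings over (k, theta) pairs.

    log_q_k(a, b)        : log q(a | b), proposal density over model indices
    log_q_theta(a, k, b) : log q(a | k, b), proposal density over parameters
    """
    rng = np.random.default_rng(seed)
    k, theta = k0, np.asarray(theta0, dtype=float)
    chain = [(k, theta)]
    for _ in range(n_steps):
        # steps 2-3: propose a new model index and a parameter vector for it
        k_new = propose_k(k)
        theta_new = propose_theta(k_new, theta)

        # steps 4-5: log of the acceptance ratio written above
        log_alpha = (
            log_likelihood(y, k_new, theta_new) + log_prior_k(k_new)
            + log_prior_theta(theta_new, k_new)
            + log_q_k(k, k_new) + log_q_theta(theta, k, theta_new)
            - log_likelihood(y, k, theta) - log_prior_k(k)
            - log_prior_theta(theta, k)
            - log_q_k(k_new, k) - log_q_theta(theta_new, k_new, theta)
        )
        if np.log(rng.uniform()) < log_alpha:
            k, theta = k_new, theta_new
        chain.append((k, theta))
    return chain
```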

The devil here is in the detail. Eagle-eyed viewers will no doubt have noticed that the proposal distribution $q$ defines a stochastic map from $\theta^{(k)}\in \Theta^k$ to $\theta^{(k')}\in \Theta^{k'}$. These model parameters do not necessarily live in the same space ($\Theta^k \neq \Theta^{k'}$). This causes some big headaches. Although the above scheme can be made to work (by being clever with $q$, as we shall see), one has to be very careful to select a valid proposal distribution that still targets our posterior.

Detailed balance - yoga for your sampler?

In order to draw samples from a target distribution $\pi$, there are a number of restrictions that must be placed on the proposal distribution $q$. Together with the MH accept/reject step, these form a Markov kernel. This kernel must leave $\pi$ invariant (i.e. $\pi$ must be its stationary distribution) and must have the following properties.

  1. Aperiodicity (the chain does not cycle deterministically between regions of the space)
  2. Irreducibility (any part of the support can be reached from anywhere - even if the move is unlikely)

In practice, designing a kernel with these features can be tricky. Instead, a stronger (but more easily enforced) condition that guarantees the invariance of $\pi$ is often applied - detailed balance.

Ignoring the model selection problem for now, the condition of detailed balance ensures that,

$$\pi(\theta)K(\theta'|\theta)=\pi(\theta')K(\theta|\theta')$$

Where $\pi$ is the target stationary density and $K$ is the Markov kernel of the MCMC chain. Intuitively, this is equivalent to,

The probability of being in state $\theta$ and then transitioning to state $\theta'$ is equal to the probability of being in state $\theta'$ and transitioning to state $\theta$.

Expanding out the kernel term into a proposal density and an accept/reject step we have,

$$\pi(\theta)q(\theta'| \theta)\alpha(\theta'| \theta)=\pi(\theta')q(\theta|\theta')\alpha(\theta|\theta')$$

Where $q$ is our proposal density and $\alpha$ is the acceptance probability. With a little rearrangement, we derive the relation,

$$\frac{\alpha(\theta'|\theta)}{\alpha(\theta| \theta')} = \frac{\pi(\theta')q(\theta| \theta')}{\pi(\theta)q(\theta'| \theta)}$$

From here it is easy to see that the 'Metropolis choice' of $\alpha$ satisfies the above,

$$\alpha(\theta'|\theta) = \min \left( 1, \frac{\pi(\theta')q(\theta| \theta')}{\pi(\theta)q(\theta'| \theta)} \right)$$

Which is exactly the Metropolis-Hastings acceptance step that we are used to seeing.
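
In code, that acceptance step is the familiar accept/reject loop. Here is a minimal fixed-dimension random-walk sketch (Python/numpy, with `log_pi` standing in for the log of the target density $\pi$; the function name and defaults are illustrative, not a reference implementation):

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_steps, step_size=0.5, seed=0):
    """Random-walk Metropolis using the acceptance probability derived above.

    For a symmetric proposal, q(theta'|theta) = q(theta|theta'), so the
    proposal terms cancel and alpha reduces to min(1, pi(theta')/pi(theta)).
    """
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = [theta.copy()]
    for _ in range(n_steps):
        theta_prop = theta + step_size * rng.normal(size=theta.shape)
        log_alpha = log_pi(theta_prop) - log_pi(theta)  # symmetric q cancels
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_prop
        samples.append(theta.copy())
    return np.array(samples)
```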

"But max", I hear you ask, wasn't this a blog post about model selection? Why do we care about detailed balance? and why have you just re-derived the MH algorithm? Don't they have it on Wikipedia alongside everything else you've ever written about?

Indeed they do, but let us examine the effect of trying to reconstruct the above for a move between model orders 1 and 2. The detailed balance condition states that,

$$\pi(\theta^{(1)})q(\theta^{(2)}| \theta^{(1)})\alpha(\theta^{(2)}| \theta^{(1)})=\pi(\theta^{(2)})q(\theta^{(1)}|\theta^{(2)})\alpha(\theta^{(1)}|\theta^{(2)})$$

$$\frac{\alpha(\theta^{(2)}| \theta^{(1)})}{\alpha(\theta^{(1)}|\theta^{(2)})} = \frac{\pi(\theta^{(2)})q(\theta^{(1)}|\theta^{(2)})}{\pi(\theta^{(1)})q(\theta^{(2)}| \theta^{(1)})}$$

The trouble here is with the ratio,

$$\frac{q(\theta^{(1)}|\theta^{(2)})}{q(\theta^{(2)}| \theta^{(1)})}$$

What exactly is the trouble with this? The trouble is that the density in the numerator maps from model $\mathcal{M}_2$ to model $\mathcal{M}_1$, while the one in the denominator maps the other way around! How can $q$ map in both directions at once if the models have a different number of parameters?

This problem is referred to in the original paper as the dimension matching requirement. Put simply, we need to ensure that the proposal maps between spaces of equal dimension. In order to solve this, Green proposes to break the proposal down into deterministic and random parts such that,

$$\theta^{(2)} = h(\theta^{(1)}, u), \qquad u \sim g$$

where $h$ is a deterministic one-to-one mapping and $u$ is a random vector sampled from the distribution $g$. The idea here is to 'augment' the parameter vector $\theta^{(1)}$ with an additional vector $u$ such that $\theta^{(2)}$ and $(\theta^{(1)}, u)$ have the same dimension.

Intuitively, we could think about moving from a model with 3 terms (a quadratic, perhaps) to a model with 4 terms (a cubic). A (very) naïve way to make this step would be to keep the quadratic terms and simply draw the new cubic coefficient from some proposal distribution. We could then interpret the new cubic coefficient as $u$ and take $h$ to be an identity mapping. Because the proposal step involves 4 parameters in both directions, we satisfy the dimension matching criterion.
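
As a sketch of that naive birth move (assuming a Gaussian $g$ for the new coefficient; the function and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def birth_move_quadratic_to_cubic(theta_quad, sigma_u=1.0):
    """Propose a cubic from a quadratic by keeping the existing coefficients
    and drawing the extra one from g(u) = Normal(0, sigma_u^2).

    Here h is the identity on the old coefficients, so dimensions match:
    3 + 1 = 4, and the Jacobian of (theta, u) -> theta' is 1.
    """
    u = sigma_u * rng.normal()              # u ~ g(u)
    theta_cubic = np.append(theta_quad, u)  # theta' = h(theta, u) = (theta, u)
    log_g_u = -0.5 * (u / sigma_u) ** 2 - np.log(sigma_u * np.sqrt(2 * np.pi))
    return theta_cubic, log_g_u             # log g(u) enters the acceptance ratio
```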

Detailed balance and RJMCMC, the full picture.

In the previous example, we were only concerned with a jump to a higher dimensional model. In practice we will want to be able to make jumps both up and down model orders. A more general proposal therefore augments the parameter vectors on both sides of the jump,

$$(\theta^{(2)}, u^{(2)}) = h(\theta^{(1)}, u^{(1)}), \qquad u^{(1)} \sim g_1, \quad u^{(2)} \sim g_2$$

where $h$ is again a one-to-one mapping and the dimensions are chosen such that $\dim(\theta^{(1)}) + \dim(u^{(1)}) = \dim(\theta^{(2)}) + \dim(u^{(2)})$.
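
The acceptance probability for such a move then picks up a Jacobian factor $\left|\partial(\theta^{(2)}, u^{(2)})/\partial(\theta^{(1)}, u^{(1)})\right|$ from the change of variables. As a sketch of how the pieces fit together (argument names are illustrative; all densities and move probabilities are assumed to be supplied by the user on the log scale):

```python
def rj_log_accept_prob(log_post_new, log_post_old,
                       log_g_reverse, log_g_forward,
                       log_move_reverse, log_move_forward,
                       log_abs_det_jacobian):
    """Log acceptance probability for a single reversible jump move.

    log_post_*           : log posterior density of the proposed / current (k, theta)
    log_g_forward        : log density of the auxiliary u used to jump forwards
    log_g_reverse        : log density of the auxiliary u' needed to jump back
    log_move_*           : log probability of selecting each move type
    log_abs_det_jacobian : log |d(theta_new, u') / d(theta_old, u)| of the mapping h
    """
    log_ratio = (log_post_new + log_g_reverse + log_move_reverse
                 - log_post_old - log_g_forward - log_move_forward
                 + log_abs_det_jacobian)
    return min(0.0, log_ratio)  # i.e. alpha = min(1, ratio) on the log scale
```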

A contrived concrete example: Polynomial order selection.

Imagine that we have observed some data $y$ that we believe to have been generated by a polynomial of unknown order.
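
As a starting point, here is how such a dataset might be simulated (the true order, coefficients, and noise level below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate y from a cubic (order 3) polynomial with additive Gaussian noise;
# the sampler will later be asked to recover both the order and the coefficients
true_coeffs = np.array([0.5, -1.0, 0.3, 2.0])  # constant, x, x^2, x^3 terms
x = np.linspace(-2.0, 2.0, 50)
noise_std = 0.5
y = np.polyval(true_coeffs[::-1], x) + noise_std * rng.normal(size=x.shape)
```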

TBC