THERE WILL BE NO MEASURE THEORY IN THIS BLOG.
I will be using terms like 'density' and 'distribution' interchagably and I'm not sorry.
Hello and welcome to another episode of max loses his sanity the derivation that reviewer 2 rejected. This episode, its a headlong dive into the concrete slab (and spike?) of sampling for Bayesian model order selection.
It seems to often be the case in modelling that we are interested in not only the parameterisation of a model but also which model we should use in the first place.
In a world free from any uncertainty (sign me up), we could just apply cross-validation for each model and simply select the one that best describes the data. Fortunately for us (I still have a job), uncertainty is as much a part of life as paper rejections and existential dread.
Model selection is made more complex by the presence of uncertainty. There are several options that we might consider:
Something something marginal likelihoods, something
Rather than evaluate the Bayes factors pairwise for all model pairs, it might be argued that a more natural Bayesian treatment is to target the discrete marginal posterior over the model choice directly,
where
Applying Bayes rule,
This all seems fairly standard, but a flagrant abuse of notation is concealing a big problem, the parameterisation of each
This at least looks a bit more sensible, but a closed form inference is not going to be possible and we are going to have to resort to a sampling-based approach. However, we won't just be able to employ our favourite gradient based sampling scheme and ride off into the Bayesian sunset. As we shall see, the structure of our statistical model requires special treatment (and not just a kick in the NUTS)...
If you are anything like me, you might be staring at the above and thinking, why can't I just adopt a standard Metropolis-Hastings sampling scheme that goes something like:
- initialise the chain at (
$k, \theta$ ) - sample
$k'\sim q(k)$ - sample
$\theta' \sim q(\theta|k, \theta)$ - compute
$p(y|k', \theta)$ - move the chain to (
$k', \theta'$ ) with probability$\alpha$
- repeat
- ???
- profit
Where
The devil here is in the detail. Eagle-eyed viewers will no-doubt have noticed that the proposal distribution
In order to draw samples from a target distribution
- Aperiodicity (does not have any finite period)
- Irreducibility (any part of the support can be reached from anywhere - even if unlikely)
In practice, designing a kernel with these features can be tricky. Instead, a stronger (but more easily enforced) restriction is often applied - detailed balance.
Ignoring the model selection problem for now, the condition of detailed balance ensures that,
Where
The probability of being in state
Expanding out the kernel term into a proposal density and a accept/reject step we have,
Where
From here it is easy to see that the 'Metropolis choice' of
Which is exactly the Metropolis step that we are used to seeing.
"But max", I hear you ask, wasn't this a blog post about model selection? Why do we care about detailed balance? and why have you just re-derived the MH algorithm? Don't they have it on Wikipedia alongside everything else you've ever written about?
Indeed they do, but let us examine the effect of trying to reconstruct the above for a move between model orders 1 and 2. The detailed balance condition states that,
The trouble here is with the ratio,
What exactly is the trouble with this? The trouble is that in the numerator we have defined a density that maps from model
This problem is referred in the original paper as the dimension matching requirement. Put simply, we need to ensure that the proposal density
where
Intuitively, we could think about moving from a model with 3 terms (a quadratic perhaps) to a model with 4 terms (a cubic). In order to make this step a (very) naïve approach would be to keep the quadratic terms and simply draw the new cubic term from some proposal distribution. Thus we could interpret our new cubic term as being
In the previous example, we were only concerned with a jump to a higher dimensional model. In practice we will want to be able to make jumps both up and down model orders. A more general proposal is therefore,
Imagine that we have observed some data
TBC