
The chain functions actually simulate the random walk. This is where the incremental priors are computed, and where the burn-in is implemented.

Fit phases

As described in the state tracking paper (Abeysuriya et al., 2015), the MCMC system uses an adaptive proposal distribution. As such, the fit takes place in three stages:

  1. An initial sample to generate the first approximation of the covariance matrix
  2. A burn-in period during which samples are taken but discarded. During this time, the covariance matrix adapts rapidly
  3. A long period of actual sampling. The covariance matrix still adapts during this time, but changes are much slower than at the start

We now go through a call to chain.m to show how the algorithm is implemented. Typically, users do not interact with chain() directly - rather, it is expected to be called via fit_spectrum() or similar.

The call to chain() is:

chain(model,initial_values,n_points,debugmode,timelimit,n_burnin)

These are

  • model is of course the model object being used for the fit. Note that the model is expected to have had model.prepare_for_fit already run, which means that the target experimental spectrum is already stored within the model by the time it arrives in chain()
  • initial_values is a vector of parameter values. It is critical that the initial parameters are valid, i.e. they must be within the parameter limits, stable, and must not otherwise cause model.probability() to return a non-finite number for the initial values
  • n_points is the length of the chain, as a number. If the fit is time-limited, this is the upper bound for the chain length.
  • debugmode is a flag that will plot the trace plot if set to true. The reason for this is that the chain() function returns only the requested samples, not the discarded burn-in points. That is, if you request 10000 points, chain() will take about 11000 steps, and return the last 10000. This means that from outside chain(), you cannot plot the initial part of the chain showing the initial convergence to the stationary distribution. debugmode plots the entire chain before the function returns
  • timelimit is the maximum allowed time for the fit. If set to false, then the number of points specified in n_points will be generated. If set to a number, then chain() will still attempt to generate n_points worth of samples, but will stop immediately when the time limit is reached.
  • n_burnin explicitly specifies how many burn-in samples to use. If not provided, it is set to 10% of the requested chain length. However, when performing a parallel run, each worker runs chain() with a subset of the total number of points, so 10% of each worker's chain length would correspond to a much shorter burn-in than 10% of the total output series. Without correcting for this, the results from a parallel run could differ from those of a serial run.
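
Putting these arguments together, a direct call might look like the sketch below. This is only an illustration - in practice chain() is normally invoked via fit_spectrum(), and the output arguments and time-limit units shown here are assumptions.

% Sketch only: assumes model.prepare_for_fit has already been run and that
% chain() returns the sampled parameters (output arguments are assumptions)
out = chain(model, initial_values, 10000, true, false);    % 10000 samples, debug plots on, no time limit
out_short = chain(model, initial_values, 10000, false, 60); % stop after the time limit (assumed to be in seconds), even if fewer samples were generated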

Several quantities are then populated or preallocated. These are

  • n_initial which sets how many samples should be taken using the model's initial step size (with zero covariance) for the proposal distribution, after which the adaptive proposal distribution is used
  • n_burnin is calculated, which is the total number of points from the start of the chain that will be discarded. This number includes n_initial
  • out, which stores all of the steps in the random walk, including both the burn-in points and the actual sample points
  • final_posterior which stores the value of the posterior for corresponding parameters in out
  • accept which is a counter that keeps track of how many proposed steps have been accepted. This is a useful quantity to examine when choosing the initial step size for a model: as a general rule, an acceptance rate in the vicinity of 10-30% is good. The routine will work with higher or lower rates, but may require more samples to produce reliable results

Next, several quantities for the adaptive proposal distribution are defined. Full details and references are in (Abeysuriya et al., 2015). In summary, the idea is that the proposal distribution should have a similar covariance to the target distribution, to maximize the efficiency of the algorithm by taking larger steps in directions where the target distribution spans a wider range of values, and smaller steps in directions where the target distribution is narrower. On top of that, the width of the proposal distribution is scaled by a multiplicative constant. If the constant is small, the proposal distribution will generate small steps that are likely to be accepted. If the constant is large, then the proposal distribution will generate large steps that are less likely to be accepted. As mentioned above, for various reasons the MCMC algorithm is most efficient with an acceptance rate of about 20%. To achieve this, the multiplicative constant is also adaptive (this is what is meant by 'adaptive global scaling') such that it increases if the acceptance rate is too high, and decreases if the rate is too low.
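
As a rough illustration, adaptive global scaling of this kind is often implemented with an update along the lines of the sketch below. This is not verbatim chain.m code - the update rule and the adaptation rate gamma are assumptions, and lambda, alpha and target_accept are described in the lists that follow.

gamma = 1/j;                                                      % diminishing adaptation rate at step j (hypothetical choice)
lambda = lambda * exp(gamma * (min(1,alpha(j)) - target_accept)); % accepting too often -> larger steps; too rarely -> smaller steps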

Note that all of these ideas have been validated in analytic proofs in published papers. Adaptation or changes to the algorithm may be expected to affect convergence, and ensuring that the correctness of the sampling is preserved is a highly nontrivial task. So modifications intended to improve the sampling efficiency of the algorithm must be approached with extreme caution.

The quantities related to the proposal distribution are

  • fixed_cov which ensures that the diagonal entries of the covariance matrix are nonzero, as discussed by Rosenthal on page 15.
  • lambda which is the scaling factor, again discussed by Rosenthal on page 15. It depends on the dimension of the problem. For the corticothalamic model at least, the initial value of lambda leads to acceptance rates much lower than 20%, so we apply an additional rescaling by dividing lambda by 4 (see the sketch after this list). Because lambda is also adaptive, this should not significantly affect the algorithm, particularly since the theoretical value is chosen for no reason other than to achieve a particular acceptance rate
  • target_accept which is the acceptance rate that the adaptive global scaling seeks to achieve
  • alpha which is pre-initialized here; its role in the acceptance step is described in the next list
  • mu which stores the mean value of each parameter in the chain
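
A sketch of how these quantities might be initialized is shown below. The 2.38^2/d factor is Rosenthal's dimension-dependent scaling and the division by 4 is the extra rescaling described above; the remaining values, in particular the form of fixed_cov and the exact target acceptance rate, are assumptions rather than the actual chain.m code.

fixed_cov = (0.1^2 / model.n_params) * eye(model.n_params); % small fixed diagonal component (form is an assumption)
lambda = (2.38^2 / model.n_params) / 4;                     % Rosenthal's scaling, reduced by 4 as described above
target_accept = 0.234;                                      % desired acceptance rate (roughly 20-25%)
mu = initial_values;                                        % running mean of the chain, updated as samples accumulate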

The following other quantities are also preallocated

  • x - at each step, the posterior probability is evaluated for the proposed set of parameters, and this is compared to the current set of parameters. x is a 2 x n_params matrix that stores the parameters for the current step (in the first row), and the parameters for the proposed step (in the second row)
  • alpha - the proposed step is accepted based on the ratio of the proposed probability to the current probability, and this ratio is stored in alpha
  • initialization_stage tracks which part of the chain algorithm we are in. The possible stages are: 1 - generate initial samples, 2 - use the adaptive proposal distribution to generate burnin samples, 3 - use the adaptive proposal distribution to generate samples for output
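
Taken together, the preallocation described in these two lists might look something like the sketch below. The sizes follow the text and the names mirror the description of chain.m, but the exact code may differ.

out = zeros(n_burnin + n_points, model.n_params); % every step of the walk, burn-in included
final_posterior = zeros(n_burnin + n_points, 1);  % posterior value for each row of out
alpha = zeros(n_burnin + n_points, 1);            % acceptance ratio at each step
x = zeros(2, model.n_params);                     % row 1: current parameters, row 2: proposed parameters
x(1,:) = initial_values;                          % the walk starts at the initial values
accept = 0;                                       % number of accepted proposals so far
initialization_stage = 1;                         % 1: initial samples, 2: burn-in, 3: output samples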

The chain is generated in a while loop, because if execution is time-limited then the number of output points generated is not fixed - although this could equivalently be written as a for loop, the while loop in this case seems cleaner. The while loop terminates when n_burnin+n_points have been computed, and the time limit is implemented as a break statement within the loop.
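
Schematically, the loop has the following structure (a sketch, not the verbatim code; the time limit is assumed to be in seconds):

j = 1;
t_start = tic;
while j <= n_burnin + n_points
    % ... generate the proposal, accept or reject it, store the sample in out(j,:) ...
    if timelimit && toc(t_start) > timelimit
        break   % time limit reached, stop early
    end
    j = j + 1;
end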

At each iteration, the first task is to generate the proposed step based on the proposal distribution. If initialization_stage=1, then the proposed step is generated using

x(2,:) = x(1,:) + model.initial_step_size.*randn(1,model.n_params); 

That is, it uses the current position (in x(1,:)) and adds to it a perturbation sampled from a normal distribution that is independent for each parameter, with standard deviation given by model.initial_step_size. In chain.m, the new point is generated by taking a random step in all of the parameters simultaneously. Other algorithms change one parameter at a time while keeping the others fixed, but there seems to be no strong reason to prefer either method, except that perturbing all parameters simultaneously eliminates a nested loop.

After generating the proposed step, fixed parameters are taken into account. The parameters are fixed using

x(2,model.skip_fit==2) = initial_values(model.skip_fit==2);

If skip_fit is 2 for a parameter, its initial value is copied into the proposed step, so the parameter remains fixed throughout the chain. The model's probability function is then called using the proposed parameters

[chisq,new_posterior] = model.probability(x(2,:));

Again, the model stores the target spectrum internally, so it is able to return chisq as a function of the model parameters only. If chisq is finite (i.e., if the parameters are valid) and the proposed step is accepted, then the accept counter is incremented and the new parameters are copied over to the current parameters. Then, the current parameters are recorded as an output sample.
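
In outline, the accept/reject step looks something like the following sketch. It assumes model.probability() returns a plain (unnormalized) posterior probability rather than its logarithm, and current_posterior is a hypothetical name for the posterior at the current parameters.

[chisq, new_posterior] = model.probability(x(2,:));
alpha(j) = new_posterior / current_posterior;  % ratio of proposed to current posterior
if isfinite(chisq) && rand < alpha(j)          % accept with probability min(1, alpha)
    accept = accept + 1;
    x(1,:) = x(2,:);                           % proposed parameters become the current parameters
    current_posterior = new_posterior;
end
out(j,:) = x(1,:);                             % record the current parameters as a sample
final_posterior(j) = current_posterior;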

Finally, the initialization_stage is updated. Once n_initial steps have been taken, an initial covariance matrix is computed.

my_cov = cov(out(1:j,:));
mu = mean(out(1:j,:));

The initialization_stage is then set to 2. On the next loop iteration, the proposed step is computed using

x(2,:) = mvnrnd_fast(x(1,:),lambda*my_cov);

which samples from a multivariate normal distribution with a specific covariance. mvnrnd_fast is a stripped-down version of Matlab's built-in mvnrnd. The rest of the loop proceeds as before, but now, the covariance matrix must be re-computed every time a new sample is generated. This is performed in an iterative fashion based on the covariance matrix from the previous step, which means that updating the covariance matrix is very efficient. In contrast, computing the covariance matrix from scratch becomes more and more computationally intensive as the number of points becomes large.
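
The update typically takes a recursive form along the following lines (a sketch based on the standard running-covariance recursion; the exact expression used in chain.m may differ):

x_new = out(j,:);                         % the sample just added to the chain
mu_old = mu;
mu = mu_old + (x_new - mu_old)/j;         % update the running mean
my_cov = ((j-2)*my_cov + (x_new - mu_old)'*(x_new - mu)) / (j-1); % rank-one update of the covariance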

When j = n_burnin (recalling that n_burnin includes n_initial), the initialization stage is incremented once more. The accept counter is reset to 0, so that the acceptance rate can be calculated for the samples that are actually returned (although it's worth noting that the acceptance rate could also be computed by counting how many duplicate rows there are in the output chain, as in the sketch below).
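
For example, after the run the acceptance rate for the returned samples could be estimated in either of these two equivalent ways (a sketch, assuming the full n_points were generated and out still contains the complete chain including burn-in):

rate_from_counter    = accept / n_points;                                    % counter was reset at the end of burn-in
rate_from_duplicates = mean(any(diff(out(n_burnin+1:end,:), 1, 1) ~= 0, 2)); % fraction of steps that actually moved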

Once the while loop exits, if debugmode=true then the chain diagnostics are displayed using

bt.core.chain_diagnostics(model,out(1:(j-1),:),final_posterior(1:(j-1)),n_burnin);

Note that j is incremented at the end of the loop rather than the beginning, so the output is stored in indexes 1:j-1 rather than 1:j. Also, at this point out stores all of the samples, which means that the chain diagnostics include the burn-in points. The argument n_burnin means that chain_diagnostics is able to provide a visual cue showing the burn-in period.

Finally, the output is truncated and then the chain is returned.
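
That is, something along these lines (a sketch):

out = out(n_burnin+1:j-1, :);                      % discard burn-in, keep only the requested samples
final_posterior = final_posterior(n_burnin+1:j-1);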

Parallel fitting

MCMC sampling is ideally suited to parallelization because the samples can be generated independently. The key idea is that if we want n samples, we can run N copies of chain, with each call to chain requesting n/N output points. These output points can then be aggregated into a single set of output samples. This is achieved by the function chain_parallel, which has almost the same calling syntax as chain. This makes it easy to use chain_parallel as a drop-in replacement for chain.

chain_parallel expects that a parallel pool is already open, and will throw an error if this is not the case. The number of output points required on each worker is computed. Alternatively, if a time limit is specified, that time limit is passed directly to the workers without modification. chain is then called inside a parfor loop. Finally, the outputs from each call to chain are concatenated into a single matrix that is then returned.
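
Schematically, chain_parallel does something like the following sketch. The worker count, variable names, and exact output handling are assumptions, but the burn-in handling reflects the behaviour described in the considerations below.

n_workers = 4;                            % assumed size of the open parallel pool
n_each = ceil(n_points / n_workers);      % output points requested from each worker
n_burnin_total = round(0.1 * n_points);   % burn-in based on the *total* chain length (see below)
worker_out = cell(1, n_workers);
parfor k = 1:n_workers
    worker_out{k} = chain(model, initial_values, n_each, false, timelimit, n_burnin_total);
end
out = vertcat(worker_out{:});             % concatenate the worker chains into one set of samples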

If debugmode=true, then the diagnostic output from each worker chain is displayed, including the burn-in period. After that, the diagnostic output from chain_parallel is displayed, which shows the final output samples that will be returned, but with no burn-in samples.

Some considerations
  • Because the chains are simply stuck together, you cannot look at the autocorrelation or other such measures in the output from chain_parallel. However, since the standard chain analysis just looks at the distribution of points in the chain, and doesn't care about the order of the points, this normally does not matter.
  • The intention is that you can just change chain() to chain_parallel() and get immediately comparable results. For consistency with chain.m, the number of burn-in steps on each worker is computed based on the total length of the chain. That is, if you request 10000 points, chain() will have 1000 points of burn-in. If you call chain_parallel with 10000 points and 4 workers, it will call chain() requesting 2500 points each. If you just call chain() with 2500 points, there will only be 250 points of burn-in. Therefore, chain_parallel calls chain with the optional argument n_burnin which is set to 1000 points (based on the total requested 10000 points). Therefore, the individual chains are run with 1000 points of burn-in and 2500 points of output. This ensures valid results, but it also has the effect of sub-linear performance increases as the number of workers is increased when the chain length is small

Time limited fitting

Users can specify an upper limit on the execution time of chain. When a time limit is specified, 20% of the run is spent on burn-in, rather than the 10% normally used. The required length of the burn-in does not actually depend on the total length of the chain: the number of steps required for the chain to reach its stationary distribution does not depend on how many samples are ultimately generated. As a convenient approximation, we set the burn-in length to 10% of the chain, which is common practice in the literature. However, as the chain gets shorter, the amount of burn-in required gets larger as a percentage of the total chain length. Since timelimit is normally only used for extremely short runs, a 20% burn-in is used to achieve satisfactory convergence. You can see this for yourself by enabling debugmode.
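
In other words, the burn-in fraction might be chosen along these lines (a sketch; the exact rule in chain.m may differ):

if timelimit
    n_burnin = round(0.20 * n_points);  % short, time-limited runs need proportionally more burn-in
else
    n_burnin = round(0.10 * n_points);  % the usual 10% burn-in for long runs
end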

Sometimes, when a time limit is specified, chain will fail with the error Burnin did not finish, something went wrong; this failure mode is under investigation. It can occur if computing the initial n_initial points takes longer than the total time allowed for the fit. This simply means that the computer is too slow to provide an acceptable fit within the time requested.
