fast generation of long sequences of Bernoulli-distributed random variables
Fast Bernoulli — a library for sampling from Bernoulli distribution with a specific parameters. Performant Bernoulli samplers are highly useful in scientific and engineering fields: modeling, random graphs, coding theory, etc.
The standard way to obtain Bernoulli distributed random value is sampling from uniform distribution and thresholding with a parameter. The library exploits a different property of Bernoulli random variables to attain better efficiency in the sense of (P)RNG invocations.
Namely, logical operation (like not, or, and, xor) on two Bernoulli random
variables gives another random variable distributed according to Bernoulli law
with a new parameter. So, the concept behind is that Bernoulli distribution
with any parameter can be approximated with a sequence of other Bernoulli
random variables with different parameters and logical operators applied to the
sequence. Set of logical operators is not constrained in general but this
implementation is based on and(&)
and not(~)
operators. As soon as a target
distribution parameter is "quantized" in a sequence logical operators, one can
apply the transformations to a sequence of random values in run-time to draw a
sample.
In order to archive better performance on practice, one need to improve sample efficiency in sense of memory consumption. Standard implementation returns a value of random variable as a bool-typed value which in fact consumes one byte. In contrast, library returns block of random values. Each random value encoded as a bit in a block. This fact allows to use vector extensions to instruction set.
Another important assumption is that (P)RNG generates random number and bits in generated numbers are distributed uniformly. In general, this is true especially for top-quality (P)RNG (mt19937 as example). So, the assumption allows to apply the described algorithm to bits of numbers obtained with (P)RNG.
Another one improvement is related to implementation. Execution of a sequence of logical operations on bits requires iterations in loop. Many iterations affect performance through context switch and (un)conditional branching. The best performance could be archived if distribution parameter or a set of logical operators is known in compile-time. Just-in-time (JIT) compilation facilitates overcoming the obstacle. The library supports LLVM which provides JIT compiler infrastructure.
- High efficiency per random bit.
- Vectorization with AVX.
- JIT compilation with LLVM.
- Thread-safe.
- Python bindings.
The only reason for creation of the project is efficient sampling. So, in this
section we present some benchmarks of sampler in different configurations. One
can see that the proposed implementation is ahead of standard Bernoulli sampler
(prefix Std
in table or figure).
Run on (4 X 3400 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 256K (x2)
L3 Unified 4096K (x1)
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations Throughput
--------------------------------------------------------------------------------
BM_RandomNumberGeneration/128 4786 ns 4783 ns 147228 816.740 M/s
BM_StdSampler/128 72260140 ns 72208458 ns 10 55.3952 k/s
BM_GenericSampler/128 104076 ns 103545 ns 6745 37.7252 M/s
BM_AvxSampler/128 71017 ns 70982 ns 9759 55.0314 M/s
BM_JitGenericSampler/128 103482 ns 103431 ns 6735 37.7669 M/s
BM_JitAvxSampler/128 55667 ns 55640 ns 12393 70.2053 M/s
Since one of the aims is to generate bulks of random values, it is interesting to look at sampling efficiency in sense of bitrate with regards length of generated sequence.
Build dependencies are (a) compiler with support C++17, (b) CMake as build-system generator, (c) Make as default build system. Also building requires (d) Google GTest for testing, and (e) Google Benchmark for benchmarking. Optional dependency is (f) LLVM which provides JIT compiler facility.
mkdir -p build/release && cd build/release
cmake ../.. -DCMAKE_BUILD_TYPE=Release -DBUILD_WHEEL=ON -DUSE_LLVM=ON
make
LLVM is not required by default, so it could be turned on/off with option
-DUSE_LLVM=ON/OFF
(as it is shown on the snippet above). Option BUILD_WHEEL
triggers building of Python bindings which could be find in directory
fast-bernoulli/python/
. Target for building Python bindings is not included
to a set of default targets, so one should to run building of target
fast-bernoulli-python-wheel
in advance.
make fast-bernoulli-python-wheel
The simplest way to construct sampler in C++ is just calling CreateSampler()
routine with target probability. The function returns a functor which should be
invoked with an instance of (P)RNG and aligned memory buffer. Finally, one
could bring samples to common representation (vector of bools).
#include <fast-bernoulli/fast-bernoulli.h>
using namespace NFastBernoulli;
// Assume 65536 bits with probability 0.6 should be generated.
auto probability = 0.6;
auto nobits = 65536;
// Create an instance of ISampler and make buffer for sampling.
auto sampler = CreateSampler(probability);
auto ptr = sampler.MakeBuffer(nobits);
// Use Mersenne Twister as a pseudo-random number generator.
std::mt19937_64 rng;
// Draw samples from Bernoulli distribution.
sampler(rng, ptr);
// Expand compressed representation of random values to vector of bools.
auto values = Expand(ptr, nobits);
The function CreateSampler()
is overridden in order to provide an advanced way
to instantiate sampler with structure TSamplerOpts
.
Also, the library delivers Python bindings and an extremely simple usage
interface. Module fast_bernoulli
provides function sample()
which generates
random bits and returns memory view on resulting buffer of type ui8
.
import numpy as np
import fast_bernoulli as fb
# Assume 65536 bits with probability 0.6 should be generated.
probability = 0.6;
nobits = 65536;
# Draw samples from Bernoulli distribution. JIT is used.
res = fb.sample(probability, nobits, seed=42, jit=True) # memoryview
arr = np.array(res) # np.ndarray with dtype=np.uint8