memorymodels.tex

\srcnote{Text of this chapter is based on~\mcite{SB2018x86tso}.
The benchmarks are unmodified from the original paper.}

\bigskip\noindent
This chapter presents an extension of the \divine model checker that allows for
analysis of C and C++ programs under the \xtso relaxed memory model.
We use an approach in which the program to be verified is first transformed so
that it encodes the relaxed memory behaviour, and after that, it is
verified by an explicit-state model checker supporting only the standard
sequentially consistent memory.
The novelty of our approach is in careful design of encoding of \xtso
operations so that the nondeterminism introduced by the relaxed memory
simulation is minimised.
In particular, we allow for nondeterminism only in connection with memory
fences and load operations of those memory addresses that were written to by a
preceding store.
We evaluate and compare our approach with the state-of-the-art bounded model
checker CBMC~\mcite{Clarke2004,Kroening2014} and stateless model checker Nidhugg~\mcite{Abdulla2015}.
For the comparison, we employ SV-COMP concurrency benchmarks that do not exhibit
data nondeterminism, and we show that our solution built on top of the
explicit-state model checker outperforms both of the other tools.
The implementation is publicly available as an open source software.

\section{Motivation and Introduction}

Almost all contemporary processors exhibit \emph{relaxed memory behaviour}, which is
caused by cache hierarchies, instruction reordering, and speculative execution.
This, together with the rise of parallel programs, means that programmers often have to deal with the added complexity of programming under relaxed memory.
The behaviour of relaxed memory can be highly unintuitive even on x86 processors, which have stronger memory model than most other architectures.
Therefore, programmers often have to decide whether to stay relatively safe with
higher-level synchronisation constructs such as mutexes, or whether to tap to the full power of the architecture and risk subtle unintuitive behaviour of relaxed memory accesses.
For these reasons, it is highly desirable to have robust tools for finding bugs in programs running under relaxed memory.

Our aim is primarily to help with the development of lock-free data structures and algorithms.
Instead of using higher-level synchronisation techniques such as mutexes, lock-free programs use low-level atomic operations provided by the hardware or programming language to ensure correct results.
This way, lock-free programs can exploit the full power of the architecture they target, but they are also harder to design, as the ordering of operations in the program has to be considered very carefully.
We believe that by providing a usable validation procedure for lock-free
programs, more programmers will find courage to develop fast and correct programs.
Recently, many techniques for analysis and verification which take relaxed memory into account have been developed, and research in this field is still pretty active.
In this work, we are adding a new technique which we hope will make the analysis of C and C++ programs targeting x86 processors easier.

We extend \divine with the support for the \xtso memory model \mcite{x86tso}
which describes the relaxed behaviour of x86 and x86-64 processors.
Due to the prevalence of the Intel and AMD processors with the x86-64
architecture, the \xtso memory model is a prime target for program analysis.
It is also relatively strong and therefore underapproximates most of the other
memory models -- any error which is observable on \xtso is going to manifest
itself under the more relaxed POWER or ARM memory models.
More details about \xtso can be found in \autoref{chap:prelim:xtso}.

To verify a program under \xtso, we first transform it by encoding the
semantics of the relaxed memory into the program itself, i.e., the resulting
transformed program itself simulates nondeterministically relaxed memory
operations.
To reveal an error related to the relaxed memory behaviour, it is then enough to
verify the transformed program with a regular model checker supporting only the
standard sequentially consistent memory.

In this chapter, we introduce a new way of encoding the relaxed memory behaviour
into the program.
Our new encoding introduces low amount of nondeterminism, which is the key
attribute that allows us to tackle model checking of nontrivial programs
efficiently.
In particular, we achieve this by delaying nondeterministic choices arising
from \xtso as long as possible.
Our approach is based on the standard operational semantic of \xtso with store
buffers, but it removes entries from the store buffer only when a load or a
fence occurs (or if the store buffer is bounded and full).
Furthermore, in loads, we only remove those entries from store buffers that
relate to the address being loaded, even if there are some older entries in the
store buffer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Conventional Semantics of \xtso}

To better illustrate the advantages of our semantics for \xtso, we
will now present the more usual semantics of this memory model often used in
verification.
This semantics is based on the \xtso semantics presented in~\mcite{x86tso}.
It uses a (possibly bounded) thread-local store buffer which stores memory store entries.
The oldest entry in the store buffer can be flushed at any point, yielding all
possible ways stores can be delayed (up do the reordering bound).
Often, the nondeterministic flushing is achieved by a \emph{flusher} thread
associated with each store buffer -- this allows uniform modelling of thread
and memory model nondeterminism.
This approach is used for example by \mcite{Zhang2015} and \mcite{Abdulla2017tso}
or in our previous work in~\mcite{SRB15weakmem}.

The disadvantage of this semantics is that it introduces a lot of
nondeterminism.
Once there are any entries in the store buffer, the store buffers can be
(partially) flushed at any point the thread rescheduling is possible.
Therefore, without an additional reduction, there will often be a lot of runs
that differ only in the moment a particular value was flushed from the store
buffer to the main memory, even if this value is never read (or is read much
later).
We illustrate this behaviour in \autoref{fig:mm:conventional}.

\begin{widefigure}
  \begin{multicols}{2}
    \begin{multicols}{2}
      \begin{cppcodeln}
        void t0() {
          x = 1;
          y = 1;
        }
      \end{cppcodeln}
      \columnbreak
      \begin{cppcodeln}
        void t1() {
          z = 1;
        }
      \end{cppcodeln}
    \end{multicols}
    \columnbreak
    \begin{cppcodeln}
      void t2() {
        int a = y;
        int b = x;
        assert(!(a == 0 && b == 1));
      }
    \end{cppcodeln}
  \end{multicols}
  \scriptsize
  \newcommand{\be}[2]{\texttt{#1} ← \texttt{#2}}
  \newcommand{\pc}[2]{$\texttt{PC}_\texttt{t#1}$\texttt{ = #2}}
  \begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
                     , semithick
                     , base/.style={ rectangle, draw=black, text centered,
                                     minimum height=1.4em, inner sep=2pt }
                     , entry/.style={ base, minimum width = 4em }
                     , loc/.style={ base, minimum width = 4.5em }
                     , box/.style={ base, dashed, rounded corners,
                                    inner sep=0.3em }
                     ]

    \node[entry, anchor = north east, xshift = 0.5em] (s0t0e0) at (0,0) {\be{x}{1}};
    \node[entry, below = 0em of s0t0e0] (s0t0e1) {\be{y}{1}};

    \node[entry, right = 1em of s0t0e0] (s0t1e0) {\be{z}{1}};

    \node[loc, below left = 0.5em and -2.5em of s0t0e1] (s0x) {\texttt{x = 0}};
    \node[loc, right = 0em of s0x] (s0y) {\texttt{y = 0}};
    \node[loc, right = 0em of s0y] (s0z) {\texttt{z = 0}};
    \node[loc, below = 0em of s0x] (s0pc0) {\pc{0}{4}};
    \node[loc, below = 0em of s0y] (s0pc1) {\pc{1}{3}};
    \node[loc, below = 0em of s0z] (s0pc2) {\pc{2}{1}};

    \begin{pgfonlayer}{background}
      \node[box, fit = (s0t0e0) (s0t1e0) (s0pc0) (s0pc1) (s0pc2)] (s0) {};
    \end{pgfonlayer}


    \node[entry, anchor = north east, xshift = 0.5em] (s1t0e0) at (-16em,-10em) {\be{y}{1}};

    \node[entry, right = 1em of s1t0e0] (s1t1e0) {\be{z}{1}};

    \node[loc, below left = 0.5em and -2.5em of s1t0e0] (s1x) {\color{red}\texttt{x = 1}};
    \node[loc, right = 0em of s1x] (s1y) {\texttt{y = 0}};
    \node[loc, right = 0em of s1y] (s1z) {\texttt{z = 0}};
    \node[loc, below = 0em of s1x] (s1pc0) {\pc{0}{4}};
    \node[loc, below = 0em of s1y] (s1pc1) {\pc{1}{3}};
    \node[loc, below = 0em of s1z] (s1pc2) {\pc{2}{1}};

    \begin{pgfonlayer}{background}
      \node[box, fit = (s1t0e0) (s1t1e0) (s1pc0) (s1pc1) (s1pc2)] (s1) {};
    \end{pgfonlayer}


    \node[entry, anchor = north east, xshift = 0.5em] (s2t0e0) at (0em,-10em) {\be{x}{1}};
    \node[entry, below = 0em of s2t0e0] (s2t0e1) {\be{y}{1}};

    \node[loc, below left = 0.5em and -2.5em of s2t0e1] (s2x) {\texttt{x = 0}};
    \node[loc, right = 0em of s2x] (s2y) {\texttt{y = 0}};
    \node[loc, right = 0em of s2y] (s2z) {\color{red}\texttt{z = 1}};
    \node[loc, below = 0em of s2x] (s2pc0) {\pc{0}{4}};
    \node[loc, below = 0em of s2y] (s2pc1) {\pc{1}{3}};
    \node[loc, below = 0em of s2z] (s2pc2) {\pc{2}{1}};

    \begin{pgfonlayer}{background}
      \node[box, fit = (s2t0e0) (s2pc0) (s2pc1) (s2pc2)] (s2) {};
    \end{pgfonlayer}


    \node[entry, anchor = north east, xshift = 0.5em] (s3t0e0) at (16em,-10em) {\be{x}{1}};
    \node[entry, below = 0em of s3t0e0] (s3t0e1) {\be{y}{1}};

    \node[entry, right = 1em of s3t0e0] (s3t1e0) {\be{z}{1}};

    \node[loc, below left = 0.5em and -2.5em of s3t0e1] (s3x) {\texttt{x = 0}};
    \node[loc, right = 0em of s3x] (s3y) {\texttt{y = 0}};
    \node[loc, right = 0em of s3y] (s3z) {\texttt{z = 0}};
    \node[loc, below = 0em of s3x] (s3pc0) {\pc{0}{4}};
    \node[loc, below = 0em of s3y] (s3pc1) {\pc{1}{3}};
    \node[loc, below = 0em of s3z] (s3pc2) {\color{red}\pc{2}{2}};

    \begin{pgfonlayer}{background}
      \node[box, fit = (s3t0e0) (s3t1e0) (s3pc0) (s3pc1) (s3pc2)] (s3) {};
    \end{pgfonlayer}

    \node[below right = 6em and 0em of s2] (s2b) {…};

    \node[box, below left = 8em and 0em of s1pc1] (s1a) {\color{red}\texttt{x = z = 1}};

    \node[box, below = 3em of s1a] (s1aa) {\color{red}\texttt{x = z = 1; }\pc{2}{2}};

    \node[box, below = 3em of s1aa] (s1aaa) {\color{red}\texttt{x = y = z = 1; }\pc{2}{2}};
    \node[box, below right = 3em and 6em of s1aa] (s1aab) {\color{red}\texttt{x = z = 1; }\pc{2}{3}};

    \node[box, below = 3em of s1aab] (s1aabb) {×};

    \node[box, below = 3em of s1aaa] (s1aaaa) {\color{red}\texttt{x = y = z = 1; }\pc{2}{3}};

    \node[box, below = 3em of s1aaaa] (s1aaaaa) {×};

    \node[box, below right = 8em and 0em of s1pc1] (s1b) {\color{red}\texttt{x = 1; }\pc{2}{2}};

    \node[box, below right = 3em and 6em of s1b] (s1bb) {\color{red}\texttt{x = 1; }\pc{2}{3}};

    \node[below right = 3em and 0em of s1bb] (s1bbb) {…};
    \node[box, below right = 3em and 0em of s1bb, xshift=9em] (s1bbc) {×};

    \draw (s0) edge node[left, xshift=-1em] {flush \be{x}{1}} (s1)
          (s0) edge node {flush \be{z}{1}} (s2)
          (s0) edge node[right, xshift=1em] {\cpp{int a = y; // →0}} (s3)

          (s1) edge node[left, xshift=2.7em, yshift=1em] {flush \be{z}{1}} (s1a)
          (s1) edge node[right, xshift=-2.55em, yshift=1em] {\cpp{int a = y; // →0}} (s1b)

          (s1a) edge node[xshift=-2em, yshift=0.5em] {\cpp{int a = y; // →0}} (s1aa)
          (s1b) edge node {flush \be{z}{1}} (s1aa)
          (s1b) edge node[right, xshift=0.5em] {\cpp{int b = x; // →1}} (s1bb)

          (s1aa) edge node[yshift=-0.5em] {flush \be{y}{1}} (s1aaa)
          (s1aaa) edge node {\cpp{int b = x; // →1}} (s1aaaa)
          (s1aaaa) edge node {\texttt{assert(…)}} (s1aaaaa)

          (s1aa) edge node[right, xshift=1em] {\cpp{int b = x; // →1}} (s1aab)
          (s1aab) edge node[right, xshift=-0.8em, yshift=-1em] {flush \be{y}{1}} (s1aaaa.east)
          (s1aab) edge node {\texttt{assert(…)}} (s1aabb)

          (s1bb) edge node[right, xshift=1em] {flush \be{z}{1}} (s1aab)
          (s1bb) edge node[right, xshift=1em, yshift=-0.5em] {flush \be{y}{1}} (s1bbb)
          (s1bb) edge[bend left, out=340, looseness=0.7] node[right, xshift=1em, yshift=0.5em] {\texttt{assert(…);}} (s1bbc)

          (s2) edge[bend left, in=210] node[left, xshift=14.5em, yshift=2em] {flush \be{x}{1}} (s1a)
          (s2) edge node[left, xshift=1em, yshift=-1.5em] {\cpp{int a = y; // →0}} (s2b)

          (s3) edge[bend left, in=210] node[above, xshift=-2em] {flush \be{x}{1}} (s1b)
          (s3) edge node[right, xshift=-1em, yshift=-1.5em] {flush \be{z}{1}} (s2b)
      ;
  \end{tikzpicture}
  \caption[An example parallel program and a fragment of its state space.]{
    An example parallel program and a fragment of its state space.
    We start in a state where the threads \texttt{t0} and \texttt{t1} had
    already executed but did not flush their buffers yet and \texttt{t2} did
    not start executing.
    This fragment of the state space shows (some of) the states that reach the
    assertion violation, i.e., states where $\texttt{a} = 0 \land \texttt{b} =
    1$.
    The states marked with \texttt{×} are the states which have reached the
    assertion violation.
    States in the first two rows of the state space show store buffers on the
    top and state of the memory and program counters on the bottom.
    In the rest of the state space, we show only differences in the memory and
    program counters.

    We can see that the state space contains many redundant runs.
    First, flushes of \be{z}{1} are irrelevant as \texttt{z} is never read.
    The same holds for flushes of \be{y}{1} after the execution of line 2 in
    \texttt{t2}.
    Finally, flushes of \be{x}{1} are only relevant just before the execution
    of line 2 in \texttt{t2}, which is the only one which reads \texttt{x}.
  }\label{fig:mm:conventional}
\end{widefigure}

In practice, analysers for parallel programs employ state space reductions which
can remove some of the nondeterminism introduced by the flusher thread.
However, the flusher thread can behave somewhat differently from normal program
threads -- it is not necessary to execute the flusher thread at all if the
store buffer it belongs to does not contain a value which will be read and it
suffices to execute it just before such a read.
Taking this into account requires either changes to the reduction mechanism, or
to the semantics of \xtso simulation.
We have decided to take the second route and show that it opens room for
further optimisations not possible with the conventional flusher-thread
implementation.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{\xtso in \divine}\label{sec:work}


\divine does not natively support relaxed memory, and we decided not to complicate the already complex execution engine and memory representation with a simulation of relaxed behaviour.
Instead, we encode the relaxed behaviour into the program itself on the level of \llvm intermediate representation.
The modified program running under sequential consistency simulates all \xtso runs of the original program, up to some bound on the number of stores which can be delayed.
The program transformation is rather similar to the one presented in our previous work in \mcite{SRB15weakmem}.
The main novelty is in the way of simulation of \xtso which produces significantly less nondeterminism and therefore substantial efficiency improvements.

\subsection{Simulation of the \xtso Memory Model}

The most straightforward way of simulating \xtso is to add store buffers to the program and flush them nondeterministically, for example using a dedicated flusher thread which flushes one entry at a time and interleaves freely with all other threads.
We used this technique in \mcite{SRB15weakmem}.
This approach does, however, create many redundant interleavings as the flusher
thread can flush an entry at any point, regardless of whether or not it is going
to produce a run with a different memory access ordering, i.e. without any
respect to whether the flushed value is going to be read or not.

To alleviate this problem, it is possible to delay the choice whether to flush
an entry from a store buffer to the point when the first load tries to read a buffered address.
Only when such a load is detected, all possible ways the store buffers could have been flushed are simulated.
In this case, the load can trigger flushing from any of the store buffers, to simulate that they could have been flushed before the load.
To further improve the performance, only entries relevant to the loaded address
are affected by the flushing.
These are the entries with matching addresses and any entries which precede them in the corresponding store buffers (that are flushed before them to maintain the store order).

A disadvantage of this approach is that there are too many ways in which a store
buffer entry can be flushed, especially if this entry is not the oldest in its
store buffer, or if there are entries concerning the same addresses in multiple store buffers.
All of these cases can cause many entries to be flushed, often with a multitude of interleavings of entries from different store buffers which has to be simulated.

\paragraph{Delayed Flushing} %
To alleviate aforementioned explosion of possible orders in which the store
buffers can be flushed, we propose \emph{delayed flushing}: entries in the
store buffers can be kept in the store buffer after newer entries were flushed
if the retained entries are marked as \emph{flushed}.
Such the entries behave as if they were already written to the main memory, but can still be reordered with entries in other store buffers.
That is, when there is a flushed entry for a given location in any store buffer, the value stored in the memory is irrelevant as any load will either read the flushed entry or entry from the other store buffer (which can be written after the flushed entry).
Flushed entries make it possible to remove store buffer entries out of order while preserving total store order of observable operations.
This way, a load only affects entries from the matching addresses and not their predecessors in the store order.
This improvement is demonstrated in~\autoref{fig:mm:flushflag}.


\begin{figure}[tp]
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
                   , semithick
                   , scale=0.5
                   , thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
                   ]

  \draw [-] (-10,0) rectangle (-7,-4);
  \draw [-] (-10,-1) -- (-7,-1)
            (-10,-2) -- (-7,-2)
            (-10,-3) -- (-7,-3);
  \node () [anchor=center] at (-8.5, 0.5) {s.b. 1};
  \node () [anchor=center] at (-8.5,-0.5) {\texttt{x $\leftarrow$ 1}};
  \node () [anchor=center] at (-8.5,-1.5) {\texttt{y $\leftarrow$ 1}};
  \node () [anchor=center] at (-8.5,-2.5) {\texttt{x $\leftarrow$ 2}};
  \node () [anchor=center] at (-8.5,-3.5) {\color{frombuf}\texttt{y $\leftarrow$ 2}};

  \draw [-] (-6,0) rectangle (-3,-4);
  \draw [-] (-6,-1) -- (-3,-1)
            (-6,-2) -- (-3,-2)
            (-6,-3) -- (-3,-3);
  \node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
  \node () [anchor=center] at (-4.5,-0.5) {\texttt{x $\leftarrow$ 3}};
  \node () [anchor=center] at (-4.5,-1.5) {\texttt{y $\leftarrow$ 3}};
\end{tikzpicture}
\begin{cppcode*}{linenos=false}
void thread0() {
  int a = y;
  int b = x;
}
\end{cppcode*}
\vspace{-\medskipamount}
\caption{
Suppose \texttt{thread0} is about to execute with the displayed contents of store buffers of two other threads and suppose it had nondeterministically chosen to load value 2 from \texttt{y} (denoted by \textcolor{frombuf}{green} in the figure).
The entries at the top of the store buffers are the oldest entries.
}\label{fig:mm:flushflagA}

\end{subfigure}
%
\hfill
%
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
                   , semithick
                   , scale=0.5
                   , thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
                   ]

  \draw [-] (-10,0) rectangle (-7,-4);
  \draw [-] (-10,-1) -- (-7,-1)
            (-10,-2) -- (-7,-2)
            (-10,-3) -- (-7,-3);
  \node () [anchor=center] at (-8.5, 0.5) {s.b. 1};
  \node () [anchor=center] at (-8.5,-0.5) {\color{flushed}\texttt{x $\leftarrow$ 1}};
  \node () [anchor=center] at (-8.5,-1.5) {\color{flushed}\texttt{x $\leftarrow$ 2}};

  \draw [-] (-6,0) rectangle (-3,-4);
  \draw [-] (-6,-1) -- (-3,-1)
            (-6,-2) -- (-3,-2)
            (-6,-3) -- (-3,-3);
  \node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
  \node () [anchor=center] at (-4.5,-0.5) {\color{frombuf}\texttt{x $\leftarrow$ 3}};
  \node () [anchor=center] at (-4.5,-1.5) {\texttt{y $\leftarrow$ 3}};
\end{tikzpicture}
\begin{cppcode*}{linenos=false}
  void thread0() {
    int a = y;// →2
    int b = x;
  }
\end{cppcode*}
\vspace{-\medskipamount}
\caption{At this point, \texttt{x}~entries of store buffer 1 are marked as flushed (\textcolor{flushed}{orange}) and the \mbox{$\texttt{y} \leftarrow \texttt{1}$} entry was removed as it was succeeded by the used entry \mbox{$\texttt{y} \leftarrow \texttt{2}$}.
The thread had nondeterministically selected to load \texttt{x} from store buffer 2.
}\label{fig:mm:flushflagB}

\end{subfigure}
%
\hfill
%
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
                   , semithick
                   , scale=0.5
                   , thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
                   ]

  \draw [-] (-10,0) rectangle (-7,-4);
  \draw [-] (-10,-1) -- (-7,-1)
            (-10,-2) -- (-7,-2)
            (-10,-3) -- (-7,-3);
  \node () [anchor=center] at (-8.5, 0.5) {s.b. 1};

  \draw [-] (-6,0) rectangle (-3,-4);
  \draw [-] (-6,-1) -- (-3,-1)
            (-6,-2) -- (-3,-2)
            (-6,-3) -- (-3,-3);
  \node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
  \node () [anchor=center] at (-4.5,-0.5) {\texttt{y $\leftarrow$ 3}};

\end{tikzpicture}
\begin{cppcode*}{linenos=false}
  void thread0() {
    int a = y;// →2
    int b = x;// →3
  }
\end{cppcode*}
\vspace{-\medskipamount}
\caption{
In the load of \texttt{x}, all \texttt{x} entries were evicted from the buffers -- all the flushed entries for \texttt{x} (which were not selected) had to be dropped before \mbox{$\texttt{x} \leftarrow \texttt{3}$} was propagated to the memory.
The last entry (\mbox{$\texttt{y} \leftarrow \texttt{3}$}) will remain in the store buffer if \texttt{y} will never be loaded in the program again.
}\label{fig:mm:flushflagC}

\end{subfigure}
\caption{An illustration of the \emph{delayed flushing}.}\label{fig:mm:flushflag}
\end{figure}

\paragraph{Program Transformation}%
%
\divine handles C and C++ code by translating it to \llvm and instrumenting it (see Figure \ref{fig:workflow} for \divine's workflow).
The support for relaxed memory is added in the instrumentation step, by replacing memory operations with calls to functions which simulate relaxed behaviour.
Essentially, all loads, stores, atomic compound instructions\mnote{Compare and swap, atomic exchange, and compound arithmetic and logic operations such as fetch and add}, and fences are replaced by calls to the appropriate functions.\mnote{With the exception of atomic compound operations, all modifications to the memory in LLVM must be performed by a series of instructions containing a load to a register, modifications of the register, and store of register value to the memory. Therefore it is sufficient to consider only this limited number of instructions as other instructions do not access memory at all.}
All of the \xtso-simulating functions are implemented so that they are executed atomically by \divine (i.e., not interleaved).

\paragraph{\xtso Memory Operations}
%
The most complex of these is the load operation.
It first finds all entries with overlap the loaded address (\emph{matching entries}) and out of these matching entries, it nondeterministically selects entries which will be written before the load (\emph{selected entries}).
All matching entries marked as flushed have to be selected for writing.
Similarly, all matching entries which occur in a store buffer before a selected entry also have to be selected.
Out of the selected entries, one is selected to be written last -- this will be the entry read by the load.
Next, selected entries are written, and all nonmatching entries which precede them are marked as flushed.
Finally, the load is performed, either from the local store buffer if matching
entry exists there, or from the shared memory.

The remaining functions are relatively straightforward.
Stores push a new entry to the store buffer, possibly evicting the oldest entry
from the store buffer if the store buffer exceeds its size bound.
Fences flush all entries from the store buffer of the calling thread to the
memory.
Atomic operations are basically a combination of a load, store, and a fence --
all atomics are sequentially consistent under \xtso.
The only intricate part of these operations is that if an entry is flushed out
of the store buffer, the entries from other store buffers which involve the
same memory location can also be nondeterministically flushed (to simulate
they could have been flushed before the given entry).
This flushing is similar to flushing performed in load.
An example which shows a series of loads can be found in~\autoref{fig:mm:flushflag}.

\paragraph{Correctness}%
%
We will now argue that this way of implementing \xtso is correct.
First, the nondeterminism in selecting entries to be flushed before a load serves the same purpose as the nondeterminism in the flusher thread of the more conventional implementation.
The only difference is that in the flusher-thread scenario the entries are
flushed in order, while in our new approach we are selecting only from the matching entries.
Therefore, the difference between the two approaches is only on those entries
which are not loaded by the load causing the flush, hence cannot be observed by the load.
However, any entry which would be flushed before the selected entries in the flusher-thread approach is now marked with the flushed flag.
This flag makes sure that such an entry will be flushed before an address which matches it is loaded, and therefore it behaves as if it was flushed.
This way, the in-thread store order is maintained.


\subsection{Stores to Freed Memory}

As \xtso simulation can delay memory stores, special care must be taken to preserve memory safety of the program.
More precisely, it is necessary to prevent the transformed program from writing into freed memory.
This problem occurs if a store to dynamically allocated memory is delayed after the memory is freed, or if a store to stack location is delayed after the corresponding function had returned.
This problem does not require special handling in normal program execution as both stack addresses as well as dynamic memory addresses remain to be writable for the program even after they are freed (except for memory-mapped files, but these have to be released by a system call which includes sufficiently strong memory barrier which prevents memory accesses from being delayed past the system call).

To solve the problem of freed memory, it is necessary to evict store buffer entries which correspond to the freed memory just before the memory is freed.
For entries not marked as flushed, this eviction concerns only store buffer of the thread which freed the memory.
If some other thread attempted to write to the freed memory, this is an error as there is insufficient synchronisation between the freeing and the store to the memory.
However, corresponding entries marked as flushed should be evicted from all store buffers, as these entries correspond to changes which should have been already written to the shared memory.

Eviction of dynamic memory is straightforward -- the program transformation injects a call to the eviction function just before every call to the function which releases memory.
For eviction of stack memory, it is necessary to evict all local addresses whenever a function exits, regardless of the way it exits.
This means we also have to take into account exceptions and other ways of performing non-local transfers of control (e.g., \texttt{longjmp}).
The program transformation takes care of tracking which local memory addresses should be evicted and inserts code to evict them at the end of functions.

\subsection{Integration with Other Parts of \divine}

The integration of \xtso simulation with the rest of \divine is straightforward in most cases.
No changes are required in the \divine's execution engine or state space exploration algorithms.
As for the libraries shipped with \divine, only minor tweaks were required.
The \texttt{pthread} implementation had to be modified to add full memory barrier both at the beginning and at the end of every synchronising functions.
This corresponds to barriers present in the implementations used for normal execution, \texttt{pthread} mutexes and other primitives have to guarantee sequential consistency of the guarded operations (provided all accesses are properly guarded).

The \divine's operating system, \dios, is used to implement low-level threading as well as simulation of various filesystem APIs \mcite{RBMKB2019}.
We had to add memory barrier into the system call entry which hands control to \dios.
\dios itself does not use relaxed memory simulation -- the implementation of \xtso operations detects that the code is executed in the kernel mode and bypasses store buffers.
In this way, the entire \dios executes as if under sequential consistency which makes its design simpler and more robust.
This synchronisation is easily justifiable -- system calls require a memory barrier or kernel lock in most operating systems.

\paragraph{Relaxed Memory and Abstract Data}
%
DIVINE has support for symbolic and abstract data (data nondeterminism) which
uses program transformation to implement symbolic manipulation of
data~\mcite{LRB2018}.
However, the integration of support for abstract data and for memory models is
not yet done as it requires more complex changes than just a straightforward
composition of the two program transformations.
Performing transformation for abstractions first is not possible currently, as
it introduces additional memory operations and therefore is not correct in
presence of relaxed memory.
On the other hand, it should be possible to run the transformation which
introduces relaxed memory behaviour first, and only after that run the
transformation which introduces abstraction.
However, this would mean that if an abstract value is ever stored to memory
(i.e., it is not only present in LLVM registers) then all values loaded from
the memory are potentially abstract, because the store buffers themselves need
to be abstract.
Therefore, this approach is not practical because it introduces too much extra
overhead.
Overall, we defer this problem to future work.

\subsection{Further Optimizations} \label{sec:mm:opt}

We have implemented two further optimisations of our \xtso simulation.

\paragraph{Static Local Variable Detection}
Accesses of local variables which are not accessible to other threads need not
use store buffering.  For this reason, we have inserted a static analysis pass
which annotates accesses to local memory before the \xtso instrumentation.  The
instrumentation ignores such annotated accesses.  The static analysis can
detect most local variables which are never accessed using pointers.


\paragraph{Dynamic Local Memory Detection}
\divine can also dynamically detect if the given memory object is shared between threads (i.e., it is accessible from global variables or stacks of more then one thread).
Using this information, it is possible to dynamically bypass store buffers for operations with non-shared memory objects.
This optimisation is correct even though the shared status of memory can change during its lifetime.
A memory object $o$ can become shared only when its address is written to some memory object $s$ which is already shared (or $o$ can become shared transitively through a series of pointers and intermediate objects).
For this to happen, there has to be a store to the already shared object $s$, and this store has to be propagated to other threads.
Once the store of the address of $o$ is executed and written to the store buffer, $o$ becomes shared, and any newer stores into it will go through the store buffer.
Furthermore, once this store is propagated, any store which happened before turning $o$ into a shared object also had to be propagated as \xtso does not reorder stores.
Therefore, there is no reason to put stores to $o$ through the store buffer if $o$ is not shared.
This optimisation is not correct for memory models which allow store reordering -- for such memory models, we would need to know that the object will never be shared during its lifetime.

\subsection{Bounding the Size of Store Buffers}

The complexity of analysis of programs under the \xtso memory model is high.
From the theoretical point of view, we know due to~\mcite{wmdecidability} that reachability for programs with finite-state threads which run under TSO is decidable, but non-primitive recursive (it is in \textsc{pspace} for sequential consistency).
The proof uses the so-called SPARC TSO memory model~\mcite{SPARC94} that is very similar to \xtso. However, the proof of decidability does not translate well to an efficient decision procedure, and real-world programs are much more complex than the finite-state systems used in the decidability proof.

For this reason, we would need to introduce unbounded store buffers to properly
verify real-world programs. Unfortunately, this can be impractical, especially for programs which do not terminate.
Therefore, our program transformation
inserts store buffers of limited size, limiting thus the number of store
operations that can be delayed at any given time. The
size of the store buffers is fully configurable, and it currently defaults to
32, a value probably high enough to discover most bugs which can be observed on
real hardware.

Our implementation also supports the store buffers of unlimited size (when size
is set to 0). In this mode, programs with infinite loops that write into shared
memory will not have finite state space under \xtso even if they have finite
state space under sequential consistency.
Therefore, \divine will not terminate unless it discovers an error in the
program.
Verification with unbounded buffers will still terminate for terminating
programs and for all programs with errors.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Evaluation} \label{sec:evaluation}

The implementation is available at \url{http://divine.fi.muni.cz/2018/x86tso/}, together with information about how to use it.
It is also integrated into DIVINE releases.
We compared our implementation with the stateless model checker Nidhugg~\mcite{Abdulla2015} and the bounded model checker CBMC~\mcite{Clarke2004,Kroening2014}.
For evaluation we used SV-COMP 2017 benchmarks from the Concurrency
category~\mcite{SV-COMP:2017}, excluding benchmarks with data
nondeterminism\mnote{I.e., all the benchmarks which contain calls to
functions of the \texttt{\_\_VERIFIER\_nondet\_*} family were excluded.} as
DIVINE does not yet support data nondeterminism with relaxed memory.
Furthermore, due to the limitation of stateless model checking with DPOR,
Nidhugg cannot handle data nondeterminism at all.
There are 55 benchmarks in total.

The evaluation was performed on a machine with two dual core Intel Xeon 5130
processors running at 2 GHz with 16 GB of RAM. Each tool was running with memory
limit set to 10 GB and time limit set to 1 hour. The tools were not limited in
the number of CPUs they can use.

We have used CBMC version 5.8 with the option \texttt{-{}-mm tso}. Since there is no official release of Nidhugg, we have used version 0.2 from git, commit id
\texttt{375c554} with \texttt{-tso} option to enable relaxed memory support and inserted a definition of the \texttt{\_\_VERIFIER\_error} function.
For \divine, we have used the \texttt{-{}-svcomp} option to enable support for SV-COMP atomic sections (which are supported by default by CBMC and Nidhugg), and we disabled nondeterministic memory failure by using the \texttt{divine check} command (SV-COMP does not consider the possibility of allocation failure).
To enable \xtso analysis, \texttt{-{}-relaxed-memory tso} is used for \divine.\mnote{The complete invocation is \texttt{divine check -{}-svcomp -{}-relaxed-memory tso BENCH.c}.}
The buffer bound was the default value (32) unless stated otherwise.

\begin{table}[tt]
\caption{
    Comparison of the default configuration of \divine with CBMC and Nidhugg.
} \label{tab:mm:default:svc}
\centering
\begin{tabular}{lrrr} \toprule
             & CBMC & \hspace*{1ex} Nidhugg & \hspace*{1ex} \divine \\ \midrule
    finished &   21 &      25 &     27 \\
    TSO bugs &    3 &       3 &      9 \\
    unique   &    5 &       3 &      5 \\
  \bottomrule
\end{tabular}
\end{table}

\autoref{tab:mm:default:svc} compares performance of the default configuration of \divine with the remaining tools.
The line ``finished'' shows the total number of benchmarks for which the verification task finished with the given limits.
From these the line ``TSO bugs'' shows the number of errors caused by relaxed memory in benchmarks which were not supposed to contain any bugs under sequential consistency.
All discovered errors were manually checked to really be observable under the \xtso memory model.
Finally, ``unique'' shows the number of benchmarks solved only by the given tool and not the other two. There were only 8 benchmarks solved by all three tools, all of them without any errors found.

\begin{table}[th]
\caption{
    Comparison of various configurations of \divine.
    The ``base'' version uses none of the improvements from \autoref{sec:mm:opt}.
    The configurations marked with ``s'' add the static local variable optimisation, while the configurations marked with ``d'' add the dynamic detection of non-shared memory objects.
    The ``+sdu'' configuration has both optimisations enabled and it has unbounded buffers.
    Finally, the ``+sd4'' has buffer bound set to 4 entries instead of the default 32 entries.
    The default version is ``+sd''.
} \label{tab:mm:divine:svc}
\centering
\begin{tabularx}{\textwidth}{lRRRRRR} \toprule
           & base & +s & +d & +sd & +sdu & +sd4 \\ \midrule
  finished &   26 & 26 & 27 &  27 &   27 &   27 \\
  TSO bugs &    8 &  8 &  9 &   9 &    9 &    9 \\
  \bottomrule
\end{tabularx}
\bigskip
\caption{
    Comparison of various versions of \divine on benchmarks on the 26 which all the versions finished.
    For the description of these versions, refer to \autoref{tab:mm:divine:svc}.
} \label{tab:mm:divine:time:svc}
\centering
\begin{tabularx}{\textwidth}{lRRRRRR} \toprule
           &    base &      +s &      +d &     +sd &    +sdu &    +sd4 \\ \midrule
  states   &   252 k &   263 k &   250 k &   231 k &   206 k &   296 k \\
  time     & 2:14:49 & 2:17:13 & 1:09:23 & 1:05:05 & 0:58:28 & 1:24:59 \\
  \bottomrule
\end{tabularx}
\end{table}

\autoref{tab:mm:divine:svc} shows effects of buffer size bound and improvements described in \autoref{sec:mm:opt}.
It can be seen that all versions perform very similarly, only one more benchmark was solved by the versions with dynamic shared object detection (the remaining solved benchmarks were the same for all versions).
The number of solved benchmarks remains the same regardless of used store buffer bound.

\autoref{tab:mm:divine:time:svc} offers more detailed look at the 26 benchmarks solved by all versions of \divine.
It shows the aggregate differences in state space sizes and solving times.
It can be seen that the dynamic shared object detection improves performance significantly.
Interestingly, we can see that of the 3 versions which differ only in store buffer size (``+sd'', ``+sdu'', and ``+sd4''), the unbounded version performs the best.
We expect this to be caused by the nondeterminism in flushing the excessive entries out of the store buffer when the bound is reached -- this can trigger flushing of matching entries from other store buffers and therefore increase nondeterminism.

\section{Related work}

A lot of techniques have support for program analysis under relaxed memory models, most of these were already described in \autoref{chap:stateoftheart} and therefore, we will not replicate the description here.
A lot of these techniques is build on stateless model checking (see also \autoref{sec:stateoftheart:smc}) including \mcite{Demsky2015,Zhang2015,Huang2016,Abdulla2017tso} for \xtso and \mcite{Abdulla2016,Norris2016,Kokologiannakis2017,Abdulla2018} for other memory models. Bounded model checking (see also \autoref{sec:stateoftheart:bmc}) is another popular choice for analysis of \xtso \mcite{Alglave2013po,Gavrilenko2019,Tomasco2017} and other memory models \mcite{Burckhardt2007,Abdulla2017}.
An explicit-state approach to programs running under the C\# relaxed memory model is presented in \mcite{Huang2016}.
There are also previous works which use program transformation in the analysis of relaxed memory models \mcite{Alglave2013,Abdulla2017,SRB15weakmem}.

Our approach to simulation of \xtso is somewhat similar to the approach presented in \mcite{Abdulla2018}. Both approaches perform nondeterministic choice primarily on memory loads. However, \cite{Abdulla2018} is focusing on the release-acquire fragment of C11 and does not support sequential consistency.
Similarly, \mcite{Kokologiannakis2017} presents a technique that attempts to minimise nondeterminism, this time in the analysis of a modified version of the C11 memory model.

There are also some approaches to  theanalysis of relaxed memory models which were not mentioned in \autoref{chap:stateoftheart}, mostly because they lack implementation which works with realistic programming languages.
In \mcite{Atig2011} a program transformation that uses context switch bounding but not buffer bounding is presented. It uses additional copies for shared variables for TSO simulation. However, the evaluated was performed with manually converted programs.
An unbounded approach to verification of programs under TSO is presented in~\mcite{Linden2010}.
It uses store buffers represented by automata and leverages cycle iteration acceleration to get a representation of store buffers on paths which would form cycles if values in store buffers were disregarded.
It targets a modified Promela modelling language~\mcite{Holzmann1997}.
Another unbounded approach is presented in~\mcite{Bouajjani2015}.
It introduces TSO behaviours lazily by iterative refinement, and while it is not complete, it should eventually find all errors.
The work~\mcite{Dan2013} presents verification of (potentially infinite state space) programs under TSO and PSO (with bounded store buffers) using predicate abstraction. It first analyses the program under sequential consistency and then extrapolates the predicates to TSO/PSO and performs further analysis (and possibly refinement).
% CUPEX \mcite{Dan2013}
% - explores possibilities for combination of predicate abstraction and relaxed memory models (TSO, PSO)
% - new predictates are extrapolated from predicates used under SC
% - first verifies the program under SC, then constructs a modified program which encodes the memory model (runs of the modified program under SC simulate buffer-bounded runs of the original program under the given memory model), finally it discovers new predicates for the modified program and check it.
% - probably a modelling language

\paragraph{Absence of Relaxed Behaviour}

There are also techniques which aim to solve the problem of absence of relaxed behaviour in a given program.
For these methods, the question is whether a program, when running under a relaxed memory model, exhibits any runs not possible under sequential consistency.
This problem is explored under many names, e.g. (TSO-)safety~\mcite{Burckhardt2008}, robustness~\mcite{Bouajjani2013,Derevenetc2014}, stability~\mcite{Alglave2011}, and monitoring of sequential consistency~\mcite{Burnim2011}.
Similar techniques are used in \mcite{Yang2004} to detect data races in Java programs.
A related problem of correspondence between a parallel and sequential implementation of a data structure is explored in~\mcite{Ou2017}.
Some of these techniques can be also used to insert memory fences into the programs to recover sequential consistency.

Neither of these techniques are directly comparable to our method.
For these techniques, a program is incorrect if it exhibits relaxed behaviour, while for us, it is incorrect if it violates specification (e.g., assertion safety and memory safety).
In practice, the appearance of relaxed behaviour is often not a problem, provided the overall behaviour of the data structure or algorithm matches the desired specification.
In many lock-free data structures, relaxed behaviour is essential to achieving high performance.

\paragraph{Other Methods}\label{other-methods}

In~\mcite{Park1995}, the SPARC hierarchy of memory models (TSO, PSO, RMO) is modelled using encoding from assembly to Mur\(\varphi\)~\mcite{Murphi}.
A completely different approach is taken in~\mcite{Turon2014}.
This work introduces a separation logic GPS, which allows proving properties about programs using (a fragment of) the C11 memory model.
This work is intended for manual proving of properties of parallel programs, not for automatic verification.
The memory model used in this work is not complete; it lacks relaxed and release-consume accesses.
Another fragment of the C11 memory model is targeted by the RSL separation logic introduced in~\mcite{Vafeiadis2013}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Conclusion} \label{sec:conclusion}

We showed that by careful design of simulation of relaxed memory behaviour we
can use the standard model checker supporting only the sequential consistency to
efficiently detect relaxed memory errors in programs that are otherwise correct
under sequentially consistent memory. Moreover, according to our experimental
evaluation, our explicit-state model checking approach outperforms
a state-of-the-art stateless model checker as well as bounded model checker,
which is actually quite an unexpected result. We also show that many of the used
benchmarks can be solved only by one or two of the three evaluated tools, which
highlights the importance of employing different approaches to the analysis of programs
under relaxed memory. Finally, we show that for terminating programs, our
approach is viable both with bounded and unbounded store buffer size.

% vim: colorcolumn=80 expandtab sw=4 ts=4 spell spelllang=en