-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmemorymodels.tex
760 lines (639 loc) · 47.2 KB
/
memorymodels.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
\srcnote{Text of this chapter is based on~\mcite{SB2018x86tso}.
The benchmarks are unmodified from the original paper.}
\bigskip\noindent
This chapter presents an extension of the \divine model checker that allows for
analysis of C and C++ programs under the \xtso relaxed memory model.
We use an approach in which the program to be verified is first transformed so
that it encodes the relaxed memory behaviour, and after that, it is
verified by an explicit-state model checker supporting only the standard
sequentially consistent memory.
The novelty of our approach is in careful design of encoding of \xtso
operations so that the nondeterminism introduced by the relaxed memory
simulation is minimised.
In particular, we allow for nondeterminism only in connection with memory
fences and load operations of those memory addresses that were written to by a
preceding store.
We evaluate and compare our approach with the state-of-the-art bounded model
checker CBMC~\mcite{Clarke2004,Kroening2014} and stateless model checker Nidhugg~\mcite{Abdulla2015}.
For the comparison, we employ SV-COMP concurrency benchmarks that do not exhibit
data nondeterminism, and we show that our solution built on top of the
explicit-state model checker outperforms both of the other tools.
The implementation is publicly available as an open source software.
\section{Motivation and Introduction}
Almost all contemporary processors exhibit \emph{relaxed memory behaviour}, which is
caused by cache hierarchies, instruction reordering, and speculative execution.
This, together with the rise of parallel programs, means that programmers often have to deal with the added complexity of programming under relaxed memory.
The behaviour of relaxed memory can be highly unintuitive even on x86 processors, which have stronger memory model than most other architectures.
Therefore, programmers often have to decide whether to stay relatively safe with
higher-level synchronisation constructs such as mutexes, or whether to tap to the full power of the architecture and risk subtle unintuitive behaviour of relaxed memory accesses.
For these reasons, it is highly desirable to have robust tools for finding bugs in programs running under relaxed memory.
Our aim is primarily to help with the development of lock-free data structures and algorithms.
Instead of using higher-level synchronisation techniques such as mutexes, lock-free programs use low-level atomic operations provided by the hardware or programming language to ensure correct results.
This way, lock-free programs can exploit the full power of the architecture they target, but they are also harder to design, as the ordering of operations in the program has to be considered very carefully.
We believe that by providing a usable validation procedure for lock-free
programs, more programmers will find courage to develop fast and correct programs.
Recently, many techniques for analysis and verification which take relaxed memory into account have been developed, and research in this field is still pretty active.
In this work, we are adding a new technique which we hope will make the analysis of C and C++ programs targeting x86 processors easier.
We extend \divine with the support for the \xtso memory model \mcite{x86tso}
which describes the relaxed behaviour of x86 and x86-64 processors.
Due to the prevalence of the Intel and AMD processors with the x86-64
architecture, the \xtso memory model is a prime target for program analysis.
It is also relatively strong and therefore underapproximates most of the other
memory models -- any error which is observable on \xtso is going to manifest
itself under the more relaxed POWER or ARM memory models.
More details about \xtso can be found in \autoref{chap:prelim:xtso}.
To verify a program under \xtso, we first transform it by encoding the
semantics of the relaxed memory into the program itself, i.e., the resulting
transformed program itself simulates nondeterministically relaxed memory
operations.
To reveal an error related to the relaxed memory behaviour, it is then enough to
verify the transformed program with a regular model checker supporting only the
standard sequentially consistent memory.
In this chapter, we introduce a new way of encoding the relaxed memory behaviour
into the program.
Our new encoding introduces low amount of nondeterminism, which is the key
attribute that allows us to tackle model checking of nontrivial programs
efficiently.
In particular, we achieve this by delaying nondeterministic choices arising
from \xtso as long as possible.
Our approach is based on the standard operational semantic of \xtso with store
buffers, but it removes entries from the store buffer only when a load or a
fence occurs (or if the store buffer is bounded and full).
Furthermore, in loads, we only remove those entries from store buffers that
relate to the address being loaded, even if there are some older entries in the
store buffer.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conventional Semantics of \xtso}
To better illustrate the advantages of our semantics for \xtso, we
will now present the more usual semantics of this memory model often used in
verification.
This semantics is based on the \xtso semantics presented in~\mcite{x86tso}.
It uses a (possibly bounded) thread-local store buffer which stores memory store entries.
The oldest entry in the store buffer can be flushed at any point, yielding all
possible ways stores can be delayed (up do the reordering bound).
Often, the nondeterministic flushing is achieved by a \emph{flusher} thread
associated with each store buffer -- this allows uniform modelling of thread
and memory model nondeterminism.
This approach is used for example by \mcite{Zhang2015} and \mcite{Abdulla2017tso}
or in our previous work in~\mcite{SRB15weakmem}.
The disadvantage of this semantics is that it introduces a lot of
nondeterminism.
Once there are any entries in the store buffer, the store buffers can be
(partially) flushed at any point the thread rescheduling is possible.
Therefore, without an additional reduction, there will often be a lot of runs
that differ only in the moment a particular value was flushed from the store
buffer to the main memory, even if this value is never read (or is read much
later).
We illustrate this behaviour in \autoref{fig:mm:conventional}.
\begin{widefigure}
\begin{multicols}{2}
\begin{multicols}{2}
\begin{cppcodeln}
void t0() {
x = 1;
y = 1;
}
\end{cppcodeln}
\columnbreak
\begin{cppcodeln}
void t1() {
z = 1;
}
\end{cppcodeln}
\end{multicols}
\columnbreak
\begin{cppcodeln}
void t2() {
int a = y;
int b = x;
assert(!(a == 0 && b == 1));
}
\end{cppcodeln}
\end{multicols}
\scriptsize
\newcommand{\be}[2]{\texttt{#1} ← \texttt{#2}}
\newcommand{\pc}[2]{$\texttt{PC}_\texttt{t#1}$\texttt{ = #2}}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
, semithick
, base/.style={ rectangle, draw=black, text centered,
minimum height=1.4em, inner sep=2pt }
, entry/.style={ base, minimum width = 4em }
, loc/.style={ base, minimum width = 4.5em }
, box/.style={ base, dashed, rounded corners,
inner sep=0.3em }
]
\node[entry, anchor = north east, xshift = 0.5em] (s0t0e0) at (0,0) {\be{x}{1}};
\node[entry, below = 0em of s0t0e0] (s0t0e1) {\be{y}{1}};
\node[entry, right = 1em of s0t0e0] (s0t1e0) {\be{z}{1}};
\node[loc, below left = 0.5em and -2.5em of s0t0e1] (s0x) {\texttt{x = 0}};
\node[loc, right = 0em of s0x] (s0y) {\texttt{y = 0}};
\node[loc, right = 0em of s0y] (s0z) {\texttt{z = 0}};
\node[loc, below = 0em of s0x] (s0pc0) {\pc{0}{4}};
\node[loc, below = 0em of s0y] (s0pc1) {\pc{1}{3}};
\node[loc, below = 0em of s0z] (s0pc2) {\pc{2}{1}};
\begin{pgfonlayer}{background}
\node[box, fit = (s0t0e0) (s0t1e0) (s0pc0) (s0pc1) (s0pc2)] (s0) {};
\end{pgfonlayer}
\node[entry, anchor = north east, xshift = 0.5em] (s1t0e0) at (-16em,-10em) {\be{y}{1}};
\node[entry, right = 1em of s1t0e0] (s1t1e0) {\be{z}{1}};
\node[loc, below left = 0.5em and -2.5em of s1t0e0] (s1x) {\color{red}\texttt{x = 1}};
\node[loc, right = 0em of s1x] (s1y) {\texttt{y = 0}};
\node[loc, right = 0em of s1y] (s1z) {\texttt{z = 0}};
\node[loc, below = 0em of s1x] (s1pc0) {\pc{0}{4}};
\node[loc, below = 0em of s1y] (s1pc1) {\pc{1}{3}};
\node[loc, below = 0em of s1z] (s1pc2) {\pc{2}{1}};
\begin{pgfonlayer}{background}
\node[box, fit = (s1t0e0) (s1t1e0) (s1pc0) (s1pc1) (s1pc2)] (s1) {};
\end{pgfonlayer}
\node[entry, anchor = north east, xshift = 0.5em] (s2t0e0) at (0em,-10em) {\be{x}{1}};
\node[entry, below = 0em of s2t0e0] (s2t0e1) {\be{y}{1}};
\node[loc, below left = 0.5em and -2.5em of s2t0e1] (s2x) {\texttt{x = 0}};
\node[loc, right = 0em of s2x] (s2y) {\texttt{y = 0}};
\node[loc, right = 0em of s2y] (s2z) {\color{red}\texttt{z = 1}};
\node[loc, below = 0em of s2x] (s2pc0) {\pc{0}{4}};
\node[loc, below = 0em of s2y] (s2pc1) {\pc{1}{3}};
\node[loc, below = 0em of s2z] (s2pc2) {\pc{2}{1}};
\begin{pgfonlayer}{background}
\node[box, fit = (s2t0e0) (s2pc0) (s2pc1) (s2pc2)] (s2) {};
\end{pgfonlayer}
\node[entry, anchor = north east, xshift = 0.5em] (s3t0e0) at (16em,-10em) {\be{x}{1}};
\node[entry, below = 0em of s3t0e0] (s3t0e1) {\be{y}{1}};
\node[entry, right = 1em of s3t0e0] (s3t1e0) {\be{z}{1}};
\node[loc, below left = 0.5em and -2.5em of s3t0e1] (s3x) {\texttt{x = 0}};
\node[loc, right = 0em of s3x] (s3y) {\texttt{y = 0}};
\node[loc, right = 0em of s3y] (s3z) {\texttt{z = 0}};
\node[loc, below = 0em of s3x] (s3pc0) {\pc{0}{4}};
\node[loc, below = 0em of s3y] (s3pc1) {\pc{1}{3}};
\node[loc, below = 0em of s3z] (s3pc2) {\color{red}\pc{2}{2}};
\begin{pgfonlayer}{background}
\node[box, fit = (s3t0e0) (s3t1e0) (s3pc0) (s3pc1) (s3pc2)] (s3) {};
\end{pgfonlayer}
\node[below right = 6em and 0em of s2] (s2b) {…};
\node[box, below left = 8em and 0em of s1pc1] (s1a) {\color{red}\texttt{x = z = 1}};
\node[box, below = 3em of s1a] (s1aa) {\color{red}\texttt{x = z = 1; }\pc{2}{2}};
\node[box, below = 3em of s1aa] (s1aaa) {\color{red}\texttt{x = y = z = 1; }\pc{2}{2}};
\node[box, below right = 3em and 6em of s1aa] (s1aab) {\color{red}\texttt{x = z = 1; }\pc{2}{3}};
\node[box, below = 3em of s1aab] (s1aabb) {×};
\node[box, below = 3em of s1aaa] (s1aaaa) {\color{red}\texttt{x = y = z = 1; }\pc{2}{3}};
\node[box, below = 3em of s1aaaa] (s1aaaaa) {×};
\node[box, below right = 8em and 0em of s1pc1] (s1b) {\color{red}\texttt{x = 1; }\pc{2}{2}};
\node[box, below right = 3em and 6em of s1b] (s1bb) {\color{red}\texttt{x = 1; }\pc{2}{3}};
\node[below right = 3em and 0em of s1bb] (s1bbb) {…};
\node[box, below right = 3em and 0em of s1bb, xshift=9em] (s1bbc) {×};
\draw (s0) edge node[left, xshift=-1em] {flush \be{x}{1}} (s1)
(s0) edge node {flush \be{z}{1}} (s2)
(s0) edge node[right, xshift=1em] {\cpp{int a = y; // →0}} (s3)
(s1) edge node[left, xshift=2.7em, yshift=1em] {flush \be{z}{1}} (s1a)
(s1) edge node[right, xshift=-2.55em, yshift=1em] {\cpp{int a = y; // →0}} (s1b)
(s1a) edge node[xshift=-2em, yshift=0.5em] {\cpp{int a = y; // →0}} (s1aa)
(s1b) edge node {flush \be{z}{1}} (s1aa)
(s1b) edge node[right, xshift=0.5em] {\cpp{int b = x; // →1}} (s1bb)
(s1aa) edge node[yshift=-0.5em] {flush \be{y}{1}} (s1aaa)
(s1aaa) edge node {\cpp{int b = x; // →1}} (s1aaaa)
(s1aaaa) edge node {\texttt{assert(…)}} (s1aaaaa)
(s1aa) edge node[right, xshift=1em] {\cpp{int b = x; // →1}} (s1aab)
(s1aab) edge node[right, xshift=-0.8em, yshift=-1em] {flush \be{y}{1}} (s1aaaa.east)
(s1aab) edge node {\texttt{assert(…)}} (s1aabb)
(s1bb) edge node[right, xshift=1em] {flush \be{z}{1}} (s1aab)
(s1bb) edge node[right, xshift=1em, yshift=-0.5em] {flush \be{y}{1}} (s1bbb)
(s1bb) edge[bend left, out=340, looseness=0.7] node[right, xshift=1em, yshift=0.5em] {\texttt{assert(…);}} (s1bbc)
(s2) edge[bend left, in=210] node[left, xshift=14.5em, yshift=2em] {flush \be{x}{1}} (s1a)
(s2) edge node[left, xshift=1em, yshift=-1.5em] {\cpp{int a = y; // →0}} (s2b)
(s3) edge[bend left, in=210] node[above, xshift=-2em] {flush \be{x}{1}} (s1b)
(s3) edge node[right, xshift=-1em, yshift=-1.5em] {flush \be{z}{1}} (s2b)
;
\end{tikzpicture}
\caption[An example parallel program and a fragment of its state space.]{
An example parallel program and a fragment of its state space.
We start in a state where the threads \texttt{t0} and \texttt{t1} had
already executed but did not flush their buffers yet and \texttt{t2} did
not start executing.
This fragment of the state space shows (some of) the states that reach the
assertion violation, i.e., states where $\texttt{a} = 0 \land \texttt{b} =
1$.
The states marked with \texttt{×} are the states which have reached the
assertion violation.
States in the first two rows of the state space show store buffers on the
top and state of the memory and program counters on the bottom.
In the rest of the state space, we show only differences in the memory and
program counters.
We can see that the state space contains many redundant runs.
First, flushes of \be{z}{1} are irrelevant as \texttt{z} is never read.
The same holds for flushes of \be{y}{1} after the execution of line 2 in
\texttt{t2}.
Finally, flushes of \be{x}{1} are only relevant just before the execution
of line 2 in \texttt{t2}, which is the only one which reads \texttt{x}.
}\label{fig:mm:conventional}
\end{widefigure}
In practice, analysers for parallel programs employ state space reductions which
can remove some of the nondeterminism introduced by the flusher thread.
However, the flusher thread can behave somewhat differently from normal program
threads -- it is not necessary to execute the flusher thread at all if the
store buffer it belongs to does not contain a value which will be read and it
suffices to execute it just before such a read.
Taking this into account requires either changes to the reduction mechanism, or
to the semantics of \xtso simulation.
We have decided to take the second route and show that it opens room for
further optimisations not possible with the conventional flusher-thread
implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\xtso in \divine}\label{sec:work}
\divine does not natively support relaxed memory, and we decided not to complicate the already complex execution engine and memory representation with a simulation of relaxed behaviour.
Instead, we encode the relaxed behaviour into the program itself on the level of \llvm intermediate representation.
The modified program running under sequential consistency simulates all \xtso runs of the original program, up to some bound on the number of stores which can be delayed.
The program transformation is rather similar to the one presented in our previous work in \mcite{SRB15weakmem}.
The main novelty is in the way of simulation of \xtso which produces significantly less nondeterminism and therefore substantial efficiency improvements.
\subsection{Simulation of the \xtso Memory Model}
The most straightforward way of simulating \xtso is to add store buffers to the program and flush them nondeterministically, for example using a dedicated flusher thread which flushes one entry at a time and interleaves freely with all other threads.
We used this technique in \mcite{SRB15weakmem}.
This approach does, however, create many redundant interleavings as the flusher
thread can flush an entry at any point, regardless of whether or not it is going
to produce a run with a different memory access ordering, i.e. without any
respect to whether the flushed value is going to be read or not.
To alleviate this problem, it is possible to delay the choice whether to flush
an entry from a store buffer to the point when the first load tries to read a buffered address.
Only when such a load is detected, all possible ways the store buffers could have been flushed are simulated.
In this case, the load can trigger flushing from any of the store buffers, to simulate that they could have been flushed before the load.
To further improve the performance, only entries relevant to the loaded address
are affected by the flushing.
These are the entries with matching addresses and any entries which precede them in the corresponding store buffers (that are flushed before them to maintain the store order).
A disadvantage of this approach is that there are too many ways in which a store
buffer entry can be flushed, especially if this entry is not the oldest in its
store buffer, or if there are entries concerning the same addresses in multiple store buffers.
All of these cases can cause many entries to be flushed, often with a multitude of interleavings of entries from different store buffers which has to be simulated.
\paragraph{Delayed Flushing} %
To alleviate aforementioned explosion of possible orders in which the store
buffers can be flushed, we propose \emph{delayed flushing}: entries in the
store buffers can be kept in the store buffer after newer entries were flushed
if the retained entries are marked as \emph{flushed}.
Such the entries behave as if they were already written to the main memory, but can still be reordered with entries in other store buffers.
That is, when there is a flushed entry for a given location in any store buffer, the value stored in the memory is irrelevant as any load will either read the flushed entry or entry from the other store buffer (which can be written after the flushed entry).
Flushed entries make it possible to remove store buffer entries out of order while preserving total store order of observable operations.
This way, a load only affects entries from the matching addresses and not their predecessors in the store order.
This improvement is demonstrated in~\autoref{fig:mm:flushflag}.
\begin{figure}[tp]
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
, semithick
, scale=0.5
, thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
]
\draw [-] (-10,0) rectangle (-7,-4);
\draw [-] (-10,-1) -- (-7,-1)
(-10,-2) -- (-7,-2)
(-10,-3) -- (-7,-3);
\node () [anchor=center] at (-8.5, 0.5) {s.b. 1};
\node () [anchor=center] at (-8.5,-0.5) {\texttt{x $\leftarrow$ 1}};
\node () [anchor=center] at (-8.5,-1.5) {\texttt{y $\leftarrow$ 1}};
\node () [anchor=center] at (-8.5,-2.5) {\texttt{x $\leftarrow$ 2}};
\node () [anchor=center] at (-8.5,-3.5) {\color{frombuf}\texttt{y $\leftarrow$ 2}};
\draw [-] (-6,0) rectangle (-3,-4);
\draw [-] (-6,-1) -- (-3,-1)
(-6,-2) -- (-3,-2)
(-6,-3) -- (-3,-3);
\node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
\node () [anchor=center] at (-4.5,-0.5) {\texttt{x $\leftarrow$ 3}};
\node () [anchor=center] at (-4.5,-1.5) {\texttt{y $\leftarrow$ 3}};
\end{tikzpicture}
\begin{cppcode*}{linenos=false}
void thread0() {
int a = y;
int b = x;
}
\end{cppcode*}
\vspace{-\medskipamount}
\caption{
Suppose \texttt{thread0} is about to execute with the displayed contents of store buffers of two other threads and suppose it had nondeterministically chosen to load value 2 from \texttt{y} (denoted by \textcolor{frombuf}{green} in the figure).
The entries at the top of the store buffers are the oldest entries.
}\label{fig:mm:flushflagA}
\end{subfigure}
%
\hfill
%
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
, semithick
, scale=0.5
, thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
]
\draw [-] (-10,0) rectangle (-7,-4);
\draw [-] (-10,-1) -- (-7,-1)
(-10,-2) -- (-7,-2)
(-10,-3) -- (-7,-3);
\node () [anchor=center] at (-8.5, 0.5) {s.b. 1};
\node () [anchor=center] at (-8.5,-0.5) {\color{flushed}\texttt{x $\leftarrow$ 1}};
\node () [anchor=center] at (-8.5,-1.5) {\color{flushed}\texttt{x $\leftarrow$ 2}};
\draw [-] (-6,0) rectangle (-3,-4);
\draw [-] (-6,-1) -- (-3,-1)
(-6,-2) -- (-3,-2)
(-6,-3) -- (-3,-3);
\node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
\node () [anchor=center] at (-4.5,-0.5) {\color{frombuf}\texttt{x $\leftarrow$ 3}};
\node () [anchor=center] at (-4.5,-1.5) {\texttt{y $\leftarrow$ 3}};
\end{tikzpicture}
\begin{cppcode*}{linenos=false}
void thread0() {
int a = y;// →2
int b = x;
}
\end{cppcode*}
\vspace{-\medskipamount}
\caption{At this point, \texttt{x}~entries of store buffer 1 are marked as flushed (\textcolor{flushed}{orange}) and the \mbox{$\texttt{y} \leftarrow \texttt{1}$} entry was removed as it was succeeded by the used entry \mbox{$\texttt{y} \leftarrow \texttt{2}$}.
The thread had nondeterministically selected to load \texttt{x} from store buffer 2.
}\label{fig:mm:flushflagB}
\end{subfigure}
%
\hfill
%
\begin{subfigure}[t]{0.3\textwidth}
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
, semithick
, scale=0.5
, thck/.style = { thick, decoration={markings,mark=at position 1 with {\arrow[scale=4]{>}}}, postaction={decorate}, },
]
\draw [-] (-10,0) rectangle (-7,-4);
\draw [-] (-10,-1) -- (-7,-1)
(-10,-2) -- (-7,-2)
(-10,-3) -- (-7,-3);
\node () [anchor=center] at (-8.5, 0.5) {s.b. 1};
\draw [-] (-6,0) rectangle (-3,-4);
\draw [-] (-6,-1) -- (-3,-1)
(-6,-2) -- (-3,-2)
(-6,-3) -- (-3,-3);
\node () [anchor=center] at (-4.5, 0.5) {s.b. 2};
\node () [anchor=center] at (-4.5,-0.5) {\texttt{y $\leftarrow$ 3}};
\end{tikzpicture}
\begin{cppcode*}{linenos=false}
void thread0() {
int a = y;// →2
int b = x;// →3
}
\end{cppcode*}
\vspace{-\medskipamount}
\caption{
In the load of \texttt{x}, all \texttt{x} entries were evicted from the buffers -- all the flushed entries for \texttt{x} (which were not selected) had to be dropped before \mbox{$\texttt{x} \leftarrow \texttt{3}$} was propagated to the memory.
The last entry (\mbox{$\texttt{y} \leftarrow \texttt{3}$}) will remain in the store buffer if \texttt{y} will never be loaded in the program again.
}\label{fig:mm:flushflagC}
\end{subfigure}
\caption{An illustration of the \emph{delayed flushing}.}\label{fig:mm:flushflag}
\end{figure}
\paragraph{Program Transformation}%
%
\divine handles C and C++ code by translating it to \llvm and instrumenting it (see Figure \ref{fig:workflow} for \divine's workflow).
The support for relaxed memory is added in the instrumentation step, by replacing memory operations with calls to functions which simulate relaxed behaviour.
Essentially, all loads, stores, atomic compound instructions\mnote{Compare and swap, atomic exchange, and compound arithmetic and logic operations such as fetch and add}, and fences are replaced by calls to the appropriate functions.\mnote{With the exception of atomic compound operations, all modifications to the memory in LLVM must be performed by a series of instructions containing a load to a register, modifications of the register, and store of register value to the memory. Therefore it is sufficient to consider only this limited number of instructions as other instructions do not access memory at all.}
All of the \xtso-simulating functions are implemented so that they are executed atomically by \divine (i.e., not interleaved).
\paragraph{\xtso Memory Operations}
%
The most complex of these is the load operation.
It first finds all entries with overlap the loaded address (\emph{matching entries}) and out of these matching entries, it nondeterministically selects entries which will be written before the load (\emph{selected entries}).
All matching entries marked as flushed have to be selected for writing.
Similarly, all matching entries which occur in a store buffer before a selected entry also have to be selected.
Out of the selected entries, one is selected to be written last -- this will be the entry read by the load.
Next, selected entries are written, and all nonmatching entries which precede them are marked as flushed.
Finally, the load is performed, either from the local store buffer if matching
entry exists there, or from the shared memory.
The remaining functions are relatively straightforward.
Stores push a new entry to the store buffer, possibly evicting the oldest entry
from the store buffer if the store buffer exceeds its size bound.
Fences flush all entries from the store buffer of the calling thread to the
memory.
Atomic operations are basically a combination of a load, store, and a fence --
all atomics are sequentially consistent under \xtso.
The only intricate part of these operations is that if an entry is flushed out
of the store buffer, the entries from other store buffers which involve the
same memory location can also be nondeterministically flushed (to simulate
they could have been flushed before the given entry).
This flushing is similar to flushing performed in load.
An example which shows a series of loads can be found in~\autoref{fig:mm:flushflag}.
\paragraph{Correctness}%
%
We will now argue that this way of implementing \xtso is correct.
First, the nondeterminism in selecting entries to be flushed before a load serves the same purpose as the nondeterminism in the flusher thread of the more conventional implementation.
The only difference is that in the flusher-thread scenario the entries are
flushed in order, while in our new approach we are selecting only from the matching entries.
Therefore, the difference between the two approaches is only on those entries
which are not loaded by the load causing the flush, hence cannot be observed by the load.
However, any entry which would be flushed before the selected entries in the flusher-thread approach is now marked with the flushed flag.
This flag makes sure that such an entry will be flushed before an address which matches it is loaded, and therefore it behaves as if it was flushed.
This way, the in-thread store order is maintained.
\subsection{Stores to Freed Memory}
As \xtso simulation can delay memory stores, special care must be taken to preserve memory safety of the program.
More precisely, it is necessary to prevent the transformed program from writing into freed memory.
This problem occurs if a store to dynamically allocated memory is delayed after the memory is freed, or if a store to stack location is delayed after the corresponding function had returned.
This problem does not require special handling in normal program execution as both stack addresses as well as dynamic memory addresses remain to be writable for the program even after they are freed (except for memory-mapped files, but these have to be released by a system call which includes sufficiently strong memory barrier which prevents memory accesses from being delayed past the system call).
To solve the problem of freed memory, it is necessary to evict store buffer entries which correspond to the freed memory just before the memory is freed.
For entries not marked as flushed, this eviction concerns only store buffer of the thread which freed the memory.
If some other thread attempted to write to the freed memory, this is an error as there is insufficient synchronisation between the freeing and the store to the memory.
However, corresponding entries marked as flushed should be evicted from all store buffers, as these entries correspond to changes which should have been already written to the shared memory.
Eviction of dynamic memory is straightforward -- the program transformation injects a call to the eviction function just before every call to the function which releases memory.
For eviction of stack memory, it is necessary to evict all local addresses whenever a function exits, regardless of the way it exits.
This means we also have to take into account exceptions and other ways of performing non-local transfers of control (e.g., \texttt{longjmp}).
The program transformation takes care of tracking which local memory addresses should be evicted and inserts code to evict them at the end of functions.
\subsection{Integration with Other Parts of \divine}
The integration of \xtso simulation with the rest of \divine is straightforward in most cases.
No changes are required in the \divine's execution engine or state space exploration algorithms.
As for the libraries shipped with \divine, only minor tweaks were required.
The \texttt{pthread} implementation had to be modified to add full memory barrier both at the beginning and at the end of every synchronising functions.
This corresponds to barriers present in the implementations used for normal execution, \texttt{pthread} mutexes and other primitives have to guarantee sequential consistency of the guarded operations (provided all accesses are properly guarded).
The \divine's operating system, \dios, is used to implement low-level threading as well as simulation of various filesystem APIs \mcite{RBMKB2019}.
We had to add memory barrier into the system call entry which hands control to \dios.
\dios itself does not use relaxed memory simulation -- the implementation of \xtso operations detects that the code is executed in the kernel mode and bypasses store buffers.
In this way, the entire \dios executes as if under sequential consistency which makes its design simpler and more robust.
This synchronisation is easily justifiable -- system calls require a memory barrier or kernel lock in most operating systems.
\paragraph{Relaxed Memory and Abstract Data}
%
DIVINE has support for symbolic and abstract data (data nondeterminism) which
uses program transformation to implement symbolic manipulation of
data~\mcite{LRB2018}.
However, the integration of support for abstract data and for memory models is
not yet done as it requires more complex changes than just a straightforward
composition of the two program transformations.
Performing transformation for abstractions first is not possible currently, as
it introduces additional memory operations and therefore is not correct in
presence of relaxed memory.
On the other hand, it should be possible to run the transformation which
introduces relaxed memory behaviour first, and only after that run the
transformation which introduces abstraction.
However, this would mean that if an abstract value is ever stored to memory
(i.e., it is not only present in LLVM registers) then all values loaded from
the memory are potentially abstract, because the store buffers themselves need
to be abstract.
Therefore, this approach is not practical because it introduces too much extra
overhead.
Overall, we defer this problem to future work.
\subsection{Further Optimizations} \label{sec:mm:opt}
We have implemented two further optimisations of our \xtso simulation.
\paragraph{Static Local Variable Detection}
Accesses of local variables which are not accessible to other threads need not
use store buffering. For this reason, we have inserted a static analysis pass
which annotates accesses to local memory before the \xtso instrumentation. The
instrumentation ignores such annotated accesses. The static analysis can
detect most local variables which are never accessed using pointers.
\paragraph{Dynamic Local Memory Detection}
\divine can also dynamically detect if the given memory object is shared between threads (i.e., it is accessible from global variables or stacks of more then one thread).
Using this information, it is possible to dynamically bypass store buffers for operations with non-shared memory objects.
This optimisation is correct even though the shared status of memory can change during its lifetime.
A memory object $o$ can become shared only when its address is written to some memory object $s$ which is already shared (or $o$ can become shared transitively through a series of pointers and intermediate objects).
For this to happen, there has to be a store to the already shared object $s$, and this store has to be propagated to other threads.
Once the store of the address of $o$ is executed and written to the store buffer, $o$ becomes shared, and any newer stores into it will go through the store buffer.
Furthermore, once this store is propagated, any store which happened before turning $o$ into a shared object also had to be propagated as \xtso does not reorder stores.
Therefore, there is no reason to put stores to $o$ through the store buffer if $o$ is not shared.
This optimisation is not correct for memory models which allow store reordering -- for such memory models, we would need to know that the object will never be shared during its lifetime.
\subsection{Bounding the Size of Store Buffers}
The complexity of analysis of programs under the \xtso memory model is high.
From the theoretical point of view, we know due to~\mcite{wmdecidability} that reachability for programs with finite-state threads which run under TSO is decidable, but non-primitive recursive (it is in \textsc{pspace} for sequential consistency).
The proof uses the so-called SPARC TSO memory model~\mcite{SPARC94} that is very similar to \xtso. However, the proof of decidability does not translate well to an efficient decision procedure, and real-world programs are much more complex than the finite-state systems used in the decidability proof.
For this reason, we would need to introduce unbounded store buffers to properly
verify real-world programs. Unfortunately, this can be impractical, especially for programs which do not terminate.
Therefore, our program transformation
inserts store buffers of limited size, limiting thus the number of store
operations that can be delayed at any given time. The
size of the store buffers is fully configurable, and it currently defaults to
32, a value probably high enough to discover most bugs which can be observed on
real hardware.
Our implementation also supports the store buffers of unlimited size (when size
is set to 0). In this mode, programs with infinite loops that write into shared
memory will not have finite state space under \xtso even if they have finite
state space under sequential consistency.
Therefore, \divine will not terminate unless it discovers an error in the
program.
Verification with unbounded buffers will still terminate for terminating
programs and for all programs with errors.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Evaluation} \label{sec:evaluation}
The implementation is available at \url{http://divine.fi.muni.cz/2018/x86tso/}, together with information about how to use it.
It is also integrated into DIVINE releases.
We compared our implementation with the stateless model checker Nidhugg~\mcite{Abdulla2015} and the bounded model checker CBMC~\mcite{Clarke2004,Kroening2014}.
For evaluation we used SV-COMP 2017 benchmarks from the Concurrency
category~\mcite{SV-COMP:2017}, excluding benchmarks with data
nondeterminism\mnote{I.e., all the benchmarks which contain calls to
functions of the \texttt{\_\_VERIFIER\_nondet\_*} family were excluded.} as
DIVINE does not yet support data nondeterminism with relaxed memory.
Furthermore, due to the limitation of stateless model checking with DPOR,
Nidhugg cannot handle data nondeterminism at all.
There are 55 benchmarks in total.
The evaluation was performed on a machine with two dual core Intel Xeon 5130
processors running at 2 GHz with 16 GB of RAM. Each tool was running with memory
limit set to 10 GB and time limit set to 1 hour. The tools were not limited in
the number of CPUs they can use.
We have used CBMC version 5.8 with the option \texttt{-{}-mm tso}. Since there is no official release of Nidhugg, we have used version 0.2 from git, commit id
\texttt{375c554} with \texttt{-tso} option to enable relaxed memory support and inserted a definition of the \texttt{\_\_VERIFIER\_error} function.
For \divine, we have used the \texttt{-{}-svcomp} option to enable support for SV-COMP atomic sections (which are supported by default by CBMC and Nidhugg), and we disabled nondeterministic memory failure by using the \texttt{divine check} command (SV-COMP does not consider the possibility of allocation failure).
To enable \xtso analysis, \texttt{-{}-relaxed-memory tso} is used for \divine.\mnote{The complete invocation is \texttt{divine check -{}-svcomp -{}-relaxed-memory tso BENCH.c}.}
The buffer bound was the default value (32) unless stated otherwise.
\begin{table}[tt]
\caption{
Comparison of the default configuration of \divine with CBMC and Nidhugg.
} \label{tab:mm:default:svc}
\centering
\begin{tabular}{lrrr} \toprule
& CBMC & \hspace*{1ex} Nidhugg & \hspace*{1ex} \divine \\ \midrule
finished & 21 & 25 & 27 \\
TSO bugs & 3 & 3 & 9 \\
unique & 5 & 3 & 5 \\
\bottomrule
\end{tabular}
\end{table}
\autoref{tab:mm:default:svc} compares performance of the default configuration of \divine with the remaining tools.
The line ``finished'' shows the total number of benchmarks for which the verification task finished with the given limits.
From these the line ``TSO bugs'' shows the number of errors caused by relaxed memory in benchmarks which were not supposed to contain any bugs under sequential consistency.
All discovered errors were manually checked to really be observable under the \xtso memory model.
Finally, ``unique'' shows the number of benchmarks solved only by the given tool and not the other two. There were only 8 benchmarks solved by all three tools, all of them without any errors found.
\begin{table}[th]
\caption{
Comparison of various configurations of \divine.
The ``base'' version uses none of the improvements from \autoref{sec:mm:opt}.
The configurations marked with ``s'' add the static local variable optimisation, while the configurations marked with ``d'' add the dynamic detection of non-shared memory objects.
The ``+sdu'' configuration has both optimisations enabled and it has unbounded buffers.
Finally, the ``+sd4'' has buffer bound set to 4 entries instead of the default 32 entries.
The default version is ``+sd''.
} \label{tab:mm:divine:svc}
\centering
\begin{tabularx}{\textwidth}{lRRRRRR} \toprule
& base & +s & +d & +sd & +sdu & +sd4 \\ \midrule
finished & 26 & 26 & 27 & 27 & 27 & 27 \\
TSO bugs & 8 & 8 & 9 & 9 & 9 & 9 \\
\bottomrule
\end{tabularx}
\bigskip
\caption{
Comparison of various versions of \divine on benchmarks on the 26 which all the versions finished.
For the description of these versions, refer to \autoref{tab:mm:divine:svc}.
} \label{tab:mm:divine:time:svc}
\centering
\begin{tabularx}{\textwidth}{lRRRRRR} \toprule
& base & +s & +d & +sd & +sdu & +sd4 \\ \midrule
states & 252 k & 263 k & 250 k & 231 k & 206 k & 296 k \\
time & 2:14:49 & 2:17:13 & 1:09:23 & 1:05:05 & 0:58:28 & 1:24:59 \\
\bottomrule
\end{tabularx}
\end{table}
\autoref{tab:mm:divine:svc} shows effects of buffer size bound and improvements described in \autoref{sec:mm:opt}.
It can be seen that all versions perform very similarly, only one more benchmark was solved by the versions with dynamic shared object detection (the remaining solved benchmarks were the same for all versions).
The number of solved benchmarks remains the same regardless of used store buffer bound.
\autoref{tab:mm:divine:time:svc} offers more detailed look at the 26 benchmarks solved by all versions of \divine.
It shows the aggregate differences in state space sizes and solving times.
It can be seen that the dynamic shared object detection improves performance significantly.
Interestingly, we can see that of the 3 versions which differ only in store buffer size (``+sd'', ``+sdu'', and ``+sd4''), the unbounded version performs the best.
We expect this to be caused by the nondeterminism in flushing the excessive entries out of the store buffer when the bound is reached -- this can trigger flushing of matching entries from other store buffers and therefore increase nondeterminism.
\section{Related work}
A lot of techniques have support for program analysis under relaxed memory models, most of these were already described in \autoref{chap:stateoftheart} and therefore, we will not replicate the description here.
A lot of these techniques is build on stateless model checking (see also \autoref{sec:stateoftheart:smc}) including \mcite{Demsky2015,Zhang2015,Huang2016,Abdulla2017tso} for \xtso and \mcite{Abdulla2016,Norris2016,Kokologiannakis2017,Abdulla2018} for other memory models. Bounded model checking (see also \autoref{sec:stateoftheart:bmc}) is another popular choice for analysis of \xtso \mcite{Alglave2013po,Gavrilenko2019,Tomasco2017} and other memory models \mcite{Burckhardt2007,Abdulla2017}.
An explicit-state approach to programs running under the C\# relaxed memory model is presented in \mcite{Huang2016}.
There are also previous works which use program transformation in the analysis of relaxed memory models \mcite{Alglave2013,Abdulla2017,SRB15weakmem}.
Our approach to simulation of \xtso is somewhat similar to the approach presented in \mcite{Abdulla2018}. Both approaches perform nondeterministic choice primarily on memory loads. However, \cite{Abdulla2018} is focusing on the release-acquire fragment of C11 and does not support sequential consistency.
Similarly, \mcite{Kokologiannakis2017} presents a technique that attempts to minimise nondeterminism, this time in the analysis of a modified version of the C11 memory model.
There are also some approaches to theanalysis of relaxed memory models which were not mentioned in \autoref{chap:stateoftheart}, mostly because they lack implementation which works with realistic programming languages.
In \mcite{Atig2011} a program transformation that uses context switch bounding but not buffer bounding is presented. It uses additional copies for shared variables for TSO simulation. However, the evaluated was performed with manually converted programs.
An unbounded approach to verification of programs under TSO is presented in~\mcite{Linden2010}.
It uses store buffers represented by automata and leverages cycle iteration acceleration to get a representation of store buffers on paths which would form cycles if values in store buffers were disregarded.
It targets a modified Promela modelling language~\mcite{Holzmann1997}.
Another unbounded approach is presented in~\mcite{Bouajjani2015}.
It introduces TSO behaviours lazily by iterative refinement, and while it is not complete, it should eventually find all errors.
The work~\mcite{Dan2013} presents verification of (potentially infinite state space) programs under TSO and PSO (with bounded store buffers) using predicate abstraction. It first analyses the program under sequential consistency and then extrapolates the predicates to TSO/PSO and performs further analysis (and possibly refinement).
% CUPEX \mcite{Dan2013}
% - explores possibilities for combination of predicate abstraction and relaxed memory models (TSO, PSO)
% - new predictates are extrapolated from predicates used under SC
% - first verifies the program under SC, then constructs a modified program which encodes the memory model (runs of the modified program under SC simulate buffer-bounded runs of the original program under the given memory model), finally it discovers new predicates for the modified program and check it.
% - probably a modelling language
\paragraph{Absence of Relaxed Behaviour}
There are also techniques which aim to solve the problem of absence of relaxed behaviour in a given program.
For these methods, the question is whether a program, when running under a relaxed memory model, exhibits any runs not possible under sequential consistency.
This problem is explored under many names, e.g. (TSO-)safety~\mcite{Burckhardt2008}, robustness~\mcite{Bouajjani2013,Derevenetc2014}, stability~\mcite{Alglave2011}, and monitoring of sequential consistency~\mcite{Burnim2011}.
Similar techniques are used in \mcite{Yang2004} to detect data races in Java programs.
A related problem of correspondence between a parallel and sequential implementation of a data structure is explored in~\mcite{Ou2017}.
Some of these techniques can be also used to insert memory fences into the programs to recover sequential consistency.
Neither of these techniques are directly comparable to our method.
For these techniques, a program is incorrect if it exhibits relaxed behaviour, while for us, it is incorrect if it violates specification (e.g., assertion safety and memory safety).
In practice, the appearance of relaxed behaviour is often not a problem, provided the overall behaviour of the data structure or algorithm matches the desired specification.
In many lock-free data structures, relaxed behaviour is essential to achieving high performance.
\paragraph{Other Methods}\label{other-methods}
In~\mcite{Park1995}, the SPARC hierarchy of memory models (TSO, PSO, RMO) is modelled using encoding from assembly to Mur\(\varphi\)~\mcite{Murphi}.
A completely different approach is taken in~\mcite{Turon2014}.
This work introduces a separation logic GPS, which allows proving properties about programs using (a fragment of) the C11 memory model.
This work is intended for manual proving of properties of parallel programs, not for automatic verification.
The memory model used in this work is not complete; it lacks relaxed and release-consume accesses.
Another fragment of the C11 memory model is targeted by the RSL separation logic introduced in~\mcite{Vafeiadis2013}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion} \label{sec:conclusion}
We showed that by careful design of simulation of relaxed memory behaviour we
can use the standard model checker supporting only the sequential consistency to
efficiently detect relaxed memory errors in programs that are otherwise correct
under sequentially consistent memory. Moreover, according to our experimental
evaluation, our explicit-state model checking approach outperforms
a state-of-the-art stateless model checker as well as bounded model checker,
which is actually quite an unexpected result. We also show that many of the used
benchmarks can be solved only by one or two of the three evaluated tools, which
highlights the importance of employing different approaches to the analysis of programs
under relaxed memory. Finally, we show that for terminating programs, our
approach is viable both with bounded and unbounded store buffer size.
% vim: colorcolumn=80 expandtab sw=4 ts=4 spell spelllang=en