

## Program logics for weakly-consistent memory

Xavier Leroy

Collège de France, chair of software sciences xavier.leroy@college-de-france.fr

## Sequential consistency: an idealized model for concurrent programming

*A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.* (L. Lamport, 1979)

For a shared-memory system: the state of the shared memory is the result of an interleaving of the memory operations of the processes.

- 1. Time-sharing on a monoprocessor.
- 2. Multiprocessor system with a single "port" to access memory.






#### Reminder

Semantics for the  $c_1 \parallel c_2$  construct: an interleaving of the reductions of  $c_1$  and  $c_2$ .

 $\begin{array}{ll} (a_1 \parallel a_2)/h \to 0/h & (\text{or any combination of } a_1 \text{ and } a_2) \\ (c_1 \parallel c_2)/h \to (c_1' \parallel c_2)/h' & \text{if } c_1/h \to c_1'/h' \\ (c_1 \parallel c_2)/h \to (c_1 \parallel c_2')/h' & \text{if } c_2/h \to c_2'/h' \\ (c_1 \parallel c_2)/h \to \text{err} & \text{if } c_1/h \to \text{err or } c_2/h \to \text{err} \end{array}$ 

Our semantics for parallelism in PTR was already SC!

The SC model defines precisely the semantics of concurrent programs even when they contain race conditions.

This opens the way to concurrent algorithms that use race conditions in a well-controlled manner.

These algorithms are useful when the atomic instructions of the processor or the critical sections of the language are lacking or too expensive.

### Peterson's mutual exclusion algorithm

```
flag: bool[2] = {false, false}; turn: {0, 1};
```

```
// Process 0                        // Process 1
flag[0] := true;                    flag[1] := true;
turn := 1;                          turn := 0;
while flag[1] ∧ turn = 1            while flag[0] ∧ turn = 0
do skip done;                       do skip done;
// enter critical section           // enter critical section
...                                 ...
// leave critical section           // leave critical section
flag[0] := false                    flag[1] := false
```


With an enumeration of SC interleavings, we can show that both processes are simultaneously in their critical sections only if  $flag[0] = flag[1] = true \land turn = 0 \land turn = 1$ , which is impossible.
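The enumeration of SC interleavings can be mechanized. The sketch below is one possible way to do it, not the method used in the lecture: it assumes that every individual read or write is one atomic step, that each process runs the protocol once, and that program counter 4 means "inside the critical section". The `swapped` flag is a hypothetical variant (the two initial writes exchanged) added only to show that the check can fail.

```python
from collections import deque

def step(state, i, swapped):
    """One atomic step of process i under SC, or None if it has terminated."""
    pcs, flag, turn = state
    j = 1 - i                          # index of the other process
    pc, flag = pcs[i], list(flag)
    if pc == 0:                        # first write of the entry protocol
        if swapped:
            turn = j
        else:
            flag[i] = True
        nxt = 1
    elif pc == 1:                      # second write of the entry protocol
        if swapped:
            flag[i] = True
        else:
            turn = j
        nxt = 2
    elif pc == 2:                      # read flag[j]; enter if it is false
        nxt = 3 if flag[j] else 4
    elif pc == 3:                      # read turn; keep waiting if turn = j
        nxt = 2 if turn == j else 4
    elif pc == 4:                      # leave the critical section
        flag[i] = False
        nxt = 5
    else:
        return None
    new_pcs = tuple(nxt if k == i else pcs[k] for k in (0, 1))
    return (new_pcs, tuple(flag), turn)

def mutual_exclusion_holds(swapped=False):
    """Explore every SC interleaving; False if both processes can be at pc 4."""
    for initial_turn in (0, 1):
        init = ((0, 0), (False, False), initial_turn)
        seen, queue = {init}, deque([init])
        while queue:
            state = queue.popleft()
            if state[0] == (4, 4):
                return False
            for i in (0, 1):
                succ = step(state, i, swapped)
                if succ is not None and succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
    return True
```

Calling `mutual_exclusion_holds()` explores the algorithm as written, while `mutual_exclusion_holds(swapped=True)` explores the variant where `turn` is written before `flag[i]`.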

## The ticket lock algorithm



A process that tries to enter the critical section

- takes the next ticket (atomic increment)
- waits for the number on its ticket to be displayed.

When it leaves the critical section, it ensures the next number is displayed.

Two global variables: next and now\_serving, initially 0.

```
lock() {
    int t = fetch_and_add(&next, 1); // atomic increment
    while (now_serving != t) pause(); // nonatomic read
}
unlock() {
    now_serving = now_serving + 1; // nonatomic increment
}
```

The increment of next must be atomic (otherwise two processes could get the same ticket).

Accesses to now\_serving need not be atomic.
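To see why the increment of next must be atomic, here is a minimal sketch that enumerates the schedules of two processes taking a ticket. It is a model, not the C code above: `take_tickets` and `outcomes` are hypothetical names, and the nonatomic increment is assumed to decompose into one read step followed by one write step.

```python
from itertools import permutations

def take_tickets(schedule, atomic):
    """Tickets obtained by processes 0 and 1 when their steps run in `schedule` order."""
    next_ticket = 0
    seen = [None, None]        # value of `next` read by each process (nonatomic case)
    ticket = [None, None]
    for p in schedule:
        if atomic:                     # fetch_and_add: read and write in one step
            ticket[p] = next_ticket
            next_ticket += 1
        elif seen[p] is None:          # nonatomic step 1: read next
            seen[p] = next_ticket
        else:                          # nonatomic step 2: write back read value + 1
            ticket[p] = seen[p]
            next_ticket = seen[p] + 1
    return tuple(ticket)

def outcomes(atomic):
    """Set of (ticket of process 0, ticket of process 1) over all schedules."""
    steps = (0, 1) if atomic else (0, 0, 1, 1)
    return {take_tickets(s, atomic) for s in set(permutations(steps))}
```

With the atomic increment the two tickets are always distinct; with the split increment, the outcome `(0, 0)` appears, meaning both processes would enter the critical section together.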

# The real world: weakly-consistent memory models

#### A fragment of Dekker's algorithm for mutual exclusion:

$$\begin{array}{c|c} \operatorname{set}(X,1); & \operatorname{set}(Y,1); \\ \operatorname{let} a = \operatorname{get}(Y) & \operatorname{let} b = \operatorname{get}(X) \end{array}$$

Initially, X = Y = 0.

In the SC model:

- either set(X, 1) executes first, and then b = 1 at the end;
- or set(Y, 1) executes first, and then a = 1 at the end.

Therefore, the final state a = b = 0 is impossible.
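This case analysis can be checked by brute force. The sketch below assumes a tuple encoding of the two processes (`("set", location, value)` and `("get", register, location)`, both hypothetical) and runs every SC interleaving of the fragment.

```python
def interleavings(p, q):
    """Yield every merge of sequences p and q that preserves each one's order."""
    if not p or not q:
        yield list(p) + list(q)
        return
    for rest in interleavings(p[1:], q):
        yield [p[0]] + rest
    for rest in interleavings(p, q[1:]):
        yield [q[0]] + rest

def run_sc(schedule):
    """Execute one interleaving against a single shared memory; return (a, b)."""
    mem, regs = {"X": 0, "Y": 0}, {}
    for op in schedule:
        if op[0] == "set":
            _, loc, val = op
            mem[loc] = val
        else:
            _, reg, loc = op
            regs[reg] = mem[loc]
    return regs["a"], regs["b"]

P1 = [("set", "X", 1), ("get", "a", "Y")]
P2 = [("set", "Y", 1), ("get", "b", "X")]
sc_outcomes = {run_sc(s) for s in interleavings(P1, P2)}
```

The set `sc_outcomes` contains the three pairs where at least one of a, b is 1, and never `(0, 0)`.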

We write the test in x86 assembly (to get full control over the code) and execute it with the litmus7 tool:

```
Test Dekker Allowed
Histogram (4 states)
178 *>0:rax=0; 1:rax=0;
1999870:>0:rax=1; 1:rax=0;
1999881:>0:rax=0; 1:rax=1;
71 :>0:rax=1; 1:rax=1;
```

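The outcome a = b = 0 (both threads end with rax=0), impossible in the SC model, is observed 178 times out of 4,000,000 runs.

The sequential consistency model is respected

- neither by modern hardware architectures (in order to provide faster memory subsystems),
- nor by optimizing compilers (in order to increase the performance of generated code).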

#### Hardware: write buffers, store buffers



(Higham, Jackson, Kawash, 2007)

Each processor puts its writes in a buffer while they are transmitted to the shared main memory.

Writes performed by a processor are immediately visible to this processor, but not immediately to the other processors.

## A non-SC execution

$$\begin{array}{l|l} \operatorname{set}(X,1); & \operatorname{set}(Y,1); \\ \operatorname{let} a = \operatorname{get}(Y) & \operatorname{let} b = \operatorname{get}(X) \end{array}$$

| Time    | Processor 1                           | Processor 2                           |
|---------|---------------------------------------|---------------------------------------|
| $t = 0$ | puts $X \leftarrow 1$ in its buffer   | puts $Y \leftarrow 1$ in its buffer   |
| $t = 1$ | reads $Y = 0$ from main memory        |                                       |
| $t = 2$ |                                       | reads $X = 0$ from main memory        |
| $t = 3$ | sends $X \leftarrow 1$ to main memory |                                       |
| $t = 4$ |                                       | sends $Y \leftarrow 1$ to main memory |
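The execution in the table is one path of a store-buffer machine. The sketch below explores all paths of such a machine for the Dekker fragment. It is an illustration under assumptions: one FIFO buffer per processor, stored oldest-first, a flush step that may happen at any time, and a hypothetical tuple encoding of instructions.

```python
from collections import deque

# Dekker fragment: ("set", location, value) or ("get", register, location).
PROGRAMS = (
    (("set", "X", 1), ("get", "a", "Y")),
    (("set", "Y", 1), ("get", "b", "X")),
)

def successors(state):
    """All one-step successors of a machine state (processors, memory, registers)."""
    procs, mem, regs = state
    for i, (pc, buf) in enumerate(procs):
        def with_proc(new):            # replace the entry of processor i
            return procs[:i] + (new,) + procs[i + 1:]
        if buf:                        # the oldest buffered write reaches main memory
            loc, val = buf[0]
            new_mem = tuple(sorted({**dict(mem), loc: val}.items()))
            yield with_proc((pc, buf[1:])), new_mem, regs
        if pc < len(PROGRAMS[i]):
            op = PROGRAMS[i][pc]
            if op[0] == "set":         # a write only goes to the local buffer
                yield with_proc((pc + 1, buf + ((op[1], op[2]),))), mem, regs
            else:                      # a read looks in the local buffer first
                _, reg, loc = op
                own = [v for (l, v) in buf if l == loc]
                val = own[-1] if own else dict(mem)[loc]
                new_regs = tuple(sorted({**dict(regs), reg: val}.items()))
                yield with_proc((pc + 1, buf)), mem, new_regs

def final_outcomes():
    """Set of final (a, b) pairs over all executions of the buffered machine."""
    init = (((0, ()), (0, ())), (("X", 0), ("Y", 0)), ())
    seen, queue, outcomes = {init}, deque([init]), set()
    while queue:
        state = queue.popleft()
        succs = list(successors(state))
        if not succs:                  # programs finished and buffers drained
            regs = dict(state[2])
            outcomes.add((regs["a"], regs["b"]))
        for s in succs:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return outcomes
```

In this model `final_outcomes()` contains `(0, 0)` in addition to the three SC outcomes, which matches the four states reported by the litmus7 histogram.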

The processor can reorder instructions on the fly, so as to start long-running instructions earlier (e.g. memory reads).

$$\text{write } X; \ldots; \text{read } Y \quad \rightarrow \quad \text{read } Y; \text{write } X; \ldots$$

This out-of-order execution is often speculative: if the processor realizes that X = Y, it cancels the anticipated read from Y, or satisfies it with the value written to X (forwarding).

#### Machine code:

$$\begin{array}{l|l} \operatorname{set}(X,1); & \operatorname{set}(Y,1); \\ \operatorname{let} a = \operatorname{get}(Y) & \operatorname{let} b = \operatorname{get}(X) \end{array}$$

Code actually executed by the processor after on-the-fly reordering:

$$\begin{array}{l|l} \operatorname{let} a = \operatorname{get}(Y) \operatorname{in} & \operatorname{let} b = \operatorname{get}(X) \operatorname{in} \\ \operatorname{set}(X, 1) & \operatorname{set}(Y, 1) \end{array}$$

The reordered code can obviously terminate with a = b = 0.

A misaligned datum can span two cache lines, requiring two memory accesses per access to the datum.

$$\operatorname{set}(X, \texttt{0x12345678}) \parallel \operatorname{let} a = \operatorname{get}(X)$$

Starting from X = 0, we have two SC executions: a = 0 or a = 0x12345678.

If X is misaligned, this code can be executed like

$$\begin{array}{l|l} \operatorname{set}(X_1, \texttt{0x1234}); & \operatorname{let} a_1 = \operatorname{get}(X_1) \operatorname{in} \\ \operatorname{set}(X_2, \texttt{0x5678}); & \operatorname{let} a_2 = \operatorname{get}(X_2) \operatorname{in} \\ & \operatorname{let} a = a_1 \ll 16 \mid a_2 \end{array}$$

After splitting the memory accesses, a third result is possible: a = 0x12340000, coming from  $a_1 = 0x1234$  and  $a_2 = 0$ .
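The split accesses can be enumerated in the same brute-force style. The sketch below assumes plain interleaving of the four half-word accesses and a hypothetical tuple encoding; `X1` and `X2` stand for the two halves of X.

```python
def interleavings(p, q):
    """Yield every merge of sequences p and q that preserves each one's order."""
    if not p or not q:
        yield list(p) + list(q)
        return
    for rest in interleavings(p[1:], q):
        yield [p[0]] + rest
    for rest in interleavings(p, q[1:]):
        yield [q[0]] + rest

def run(schedule):
    """Execute one interleaving of the half-word accesses; return the value of a."""
    mem, regs = {"X1": 0, "X2": 0}, {}
    for op in schedule:
        if op[0] == "set":
            _, loc, val = op
            mem[loc] = val
        else:
            _, reg, loc = op
            regs[reg] = mem[loc]
    return regs["a1"] << 16 | regs["a2"]

WRITER = [("set", "X1", 0x1234), ("set", "X2", 0x5678)]
READER = [("get", "a1", "X1"), ("get", "a2", "X2")]
outcomes = {run(s) for s in interleavings(WRITER, READER)}
```

Besides 0 and 0x12345678, this enumeration finds 0x12340000, and also 0x00005678 (first half read before the write, second half read after it), which is another mixed value of the same kind.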

Note: the value 0x12340000 appears nowhere in the initial code. It appears out of thin air!

At the hardware level, the programmer can use barrier instructions that prevent the processor from reordering certain memory accesses:

- "Strong" barrier: preserves ordering between accesses before the barrier and accesses after the barrier.
- "Weak" barrier: preserves ordering between reads before the barrier and accesses after the barrier.

Other instructions with special memory behaviors:

- "locked" instructions (the x86 lock prefix);
- load-acquire and store-release (Itanium, ARM);
- etc.

The compiler can reorder independent reads and writes (at addresses *X*, *Y* that are guaranteed to be different).

Typically, reads are anticipated while writes are delayed.

```
write X; ...; read Y  →  read Y; write X; ...
```

 $\rightarrow$  same non-SC behaviors as dynamic reordering by the processor.

#### A special case of common subexpression elimination (CSE).

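$$\begin{array}{lcl} \operatorname{let} a = \operatorname{get}(X) \operatorname{in} & & \operatorname{let} a = \operatorname{get}(X) \operatorname{in} \\ \ldots \text{ (no writes to } X) \ldots & \rightsquigarrow & \ldots \\ \operatorname{let} b = \operatorname{get}(X) \operatorname{in} & & \operatorname{let} b = a \operatorname{in} \\ \ldots & & \ldots \end{array}$$

Example:

$$\begin{array}{l|l} \operatorname{let} a = \operatorname{get}(X) \operatorname{in} & \operatorname{set}(X, 1); \\ \operatorname{let} b = \operatorname{get}(Y) \operatorname{in} & \operatorname{set}(Y, 1); \\ \operatorname{let} c = \operatorname{get}(X) & \end{array}$$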

With X = Y = 0 initially, no SC execution terminates with (a, b, c) = (0, 1, 0).

After factoring the two reads of X, the last read becomes let c = a, and the result (a, b, c) = (0, 1, 0) becomes possible.
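Both claims can be checked by enumerating SC interleavings of the reader before and after the transformation. This is a sketch under assumptions: the tuple encoding and the `("copy", dst, src)` instruction, which stands for the factored read let c = a, are hypothetical.

```python
def interleavings(p, q):
    """Yield every merge of sequences p and q that preserves each one's order."""
    if not p or not q:
        yield list(p) + list(q)
        return
    for rest in interleavings(p[1:], q):
        yield [p[0]] + rest
    for rest in interleavings(p, q[1:]):
        yield [q[0]] + rest

def run(schedule):
    """Execute one SC interleaving; return the final (a, b, c)."""
    mem, regs = {"X": 0, "Y": 0}, {}
    for op in schedule:
        if op[0] == "set":             # ("set", location, value)
            mem[op[1]] = op[2]
        elif op[0] == "get":           # ("get", register, location)
            regs[op[1]] = mem[op[2]]
        else:                          # ("copy", dst, src): no memory access
            regs[op[1]] = regs[op[2]]
    return regs["a"], regs["b"], regs["c"]

WRITER = [("set", "X", 1), ("set", "Y", 1)]
ORIGINAL = [("get", "a", "X"), ("get", "b", "Y"), ("get", "c", "X")]
OPTIMIZED = [("get", "a", "X"), ("get", "b", "Y"), ("copy", "c", "a")]

original_outcomes = {run(s) for s in interleavings(ORIGINAL, WRITER)}
optimized_outcomes = {run(s) for s in interleavings(OPTIMIZED, WRITER)}
```

The triple `(0, 1, 0)` is absent from `original_outcomes` and present in `optimized_outcomes`: the optimized program, even when run under SC, has a behavior that the source program does not have.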

## Compiler optimizations: loop-invariant code motion

A computation performed repeatedly at each loop iteration can be performed once before the loop:

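$$\begin{array}{lcl} & & t := j \times 10; \\ \text{for } i = 0 \text{ to } 99 \text{ do} & \rightsquigarrow & \text{for } i = 0 \text{ to } 99 \text{ do} \\ \quad A[i] := i + j \times 10 & & \quad A[i] := i + t \\ \text{done} & & \text{done} \end{array}$$

This can break code based on busy waiting:

$$\begin{array}{lcl} & & t := \text{get}(X); \\ \text{do } t := \text{get}(X) & \rightsquigarrow & \text{do skip} \\ \text{while } t = 0 & & \text{while } t = 0 \end{array}$$

After the transformation, X is read only once, so the loop never observes a later write to X by another process.

Turn optimizations off? Never!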

Inform the compiler of which memory accesses implement inter-process communications, so as to compile them specially:

- the volatile modifier (C, C++, Java)
- a library of low-level atomic operations (C/C++ 2011)
- etc.

Various atomic operations: read, write, fetch-and-add, compare-and-swap, ....

Each operation is annotated with the consistency model expected by the programmer:

| Annotation | Consistency model |
|------------|-------------------|
| `memory_order_seq_cst` | sequential consistency |
| `memory_order_acq_rel`<br>`memory_order_acquire`<br>`memory_order_release`<br>`memory_order_consume` | just enough for message passing |
| `memory_order_relaxed` | no guarantees beyond atomicity |

## $\mathsf{DRF} + \mathsf{CSL} = \heartsuit$

# Concurrent separation logic and the DRF guarantee

A property of a relaxed memory model:

If a program executes in the SC model without race conditions,

then it executes in the relaxed model exactly like in the SC model.

In other words: for a program free of race conditions, the relaxations of the memory model do not add more behaviors beyond those permitted by SC.

This "DRF guarantee" seems to hold / is claimed to hold for all the known memory models (hardware models + language models).

If a program comprising critical sections, atomic operations, and other synchronization devices executes in the SC model without race conditions,

then it executes in the relaxed model exactly like in the SC model,

provided the synchronization devices are correctly implemented.

"Correctly implemented" = with enough memory barriers and special instructions to rule out non-SC behaviors.

## Examples: implementing locks

|       | lock | unlock |
|-------|------|--------|
| x86   | `movl $1, %edx`<br>`.L2: movl %edx, %eax`<br>`xchgb (%rdi), %al`<br>`testb %al, %al`<br>`jne .L2` | `movb $0, (%rdi)` |
| Power | `.L2: lbarx 9,0,3`<br>`stbcx. 10,0,3`<br>`bne 0,.L2`<br>`isync`<br>`andi. 9,9,0xff`<br>`bne 0,.L2` | `lwsync`<br>`li 9,0`<br>`stb 9,0(3)` |

## A subtle point of the x86 implementation

Before 1999, the Linux kernel implemented unlock with an atomic instruction

```
lock; btr $0, (...)
```

instead of a nonatomic write

```
movb $0, (...)
```

A long discussion concluded that a nonatomic write is enough, because the x86 memory model is TSO.

The write of 0 in the lock is not immediately visible to other processes waiting for the lock. But when it becomes visible, all preceding writes are already visible, and all preceding reads have been performed.

A great many compiler optimizations are sound for programs that contain no race conditions.

(Sound = all behaviors of the optimized program are possible behaviors of the original program. Optimization did not introduce additional behaviors.)

## Compatibility with compiler optimizations

| Transformation                           | SC                                      | DRF guarantee | JMM          |
|------------------------------------------|-----------------------------------------|---------------|--------------|
| Trace-preserving transformations         | $\checkmark$                            | $\checkmark$  | $\checkmark$ |
| Reordering normal memory accesses        | ×                                       | $\checkmark$  | ×            |
| Redundant read after read elimination    | $\checkmark$                            | $\checkmark$  | ×            |
| Redundant read after write elimination   | $\checkmark$                            | $\checkmark$  | $\checkmark$ |
| Irrelevant read elimination              | $\checkmark$                            | $\checkmark$  | $\checkmark$ |
| Irrelevant read introduction             | $\checkmark$                            | ?             | ×            |
| Redundant write before write elimination | $\checkmark$                            | $\checkmark$  | $\checkmark$ |
| Redundant write after read elimination   | $\checkmark$                            | $\checkmark$  | ×            |
| Roach-motel reordering                   | $\times (\checkmark {\rm for \ locks})$ | $\checkmark$  | ×            |
| External action reordering               | ×                                       | $\checkmark$  | ×            |

J. Ševčík, *Program Transformations in Weak Memory Models*, PhD, 2008.

The assumption "the program has no race conditions in the SC model" is strong! How to establish it?

- Ad-hoc proof.
- Type system ($\approx$ Rust).
- Static analysis (Infer, etc).
- Deductive verification in concurrent separation logic!

Reminder (lecture #4): if  $J \vdash \{P\} c \{Q\}$ , then c executes without race conditions in an interleaving semantics that is equivalent to the SC model.

If a program is provable in concurrent separation logic (including critical sections, atomic sections, etc),

then it executes correctly in a relaxed memory model that respects the DRF guarantee,

provided that critical sections and atomic sections are correctly implemented.

We now show semantic soundness for concurrent separation logic in a variant of our PTR language extended with write buffers (TSO model):

Write buffers:  $s ::= \varepsilon \mid (\ell, v) \cdot s$ 

We consider whole-program configurations

$$((c_1/s_1) \parallel \cdots \parallel (c_n/s_n)) / h$$

composed of *n* processes  $c_1 \dots c_n$ , each with its own buffer  $s_i$ , plus a global heap *h*.

We also consider local, per-process configurations

c/s/h

At any time a process can perform the oldest write from its buffer:

$$c/s \cdot (\ell, v)/h \to c/s/h[\ell \leftarrow v]$$

The base language constructs have their usual semantics:

$$\begin{array}{ll} (\operatorname{let} x = a \operatorname{in} c)/s/h \to c[x \leftarrow \llbracket a \rrbracket]/s/h & \\ (\operatorname{let} x = c_1 \operatorname{in} c_2)/s/h \to (\operatorname{let} x = c'_1 \operatorname{in} c_2)/s'/h' & \text{if } c_1/s/h \to c'_1/s'/h' \\ (\operatorname{let} x = c_1 \operatorname{in} c_2)/s/h \to \operatorname{err} & \text{if } c_1/s/h \to \operatorname{err} \end{array}$$

Imperative constructs write to s (the buffer) and read from  $s \triangleright h$ , the heap *h* updated as described by s:

$$\varepsilon \rhd h = h \qquad ((\ell, v) \cdot s) \rhd h = (s \rhd h)[\ell \leftarrow v]$$

$$\begin{array}{ll} \operatorname{get}(a)/s/h \to (s \rhd h)(\llbracket a \rrbracket)/s/h & \text{if } \llbracket a \rrbracket \in Dom(s \rhd h) \\ \operatorname{set}(a, a')/s/h \to 0/(\llbracket a \rrbracket, \llbracket a' \rrbracket) \cdot s/h & \text{if } \llbracket a \rrbracket \in Dom(s \rhd h) \\ \operatorname{get}(a)/s/h \to \operatorname{err} & \text{if } \llbracket a \rrbracket \notin Dom(s \rhd h) \\ \operatorname{set}(a, a')/s/h \to \operatorname{err} & \text{if } \llbracket a \rrbracket \notin Dom(s \rhd h) \end{array}$$
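These rules translate almost directly into executable form. The sketch below is an illustration, not part of the formal development: a buffer is a Python list of `(location, value)` pairs with the newest write first, matching $(\ell, v) \cdot s$; the names `apply_buffer`, `set_`, `flush_oldest` and `MemoryFault` (standing for err) are hypothetical.

```python
class MemoryFault(Exception):
    """Stands for the err outcome: access to a location outside Dom(s ▷ h)."""

def apply_buffer(s, h):
    """Compute s ▷ h: replay the buffered writes on a copy of h, oldest first."""
    view = dict(h)
    for loc, val in reversed(s):       # s is newest-first, so iterate backwards
        view[loc] = val
    return view

def get(loc, s, h):
    """get(a)/s/h: read from the heap as updated by the local buffer."""
    view = apply_buffer(s, h)
    if loc not in view:
        raise MemoryFault(loc)
    return view[loc]

def set_(loc, val, s, h):
    """set(a, a')/s/h: push the write on the buffer; the heap h is unchanged."""
    if loc not in apply_buffer(s, h):
        raise MemoryFault(loc)
    return [(loc, val)] + s, h

def flush_oldest(s, h):
    """c/s·(l, v)/h -> c/s/h[l <- v]: perform the oldest write (s must be non-empty)."""
    *newer, (loc, val) = s
    return newer, {**h, loc: val}
```

For instance, after `set_("X", 1, [], h)` the writing process reads 1 from X through its buffer, while a process with an empty buffer still reads the old value from `h` until `flush_oldest` is applied.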

As in PTR, atomic sections execute in a single "big step":

$$\begin{array}{ll} \operatorname{atomic}(c)/s/h \to a/\varepsilon/h' & \text{if } c/s/h \xrightarrow{*} a/\varepsilon/h' \\ \operatorname{atomic}(c)/s/h \to \operatorname{err} & \text{if } c/s/h \xrightarrow{*} \operatorname{err} \end{array}$$

However, the buffer must be empty at the end of the atomic section ( $\approx$  there is a write barrier at the end).

When we create a resource invariant, the buffer must also be empty, hence the mkinv(c) construct:

 $\mathtt{mkinv}(c)/\varepsilon/h \to c/\varepsilon/h$ 

At each step we locally reduce one of the processes  $c_i/s_i$  from the parallel composition; the other processes are unchanged.

$$\frac{c_i/s_i/h \rightarrow c'/s'/h'}{(\cdots \parallel (c_i/s_i) \parallel \cdots) / h \rightarrow (\cdots \parallel (c'/s') \parallel \cdots) / h'} \qquad \frac{c_i/s_i/h \rightarrow \texttt{err}}{(\cdots \parallel (c_i/s_i) \parallel \cdots) / h \rightarrow \texttt{err}}$$

For PTR in the SC model, we decomposed the current heap *h* in three disjoint parts:

 $h = h_1 \uplus h_j \uplus h_f$ 

 $h_1$  is the private memory for *c*.

 $h_j$  is the shared memory accessible to atomic sections.

 $h_f$  is the "frame" memory, including the private memories of the processes that execute in parallel with c.

For PTR in the TSO model, we decompose the main heap *h* and the buffer s for the current process *c* as follows:

$$h = h_u \uplus h_j \qquad s \rhd h_u = h_1 \uplus h_f$$

The main heap h decomposes into shared memory  $h_j$  and unshared memory  $h_u$ .

The unshared memory  $h_u$ , updated according to the buffer s, decomposes into  $h_1$ , the private memory for c, and  $h_f$ , the frame.

Equivalent presentation:

$$s \rhd h = h_1 \uplus h_j \uplus h_f \quad \text{with } Dom(s) \cap Dom(h_j) = \emptyset$$

We define the semantic triple  $J \models \{\!\{ P \}\!\}\ c\ \{\!\{ Q \}\!\}$  as follows:

$$J \models \{\!\{ P \}\!\} \mathsf{c} \{\!\{ Q \}\!\} \stackrel{def}{=} \forall n, h, P h \Rightarrow \texttt{Safe}^n \mathsf{c} h Q J$$

As in lecture #4, we have:

$$\frac{}{\operatorname{Safe}^{0}\ c\ h\ Q\ J} \qquad \frac{Q\ \llbracket a \rrbracket\ h}{\operatorname{Safe}^{n+1}\ a\ h\ Q\ J} \qquad \frac{(\forall a, c \neq a) \quad \cdots}{\operatorname{Safe}^{n+1}\ c\ h\ Q\ J}$$

The recursive case: one reduction step for c/s/h.

$$\frac{\begin{array}{l} \forall a, c \neq a \\[4pt] \forall s, h, h_j, h_f,\ s \rhd h = h_1 \uplus h_j \uplus h_f \land Dom(s) \cap Dom(h_j) = \emptyset \land J\ h_j \Rightarrow c/s/h \not\rightarrow \operatorname{err} \\[4pt] \forall s, h, h_j, h_f, c', s', h', \\ \quad s \rhd h = h_1 \uplus h_j \uplus h_f \land Dom(s) \cap Dom(h_j) = \emptyset \land J\ h_j \land c/s/h \to c'/s'/h' \\ \quad \Rightarrow \exists h'_1, h'_j,\ s' \rhd h' = h'_1 \uplus h'_j \uplus h_f \land Dom(s') \cap Dom(h'_j) = \emptyset \land J\ h'_j \land \operatorname{Safe}^{n}\ c'\ h'_1\ Q\ J \end{array}}{\operatorname{Safe}^{n+1}\ c\ h_1\ Q\ J}$$

It remains to show that this semantic triple  $J \models \{\!\{ P \}\!\}\ c\ \{\!\{ Q \}\!\}$  validates the rules of concurrent separation logic. The two interesting cases deal with atomic sections.

$$\frac{\mathsf{emp} \vdash \{ P \bigstar J \}\ c\ \{ \lambda v.\ Q\ v \bigstar J \}}{J \vdash \{ P \}\ \mathtt{atomic}\ c\ \{ Q \}}$$

At the end of the execution of c we have a decomposition  $\varepsilon \rhd h' = (h'_1 \uplus h'_j) \uplus \emptyset \uplus h_f$  that we rewrite as  $\varepsilon \rhd h' = h'_1 \uplus h'_j \uplus h_f$. It is crucial that the final buffer s' is  $\varepsilon$, otherwise the constraint  $Dom(s') \cap Dom(h'_j) = \emptyset$  could not be satisfied.

Consider adding an invariant J' to J:

$$\frac{J \bigstar J' \vdash \{ P \}\ c\ \{ Q \}}{J \vdash \{ P \bigstar J' \}\ \mathtt{mkinv}\ c\ \{ \lambda v.\ Q\ v \bigstar J' \}}$$

At the beginning of the execution, we have a decomposition  $s \triangleright h = (h_1 \uplus h'_j) \uplus h_j \uplus h_f$  that we rewrite as  $s \triangleright h = h_1 \uplus (h_j \uplus h'_j) \uplus h_f$ .

Here too we must enforce  $s = \varepsilon$  to satisfy  $Dom(s) \cap Dom(h_j \uplus h'_j) = \emptyset$ .

The mkinv construct forces the buffer to be empty.

A limitation: our formalization fails to justify the implementation of  $unlock(\ell) = atomic(set(\ell, 0))$  by a normal write, without flushing the store buffer.

This is an aspect of TSO that our formalization does not capture.

A strength: this makes our proof reusable for memory models that are more relaxed than TSO, in particular PSO (*Partial Store Ordering*), where this optimization of unlock is invalid.

In the PSO model, writes to different locations can be reordered, and therefore "leave" the write buffer in a different order than execution order:

$$c/s_1 \cdot (\ell, v) \cdot s_2/h \to c/s_1 \cdot s_2/h[\ell \leftarrow v] \quad \text{if } \ell \notin \text{Dom}(s_2)$$

Here, the write to  $\ell$  "overtakes" the writes in  $s_2$ .

This makes no difference for the soundness proof of the logic, since

$$(s_1 \cdot (\ell, v) \cdot s_2) \rhd h = (s_1 \cdot s_2) \rhd (h[\ell \leftarrow v]) \text{ if } \ell \notin Dom(s_2)$$
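The equation can be sanity-checked on random buffers. The sketch below is a test harness, not a proof; it assumes buffers are lists of `(location, value)` pairs with the newest write first, and the helper names are hypothetical.

```python
import random

def apply_buffer(s, h):
    """Compute s ▷ h: replay the buffered writes on a copy of h, oldest first."""
    view = dict(h)
    for loc, val in reversed(s):       # s is newest-first
        view[loc] = val
    return view

def overtaking_preserves_view(s1, write, s2, h):
    """Check (s1·(l,v)·s2) ▷ h == (s1·s2) ▷ h[l <- v] on one instance."""
    loc, val = write
    lhs = apply_buffer(s1 + [write] + s2, h)
    rhs = apply_buffer(s1 + s2, {**h, loc: val})
    return lhs == rhs

def check_random_instances(trials=1000, seed=0):
    """Random instances that satisfy the side condition l ∉ Dom(s2)."""
    rng = random.Random(seed)
    locs = ["X", "Y", "Z"]
    for _ in range(trials):
        h = {l: rng.randint(0, 3) for l in locs}
        loc, val = rng.choice(locs), rng.randint(0, 3)
        others = [l for l in locs if l != loc]
        s1 = [(rng.choice(locs), rng.randint(0, 3)) for _ in range(rng.randint(0, 3))]
        s2 = [(rng.choice(others), rng.randint(0, 3)) for _ in range(rng.randint(0, 3))]
        if not overtaking_preserves_view(s1, (loc, val), s2, h):
            return False
    return True
```

The side condition matters: with `s2 = [("X", 2)]` and the overtaking write `("X", 1)`, the two sides disagree on X, since the older write to X would be replayed after the newer one.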

# A logic for release-acquire

A write marked "release" guarantees that all preceding reads and writes have been performed.

(No reordering of *X*; *W*<sub>rel</sub> into *W*<sub>rel</sub>; *X*.)

A read marked "acquire" guarantees that all following reads and writes have not started yet.

(No reordering of *R*<sub>acq</sub>; *X* into *X*; *R*<sub>acq</sub>.)

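$$\begin{array}{l|l} \text{// preparing the message} & \text{// waiting for the message} \\ \text{// nonatomic (na) writes} & \text{// read } ready \text{ until} \neq 0 \\ \text{set}_{na}(msg, \ldots); & \text{while } \text{get}_{acq}(ready) = 0 \text{ do skip}; \\ \text{set}_{na}(msg + 1, \ldots); & \text{// accessing the message} \\ \text{set}_{na}(msg + 2, \ldots); & \text{// nonatomic reads} \\ \text{// sending the message} & \text{let } x = \text{get}_{na}(msg) \text{ in} \\ \text{set}_{rel}(ready, 1) & \ldots \end{array}$$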

A lightweight form of synchronization and resource transfer, without mutual exclusion.

Unlocking a lock is just one release write:

$$unlock(\ell) = \text{set}_{rel}(\ell, 0)$$

Locking a lock requires an atomic instruction such as Compare And Swap, marked "acquire":

$$lock(\ell) = \text{while } (\text{CAS}_{acq}(\ell, 0, 1) \neq 0) \text{ do skip}$$

We can improve performance with a busy-wait loop using relaxed reads:

$$\begin{array}{l} spin(\ell) = \text{while } (\text{get}_{rlx}(\ell) \neq 0) \text{ do skip} \\ lock(\ell) = \text{do } spin(\ell) \text{ while } (\text{CAS}_{acq}(\ell, 0, 1) \neq 0) \end{array}$$

For TSO architectures like x86: nothing to do!

- Ordinary writes have release semantics.
- Ordinary loads have acquire semantics.

For more relaxed architectures like Power and ARM:

- Memory barriers that are less costly than the barriers needed to guarantee SC.

A concurrent separation logic for a fragment of the low-level atomics from C/C++ 2011.

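$$\begin{array}{rcll} c & ::= & a & \text{pure expression} \\ & \mid & \text{let } x = c \text{ in } c' & \text{sequencing and binding} \\ & \mid & \text{if } a \text{ then } c_1 \text{ else } c_2 & \text{conditional} \\ & \mid & \text{repeat } c & \text{repeat until not 0} \\ & \mid & c_1 \parallel c_2 & \text{parallel execution} \\ & \mid & \text{alloc}() & \text{allocation} \\ & \mid & \text{get}_X(a) & \text{memory read} \\ & \mid & \text{set}_Y(a, a') & \text{memory write} \\ & \mid & \text{CAS}_{Z,X}(a, a', a'') & \text{Compare And Swap} \\[4pt] X & ::= & sc \mid acq \mid rlx \mid na & \text{type of read} \\ Y & ::= & sc \mid rel \mid rlx \mid na & \text{type of store} \\ Z & ::= & sc \mid rel\_acq \mid acq \mid rel & \text{type of CAS} \end{array}$$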

### Assertions, preconditions:

Postconditions, resource invariants:

 $Q, \Phi ::= \lambda v. P$ 

#### Nonatomic reads and writes follow standard separation logic:

$$\begin{array}{c} \{ \mathtt{emp} \}\ \text{alloc}()\ \{ \lambda \ell.\ Uninit(\ell) \} \\ \{ \ell \stackrel{\pi}{\mapsto} v \}\ \text{get}_{na}(\ell)\ \{ \lambda x.\ \langle x = v \rangle \bigstar \ell \stackrel{\pi}{\mapsto} v \} \\ \{ Uninit(\ell) \lor \ell \stackrel{1}{\mapsto} \_ \}\ \text{set}_{na}(\ell, v)\ \{ \lambda \_.\ \ell \stackrel{1}{\mapsto} v \} \end{array}$$

(The role of *Uninit* is to prevent us from reading from a freshly allocated memory location that has not been initialized yet.)

 $Rel(\ell, \Phi)$  grants permission to write to location  $\ell$  a value v provided we have the resource  $\Phi v$ .

 $Acq(\ell, \Phi)$ , in conjunction with  $Init(\ell)$ , grants permission to read from location  $\ell$ , obtaining a value v and the resource  $\Phi v$ .

$$\{ emp \} alloc() \{ \lambda \ell. Rel(\ell, \Phi) * Acq(\ell, \Phi) \}$$
$$\{ Rel(\ell, \Phi) * \Phi v \} set_{rel}(\ell, v) \{ Rel(\ell, \Phi) * Init(\ell) \}$$
$$\{ Acq(\ell, \Phi) * Init(\ell) \} get_{acq}(\ell) \{ \lambda v. \Phi v * Acq(\ell, \Phi[v \leftarrow emp]) \}$$

We can read the same value multiple times, but the second and subsequent reads transfer no resources:

$$\Phi[\mathsf{v} \leftarrow \mathtt{emp}] \stackrel{\mathit{def}}{=} \lambda \mathsf{v}'. \ \mathtt{if} \ \mathsf{v}' = \mathsf{v} \ \mathtt{then} \ \mathtt{emp} \ \mathtt{else} \ \Phi \ \mathsf{v}'.$$

We have a (nonatomic) buffer *b* and an (atomic) flag *x*. We take  $\Phi = \lambda v$ . if v = 0 then emp else  $b \stackrel{1}{\mapsto} 53$ .

$$\begin{array}{l} \text{let } x = \text{alloc}() \text{ in let } b = \text{alloc}() \text{ in } \text{set}_{rel}(x, 0); \\ \{ Uninit(b) * Rel(x, \Phi) * Init(x) * Acq(x, \Phi) \} \end{array}$$

$$\begin{array}{l|l} \{ Uninit(b) * Rel(x, \Phi) \} & \{ Init(x) * Acq(x, \Phi) \} \\ \text{set}_{na}(b, 53); & \text{repeat } \text{get}_{acq}(x); \\ \{ b \stackrel{1}{\mapsto} 53 * Rel(x, \Phi) \} & \{ \exists v \neq 0,\ \Phi\ v \} \\ \Rightarrow \{ \Phi\ 1 * Rel(x, \Phi) \} & \Rightarrow \{ b \stackrel{1}{\mapsto} 53 \} \\ \text{set}_{rel}(x, 1) & \text{let } n = \text{get}_{na}(b) \text{ in} \\ & \{ b \stackrel{1}{\mapsto} 53 * \langle n = 53 \rangle \} \end{array}$$
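As a worked instance of the $\Phi[v \leftarrow \mathtt{emp}]$ update on this example, assume the reader's acquire read returns 1. Unfolding the definitions with the invariant $\Phi$ chosen above gives:

$$\begin{array}{rcl} \Phi & = & \lambda v.\ \mathtt{if}\ v = 0\ \mathtt{then}\ \mathtt{emp}\ \mathtt{else}\ b \stackrel{1}{\mapsto} 53 \\ \Phi[1 \leftarrow \mathtt{emp}] & = & \lambda v'.\ \mathtt{if}\ v' = 1\ \mathtt{then}\ \mathtt{emp}\ \mathtt{else}\ \Phi\ v' \\ \Phi[1 \leftarrow \mathtt{emp}]\ 1 & = & \mathtt{emp} \\ \Phi[1 \leftarrow \mathtt{emp}]\ 0 & = & \Phi\ 0 \ = \ \mathtt{emp} \end{array}$$

So the first read of 1 yields $\Phi\ 1 = b \stackrel{1}{\mapsto} 53$ together with $Acq(x, \Phi[1 \leftarrow \mathtt{emp}])$, and any later acquire read of 1 yields only $\mathtt{emp}$: ownership of b is transferred exactly once.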

Write permissions can be duplicated:

$$Init(\ell) = Init(\ell) * Init(\ell) \qquad Rel(\ell, \Phi) = Rel(\ell, \Phi) * Rel(\ell, \Phi)$$

Read permissions can be split:

$$Acq(\ell, \lambda v. \Phi_1 v \star \Phi_2 v) = Acq(\ell, \Phi_1) \star Acq(\ell, \Phi_2)$$

Example: one writer, two readers.

$$\begin{array}{l|l|l} \operatorname{set}_{na}(a,13); & \operatorname{repeat} \operatorname{get}_{acq}(x); & \operatorname{repeat} \operatorname{get}_{acq}(x); \\ \operatorname{set}_{na}(b,17); & \{ a \stackrel{1}{\mapsto} 13 \} & \{ b \stackrel{1}{\mapsto} 17 \} \\ \operatorname{set}_{rel}(x,1); & & \end{array}$$

 $\{ emp \} alloc() \{ \lambda \ell. Rel(\ell, \Phi) \star RMWAcq(\ell, \Phi) \}$ 

RMWAcq permissions can be duplicated:

 $\mathsf{RMWAcq}(\ell, \Phi) \star \mathsf{RMWAcq}(\ell, \Phi) = \mathsf{RMWAcq}(\ell, \Phi)$ 

The rule for $\text{CAS}_{X,rlx}$:

$$\frac{\begin{array}{c} P \Rightarrow Init(\ell) * RMWAcq(\ell, \Phi) * \mathtt{true} \qquad P * \Phi\ v \Rightarrow Rel(\ell, \Psi) * \Psi\ v' * R\ 1 \qquad P \Rightarrow R\ 0 \\ X \in \{rel, rlx\} \Rightarrow \Phi\ v = \mathtt{emp} \qquad X \in \{acq, rlx\} \Rightarrow \Psi\ v' = \mathtt{emp} \end{array}}{\{ P \}\ \text{CAS}_{X,rlx}(\ell, v, v')\ \{ R \}}$$

A logic for relaxed accesses

Intuition: like a release write or an acquire load, but without resource transfer.

$$\{ Acq(\ell, \Phi) \}\ \text{get}_{rlx}(\ell)\ \{ \lambda v.\ \langle \Phi\ v \neq \mathtt{false} \rangle \}$$

$$\frac{\Phi\ v = \mathtt{emp} \quad \text{(i.e. } \Phi\ v \text{ is pure and true)}}{\{ Rel(\ell, \Phi) \}\ \text{set}_{rlx}(\ell, v)\ \{ Rel(\ell, \Phi) \}}$$

A modest application: control the set of all possible values for location  $\ell$ , for instance  $\Phi = \lambda v$ .  $\langle 0 \le v \le 10 \rangle$ .

Vafeiadis & Narayan point out that these rules are incorrect for C/C++ 2011, since relaxed accesses are allowed to produce values out of thin air.

$$\begin{array}{c|c} \operatorname{let} a = \operatorname{get}_{rlx}(X) \text{ in } & \operatorname{let} b = \operatorname{get}_{rlx}(Y) \text{ in } \\ \operatorname{set}_{rlx}(Y, a) & \operatorname{set}_{rlx}(X, b) \end{array}$$

Starting with X = Y = 0, we can (according to the C/C++ 2011 standard) end with X = Y = 1.

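According to Vafeiadis & Narayan's rules,  $\Phi = \lambda v$.  $\langle v = 0 \rangle$  is a correct invariant for X and for Y, so the logic would prove that X = Y = 0 at the end, excluding the outcome X = Y = 1 that the standard allows.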

A major risk: values out of thin air break type safety and open security holes.

A theoretical risk: no known architecture or compiler exhibits "out of thin air" behaviors.

A specification problem: axiomatic definitions (using event structures) of C11-style memory models have a hard time distinguishing between

- behaviors involving values out of thin air;
- speculative behaviors (of the "load buffering" kind) that are correct.

Kang et al (2017) describe an operational semantics for C/C++ 2011 atomics, of the (simplified) shape below.

1- Shared memory M = a set of write messages

A message is  $\langle \ell : v @ t \rangle$ , representing the write of value v at location  $\ell$  at timestamp t.

There is at most one message  $\langle \ell : v @ t \rangle$  for a given  $\ell$  and a given t.

- 2- A process = a view V of the shared memory ...

A view = a function location  $\rightarrow$  time at which the contents of this location was observed most recently.

Reading from location  $\ell$  = observing a message  $\langle \ell : v @ t \rangle$  with  $t \geq V(\ell)$ .

Writing v to location  $\ell$  = sending a message  $\langle \ell : v @ t \rangle$  with  $t > V(\ell)$  a fresh timestamp.

In both cases, V is updated to  $V[\ell \leftarrow t]$ .
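The message and view mechanics can be sketched in a few lines. This is a simplified illustration, not the semantics of Kang et al: messages are `(location, value, timestamp)` triples, views are dictionaries, the function names are hypothetical, and `write` always picks the timestamp just after the latest message, which is only one of the fresh timestamps the semantics allows.

```python
def possible_reads(memory, view, loc):
    """Messages <loc : v @ t> that a process with this view may observe (t >= V(loc))."""
    return {(v, t) for (l, v, t) in memory if l == loc and t >= view[loc]}

def read(memory, view, loc, t):
    """Observe the message at timestamp t; return its value and the updated view."""
    if t < view[loc]:
        raise ValueError("message is older than the current view")
    (value,) = [v for (l, v, ts) in memory if l == loc and ts == t]
    return value, {**view, loc: t}

def write(memory, view, loc, value):
    """Send a new message for loc; return the new memory and the updated view."""
    # Assumption: take the timestamp after the latest message, which exceeds V(loc).
    t = max(ts for (l, _, ts) in memory if l == loc) + 1
    return memory | {(loc, value, t)}, {**view, loc: t}

# Initially X holds 0 at timestamp 0 and both processes have observed it.
memory = {("X", 0, 0)}
view_a, view_b = {"X": 0}, {"X": 0}

memory, view_a = write(memory, view_a, "X", 1)        # process A writes 1
stale_or_fresh = possible_reads(memory, view_b, "X")  # B may still read 0, or read 1
value, view_b = read(memory, view_b, "X", 1)          # B chooses to observe the new message
after_observing = possible_reads(memory, view_b, "X") # B can no longer go back to 0
```

The point of the view is visible in the last two lines: a process may read a stale value, but once it has observed a newer message for a location it can never read an older one.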

## A promising semantics

- 3-...plus a set P of promises.

A promise = a speculative write = a message already in the shared memory, but which still needs to be realized by an actual write later in the process execution.

Invariant enforced at each reduction step: it is always possible to reduce so as to realize all promises. This prevents out-of-thin-air behaviors.

$$(c_1, V_1, P_1)/M_1 \longrightarrow (c_2, V_2, P_2)/M_2 \longrightarrow \cdots$$

## SLR: a separation logic for the promising semantics

(Svendsen et al, 2018.)

An extension of RSL with extra assertions:

 $O(\ell, v, t)$  (generalizes  $Init(\ell)$): I observed value v in location  $\ell$  at time t.

 $W^{\pi}(\ell, X)$  (generalizes  $\ell \stackrel{\pi}{\mapsto} v$): I have permission  $\pi$  on location  $\ell$.

 $X = \{(v_1, t_1), \dots, (v_n, t_n)\}$  is a set of timestamped writes to this location.

If  $\pi = 1$  (exclusive permission), X contains all the writes to  $\ell$  ever performed.

If  $\pi < 1$, X is a subset of these writes.

The assertion can be split:

$$W^{\pi_1+\pi_2}(\ell, X_1 \cup X_2) = W^{\pi_1}(\ell, X_1) * W^{\pi_2}(\ell, X_2)$$

Writes are consistent (unique value for a given timestamp):

$$W^{\pi}(\ell, X) \bigstar \langle (v, t) \in X \land (v', t') \in X \land v \neq v' \rangle \Rightarrow W^{\pi}(\ell, X) \bigstar \langle t \neq t' \rangle$$

All writes are observed:

$$W^{\pi}(\ell, X) * \langle (\mathbf{v}, t) \in X \rangle \Rightarrow W^{\pi}(\ell, X) * O(\ell, \mathbf{v}, t)$$

The converse is true if the permission is exclusive:

$$W^{1}(\ell, X) * O(\ell, v, t) \Rightarrow W^{1}(\ell, X) * O(\ell, v, t) * \langle (v, t) \in X \rangle$$

$$\frac{\Phi\ v = \mathtt{emp} \quad \text{(i.e. } \Phi\ v \text{ is pure and true)}}{\left\{\begin{array}{c} W^{\pi}(\ell, X) \\ * \operatorname{Rel}(\ell, \Phi) \\ * O(\ell, \_, t) \end{array}\right\} \operatorname{set}_{rlx}(\ell, v) \left\{\begin{array}{c} \lambda \_.\ \exists t' > t, \\ W^{\pi}(\ell, \{(v, t')\} \cup X) \end{array}\right\}}$$

As in RSL, a relaxed write transfers no resources ( $\Phi v = emp$ ).

The write is reflected in assertion  $W^{\pi}(\ell, X)$ , which does not need to be exclusive. ( $\pi < 1$  is allowed!)

 $O(\ell, \_, t)$  proves that  $\ell$  is initialized and gives a lower bound for the new timestamp t'.

$$\left\{ \begin{array}{l} Acq(\ell, \Phi) \\ * O(\ell, \_, t) \end{array} \right\} \operatorname{get}_{rlx}(\ell) \left\{ \begin{array}{l} \lambda v.\ \exists t' \ge t, \\ Acq(\ell, \Phi) * O(\ell, v, t') * \nabla(\Phi\ v) \end{array} \right\}$$

A relaxed read of value v gives access to the pure part  $\nabla(\Phi v)$  of the resource invariant  $\Phi v$ , and to a new observation  $O(\ell, v, t')$ .

If we own the full permission on  $\ell$ , the value read is determined by the most recent write.

$$\left\{ \begin{array}{c} Acq(\ell, \Phi) \\ \star W^{1}(\ell, X) \end{array} \right\} \operatorname{get}_{rlx}(\ell) \left\{ \begin{array}{c} \lambda v. \ \exists t, \langle (v, t) = \max(X) \rangle \\ \star Acq(\ell, \Phi) \star W^{1}(\ell, X) \star \nabla(\Phi v) \end{array} \right\}$$

## **Summary**

A realization: relaxed memory models such as those of Java or C/C++ are complicated and not fully understood yet.

*There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.* (C. A. R. Hoare)

A hope: this complexity affects only a handful of libraries; the bulk of parallel code is still written using conventional synchronization primitives.

A most necessary tool: program logics!

- To abstract over some of the complexity of the memory model (cf. RSL, SLR).
- To combine reasoning in standard separation logic with reasoning specific to a given memory model.

## References


An introduction to weakly-consistent memory models:

- S. V. Adve, H. J. Boehm, *Memory models: a case for rethinking parallel languages and hardware*, Comm. ACM, 2010.

The RSL and SLR program logics:

- V. Vafeiadis, C. Narayan, *Relaxed separation logic: a program logic for C11 concurrency*, OOPSLA 2013.
- K. Svendsen, J. Pichon-Pharabod, M. Doko, O. Lahav, V. Vafeiadis, *A separation logic for a promising semantics*, ESOP 2018.

The promising semantics for C/C++ 2011:

- J. Kang, C.-K. Hur, O. Lahav, V. Vafeiadis, D. Dreyer, *A promising semantics for relaxed-memory concurrency*, POPL 2017.