6. Week 06: Specifying Systems in TLA+¶

This week, we look at one of the most popular tools for specifying various kinds of discrete-event systems (including distributed ones), checking and proving their safety and liveness properties: Leslie Lamport’s TLA+. [1] We will be using it to paraphrase the formal description of protocols and systems from the previous lectures in a form that makes them amenable to automated lightweight verification by means of model checking. Such verification will not ensure the total absence of bugs, but will provide high-confidence guarantees of systems correctness by means of exhaustive checking the specification on a bounded state space.

Mandatory Reading

Read the paper How Amazon Web Services Uses Formal Methods, which motivates the use of TLA+ and similar tools in industry.

[1]	Did you really think Lamport won’t make yet another appearance in this class? :)

6.1. Software Setup¶

We will be using VSCode to write and check specifications in TLA+. Please, make sure to install Visual Studio Code and its TLA+ Extension, following the instructions on the linked pages.

The TLA+ toolbox can be also installed for command-line as a third-party software. For instance, it can be installed on macOS via Homebrew (see tla-plus-toolbox).

The examples used in the lecture are available in this GitHub repository. Open the contents of the repository as a new workspace and then open SimpleProgram.tla. Next, execute the TLA+: Check model with TLC action (via Ctrl-Shift-P or Cmd-Shift-P). If you see “Success” on the right, your setup is complete.

6.2. Learning TLA+¶

There is an abundance of online tutorials on TLA+, so below I give a list of those that streamline the introduction. Please, make sure to follow them outside of the class and work out the details necessary for the exercises and your projects:

Main Resources

The best starting point are Lamport’s introductory videos on TLA+.
In addition to the videos, check the corresponding chaptes of The TLA+ Hyperbook.
Another great introduction is Learn TLA e-book by Hillel Wayne.

Further Reading

Specifying Systems provides the specification for TLA+ framework, as well as some advanced examples and mathematical background.
An extensive collections of real-world examples from various resources can be found at this GitHub repository.
Practical TLA+: Planning Driven Development.

Advanced Topics

6.3. Specifying State Transition Systems in TLA+¶

Mandatory Reading (Watching)

The pragmatics and basics of TLA+ are given in Videos 1-7 of the video course. Additional details can be found in the first chapters of the Hyperbook.

Highlights:

In TLA, systems are defined as state machines with their changes described via steps. An execution is represented as a sequence of states, and a step is a change from one state to the next one.
As in our previous studies, specifications for systems describe all possible behaviours by means of (a) providing an initial state of the system and (b) defining steps for the system to evolve.
The control flow in ordinary programs are modelled by means of introducing program counters.
Both states and steps are defined as predicates on variables. Steps use “primed” variables do denote the values in the next state.
The TLC model checker performs state space exploration for verifying the specification.

The accompanying code repository provides speicications and safety invariants for the following case studies:

Jug Filling Problem from Die Hard 3 (file DieHard.tla).
Transaction Commit Specification (TCommit.tla); make sure you understand the invariants
Two-Phase Commit (TwoPhaseCommit.tla), as an instance of TPCommit, but now with some implementation details, namely, messages.
Single-Decree Paxos, as another instance of TPCommit. This protocols’s state space explodes easily, so only model-checked for very small values. To ensure that the framework is useful, try removing a2 from one of the quorums (this will also require adapting symmetry conditions).

Note

When checking the basic examples, such as SimpleProgram.tla, you might need to disable the TLC deadlock checker by adding the line CHECK_DEADLOCK FALSE to your model configuration file (*.cfg).

6.4. Temporal Reasoning, Stuttering, Liveness, and Fairness¶

TLA+ utilises temporal specifications for checking that an implementation (a low-level specification) implements a (high-level) specification. Make sure to check the Lecture 8 of TLA+ video course for details on how to define those.

Mandatory Reading (Watching)

The principles of tempotal reasoning in TLA+ are outlined in Videos 8-9 of the video course.

Highlights:

We revise the files TCommit.tla and TwoPhaseCommit.tla.
Variables in TLA+ are global (i.e., not scoped to a module they are defined). This is a feature that allows to check preservation of properties defined in a different module. For instance, both TwoPhaseCommit and TCommit define the variable rmState, which is, thus, considered the same in the statement of the theorem TPSpec => TCSpec. That is, the implication works as the spec on the right restricts only a subset of variables of the spec on the left.
Implication on specifications allows for stuttering, i.e., skipping the equivalen states in the (more refined) specification on the left.
Deadlock is identified as an infinite sequence of stuttering steps. All our systems so far allow for deadlocks as they admit stuttering steps when no involved variables change.
So far, our specs only tell what systems may do (hence the possible stuttering), but not what they must do.
TLA+ has normal lists with all usual operations. See SeqSpec.tla for an example. You can parse it and then run TLA+: Evaluate Expression action for an input such as Remove (3, <<1,2,3,4>>).
For the fair definition, consider the specification in ABSpec.tla.
Liveness property can always be satisfied by a behaviour by completing it with a number of states.
Weak fairness of action A asserts of a behavior: If A ever remains continuously enabled, then an A step must eventually occur. It’s written as WF_vars(A) in TLA+. The vars subscript ensures that this step is not stuttering, i.e., it will change some of the variables.
An implementation of an alternating bit protocol is given in AB.tla. It utilises sequences to represent message buffers. Notice that, according to the spec, A and B can keep sending messages faster than they get lost or received. There is no limit to how long the sequences AtoB and BtoA can be, so we might need to limit that using state constraints.
Trying weak fairness in the example shows that it’s not enough to prove the temporal specification: the weak fairness doesn’t guarantee that receive-messages, once enabled, will remain enabled forever. Therefore, the protocol can make no progress, because it gets enabled when the message is sent, but is disabled once it’s lost. Weak fairness of ARcv is true because it is never continuously enabled (hence the whole property is vacuous).
Strong fairness of action A asserts of a behavior: If A ever remains repeatedly enabled, then an A step must eventually occur. It’s written as SF_vars(A) in TLA+. Making sure that ARcv and BRcv are subject of strong liveness makes abstract temporal specification true.

6.6. Some Recent TLA+ Formalisations¶

Almost all major distributed protocols have been specified in TLA+: among those are Paxos, Disk Paxos, Flexible Paxos, BFT, Abstract by Guerraoui et al, and Raft (see this paper for more references).

Here are some more modern developments (the list is courtesy of Jialin Li):