6. Week 06: Specifying Systems in TLA+

This week, we look at one of the most popular tools for specifying various kinds of discrete-event systems (including distributed ones), checking and proving their safety and liveness properties: Leslie Lamport’s TLA+. [1] We will be using it to paraphrase the formal description of protocols and systems from the previous lectures in a form that makes them amenable to automated lightweight verification by means of model checking. Such verification will not ensure the total absence of bugs, but will provide high-confidence guarantees of systems correctness by means of exhaustive checking the specification on a bounded state space.

Mandatory Reading

Read the paper How Amazon Web Services Uses Formal Methods, which motivates the use of TLA+ and similar tools in industry.

[1]Did you really think Lamport won’t make yet another appearance in this class? :)

6.1. Software Setup

We will be using VSCode to write and check specifications in TLA+. Please, make sure to install Visual Studio Code and its TLA+ Extension, following the instructions on the linked pages.

The TLA+ toolbox can be also installed for command-line as a third-party software. For instance, it can be installed on macOS via Homebrew (see tla-plus-toolbox).

The examples used in the lecture are available in this GitHub repository. Open the contents of the repository as a new workspace and then open SimpleProgram.tla. Next, execute the TLA+: Check model with TLC action (via Ctrl-Shift-P or Cmd-Shift-P). If you see “Success” on the right, your setup is complete.

6.2. Learning TLA+

There is an abundance of online tutorials on TLA+, so below I give a list of those that streamline the introduction. Please, make sure to follow them outside of the class and work out the details necessary for the exercises and your projects:

Main Resources

Further Reading

Advanced Topics

6.3. Specifying State Transition Systems in TLA+

Mandatory Reading (Watching)

The pragmatics and basics of TLA+ are given in Videos 1-7 of the video course. Additional details can be found in the first chapters of the Hyperbook.

Highlights:

  • In TLA, systems are defined as state machines with their changes described via steps. An execution is represented as a sequence of states, and a step is a change from one state to the next one.
  • As in our previous studies, specifications for systems describe all possible behaviours by means of (a) providing an initial state of the system and (b) defining steps for the system to evolve.
  • The control flow in ordinary programs are modelled by means of introducing program counters.
  • Both states and steps are defined as predicates on variables. Steps use “primed” variables do denote the values in the next state.
  • The TLC model checker performs state space exploration for verifying the specification.

The accompanying code repository provides speicications and safety invariants for the following case studies:

  • Jug Filling Problem from Die Hard 3 (file DieHard.tla).
  • Transaction Commit Specification (TCommit.tla); make sure you understand the invariants
  • Two-Phase Commit (TwoPhaseCommit.tla), as an instance of TPCommit, but now with some implementation details, namely, messages.
  • Single-Decree Paxos, as another instance of TPCommit. This protocols’s state space explodes easily, so only model-checked for very small values. To ensure that the framework is useful, try removing a2 from one of the quorums (this will also require adapting symmetry conditions).

Note

When checking the basic examples, such as SimpleProgram.tla, you might need to disable the TLC deadlock checker by adding the line CHECK_DEADLOCK FALSE to your model configuration file (*.cfg).

6.4. Temporal Reasoning, Stuttering, Liveness, and Fairness

TLA+ utilises temporal specifications for checking that an implementation (a low-level specification) implements a (high-level) specification. Make sure to check the Lecture 8 of TLA+ video course for details on how to define those.

Mandatory Reading (Watching)

The principles of tempotal reasoning in TLA+ are outlined in Videos 8-9 of the video course.

Highlights:

  • We revise the files TCommit.tla and TwoPhaseCommit.tla.
  • Variables in TLA+ are global (i.e., not scoped to a module they are defined). This is a feature that allows to check preservation of properties defined in a different module. For instance, both TwoPhaseCommit and TCommit define the variable rmState, which is, thus, considered the same in the statement of the theorem TPSpec => TCSpec. That is, the implication works as the spec on the right restricts only a subset of variables of the spec on the left.
  • Implication on specifications allows for stuttering, i.e., skipping the equivalen states in the (more refined) specification on the left.
  • Deadlock is identified as an infinite sequence of stuttering steps. All our systems so far allow for deadlocks as they admit stuttering steps when no involved variables change.
  • So far, our specs only tell what systems may do (hence the possible stuttering), but not what they must do.
  • TLA+ has normal lists with all usual operations. See SeqSpec.tla for an example. You can parse it and then run TLA+: Evaluate Expression action for an input such as Remove (3, <<1,2,3,4>>).
  • For the fair definition, consider the specification in ABSpec.tla.
  • Liveness property can always be satisfied by a behaviour by completing it with a number of states.
  • Weak fairness of action A asserts of a behavior: If A ever remains continuously enabled, then an A step must eventually occur. It’s written as WF_vars(A) in TLA+. The vars subscript ensures that this step is not stuttering, i.e., it will change some of the variables.
  • An implementation of an alternating bit protocol is given in AB.tla. It utilises sequences to represent message buffers. Notice that, according to the spec, A and B can keep sending messages faster than they get lost or received. There is no limit to how long the sequences AtoB and BtoA can be, so we might need to limit that using state constraints.
  • Trying weak fairness in the example shows that it’s not enough to prove the temporal specification: the weak fairness doesn’t guarantee that receive-messages, once enabled, will remain enabled forever. Therefore, the protocol can make no progress, because it gets enabled when the message is sent, but is disabled once it’s lost. Weak fairness of ARcv is true because it is never continuously enabled (hence the whole property is vacuous).
  • Strong fairness of action A asserts of a behavior: If A ever remains repeatedly enabled, then an A step must eventually occur. It’s written as SF_vars(A) in TLA+. Making sure that ARcv and BRcv are subject of strong liveness makes abstract temporal specification true.

6.5. Implementation Refinement

Mandatory Reading (Watching)

The (2-part) Lecture 10 of the video course explains refinement proofs in TLA+.

Highlights:

  • The substitution principle for temporal formulas only holds if they are “persistent”, i.e., hold at any state/step, as in \((\square~(v = e)) \Longrightarrow (f = (f[e/v]))\).

  • The new module AB2.tla is a modified version of AB.tla where messages can be corrupted and detected as such.

  • When we try to show that the fairness specification of AB2.tla implies fairness in the style of AB.tla, we notice that this is not true. The reason for that is explained in the comment right after the definition of FairSpec. One way to fix it is to introduce auxiliary boolean variables that state which messages are going to be corrupted, and which won’t. An example of such extension is given in a derived module AB2P.tla (check the comments for more details). Notice that the constraints for model-checking had to be redefined for tractability.

  • Importantly, the (low-level) AB2 protocol correctly implement AB (high-level), where the LoseStep of AB is implemented by CorruptMsg of AB2. The transformation from the “richer” state of AB2 (as it contains bad messages) to the state of AB is obtained by simply “erasing” bad messages, which is done by the (recursive) function RemoveBad defined in AB2H.tla. This module defines a so-called refinement mapping between states of AB2 and AB.

  • More specifically, in AB2H.tla, AtoB and BtoA are imaginary (auxiliary) variables that are added as an impedance matched to relate the behaviours of AB2 to those of AB.

  • An alternative way to define the refinement mapping is to follow an example in AB2.tla and AB.tla, in which the correspondence between their statements is established by building a refinement mapping that removes corrupted messages from the state of AB2 as done in line:

    AB == INSTANCE AB WITH AtoB <- RemoveBad(AtoB2), BtoA <- RemoveBad(BtoA2)
    
  • The result of establishing the refinement mapping is a “proof” that AB2 produces behaviours equivalent to those of AB modulo the corrupted messages.

6.6. Some Recent TLA+ Formalisations

Almost all major distributed protocols have been specified in TLA+: among those are Paxos, Disk Paxos, Flexible Paxos, BFT, Abstract by Guerraoui et al, and Raft (see this paper for more references).

Here are some more modern developments (the list is courtesy of Jialin Li):