6. Week 06: Specifying Systems in TLA+¶
This week, we look at one of the most popular tools for specifying various kinds of discrete-event systems (including distributed ones), checking and proving their safety and liveness properties: Leslie Lamport’s TLA+. [1] We will be using it to paraphrase the formal description of protocols and systems from the previous lectures in a form that makes them amenable to automated lightweight verification by means of model checking. Such verification will not ensure the total absence of bugs, but will provide high-confidence guarantees of systems correctness by means of exhaustive checking the specification on a bounded state space.
Mandatory Reading
Read the paper How Amazon Web Services Uses Formal Methods, which motivates the use of TLA+ and similar tools in industry.
[1] | Did you really think Lamport won’t make yet another appearance in this class? :) |
6.1. Software Setup¶
We will be using VSCode to write and check specifications in TLA+. Please, make sure to install Visual Studio Code and its TLA+ Extension, following the instructions on the linked pages.
The TLA+ toolbox can be also installed for command-line as a third-party software. For instance, it can be installed on macOS via Homebrew (see tla-plus-toolbox).
The examples used in the lecture are available in this GitHub repository. Open the contents of the
repository as a new workspace and then open SimpleProgram.tla
. Next, execute
the TLA+: Check model with TLC
action (via Ctrl-Shift-P
or
Cmd-Shift-P
). If you see “Success” on the right, your setup is complete.
6.2. Learning TLA+¶
There is an abundance of online tutorials on TLA+, so below I give a list of those that streamline the introduction. Please, make sure to follow them outside of the class and work out the details necessary for the exercises and your projects:
Main Resources
- The best starting point are Lamport’s introductory videos on TLA+.
- In addition to the videos, check the corresponding chaptes of The TLA+ Hyperbook.
- Another great introduction is Learn TLA e-book by Hillel Wayne.
Further Reading
- Specifying Systems provides the specification for TLA+ framework, as well as some advanced examples and mathematical background.
- An extensive collections of real-world examples from various resources can be found at this GitHub repository.
- Practical TLA+: Planning Driven Development.
Advanced Topics
6.3. Specifying State Transition Systems in TLA+¶
Mandatory Reading (Watching)
The pragmatics and basics of TLA+ are given in Videos 1-7 of the video course. Additional details can be found in the first chapters of the Hyperbook.
Highlights:
- In TLA, systems are defined as state machines with their changes described via steps. An execution is represented as a sequence of states, and a step is a change from one state to the next one.
- As in our previous studies, specifications for systems describe all possible behaviours by means of (a) providing an initial state of the system and (b) defining steps for the system to evolve.
- The control flow in ordinary programs are modelled by means of introducing program counters.
- Both states and steps are defined as predicates on variables. Steps use “primed” variables do denote the values in the next state.
- The TLC model checker performs state space exploration for verifying the specification.
The accompanying code repository provides speicications and safety invariants for the following case studies:
- Jug Filling Problem from Die Hard 3 (file
DieHard.tla
). - Transaction Commit Specification (
TCommit.tla
); make sure you understand the invariants - Two-Phase Commit (
TwoPhaseCommit.tla
), as an instance ofTPCommit
, but now with some implementation details, namely, messages. - Single-Decree Paxos, as another instance of
TPCommit
. This protocols’s state space explodes easily, so only model-checked for very small values. To ensure that the framework is useful, try removinga2
from one of the quorums (this will also require adapting symmetry conditions).
Note
When checking the basic examples, such as SimpleProgram.tla
, you might
need to disable the TLC deadlock checker by adding the line CHECK_DEADLOCK
FALSE
to your model configuration file (*.cfg
).
6.4. Temporal Reasoning, Stuttering, Liveness, and Fairness¶
TLA+ utilises temporal specifications for checking that an implementation (a low-level specification) implements a (high-level) specification. Make sure to check the Lecture 8 of TLA+ video course for details on how to define those.
Mandatory Reading (Watching)
The principles of tempotal reasoning in TLA+ are outlined in Videos 8-9 of the video course.
Highlights:
- We revise the files
TCommit.tla
andTwoPhaseCommit.tla
. - Variables in TLA+ are global (i.e., not scoped to a module they are defined).
This is a feature that allows to check preservation of properties defined in a
different module. For instance, both
TwoPhaseCommit
andTCommit
define the variablermState
, which is, thus, considered the same in the statement of the theoremTPSpec => TCSpec
. That is, the implication works as the spec on the right restricts only a subset of variables of the spec on the left. - Implication on specifications allows for stuttering, i.e., skipping the equivalen states in the (more refined) specification on the left.
- Deadlock is identified as an infinite sequence of stuttering steps. All our systems so far allow for deadlocks as they admit stuttering steps when no involved variables change.
- So far, our specs only tell what systems may do (hence the possible stuttering), but not what they must do.
- TLA+ has normal lists with all usual operations. See
SeqSpec.tla
for an example. You can parse it and then runTLA+: Evaluate Expression
action for an input such asRemove (3, <<1,2,3,4>>)
. - For the fair definition, consider the specification in
ABSpec.tla
. - Liveness property can always be satisfied by a behaviour by completing it with a number of states.
- Weak fairness of action A asserts of a behavior: If A ever remains
continuously enabled, then an A step must eventually occur. It’s written as
WF_vars(A)
in TLA+. Thevars
subscript ensures that this step is not stuttering, i.e., it will change some of the variables. - An implementation of an alternating bit protocol is given in
AB.tla
. It utilises sequences to represent message buffers. Notice that, according to the spec,A
andB
can keep sending messages faster than they get lost or received. There is no limit to how long the sequencesAtoB
andBtoA
can be, so we might need to limit that using state constraints. - Trying weak fairness in the example shows that it’s not enough to prove the
temporal specification: the weak fairness doesn’t guarantee that
receive-messages, once enabled, will remain enabled forever. Therefore, the
protocol can make no progress, because it gets enabled when the message is
sent, but is disabled once it’s lost. Weak fairness of
ARcv
is true because it is never continuously enabled (hence the whole property is vacuous). - Strong fairness of action A asserts of a behavior: If A ever remains
repeatedly enabled, then an A step must eventually occur. It’s written as
SF_vars(A)
in TLA+. Making sure thatARcv
andBRcv
are subject of strong liveness makes abstract temporal specification true.
6.5. Implementation Refinement¶
Mandatory Reading (Watching)
The (2-part) Lecture 10 of the video course explains refinement proofs in TLA+.
Highlights:
The substitution principle for temporal formulas only holds if they are “persistent”, i.e., hold at any state/step, as in \((\square~(v = e)) \Longrightarrow (f = (f[e/v]))\).
The new module
AB2.tla
is a modified version ofAB.tla
where messages can be corrupted and detected as such.When we try to show that the fairness specification of
AB2.tla
implies fairness in the style ofAB.tla
, we notice that this is not true. The reason for that is explained in the comment right after the definition ofFairSpec
. One way to fix it is to introduce auxiliary boolean variables that state which messages are going to be corrupted, and which won’t. An example of such extension is given in a derived moduleAB2P.tla
(check the comments for more details). Notice that the constraints for model-checking had to be redefined for tractability.Importantly, the (low-level)
AB2
protocol correctly implementAB
(high-level), where theLoseStep
ofAB
is implemented byCorruptMsg
ofAB2
. The transformation from the “richer” state ofAB2
(as it contains bad messages) to the state ofAB
is obtained by simply “erasing” bad messages, which is done by the (recursive) functionRemoveBad
defined inAB2H.tla
. This module defines a so-called refinement mapping between states ofAB2
andAB
.More specifically, in
AB2H.tla
,AtoB
andBtoA
are imaginary (auxiliary) variables that are added as an impedance matched to relate the behaviours ofAB2
to those ofAB
.An alternative way to define the refinement mapping is to follow an example in
AB2.tla
andAB.tla
, in which the correspondence between their statements is established by building a refinement mapping that removes corrupted messages from the state ofAB2
as done in line:AB == INSTANCE AB WITH AtoB <- RemoveBad(AtoB2), BtoA <- RemoveBad(BtoA2)
The result of establishing the refinement mapping is a “proof” that
AB2
produces behaviours equivalent to those ofAB
modulo the corrupted messages.
6.6. Some Recent TLA+ Formalisations¶
Almost all major distributed protocols have been specified in TLA+: among those are Paxos, Disk Paxos, Flexible Paxos, BFT, Abstract by Guerraoui et al, and Raft (see this paper for more references).
Here are some more modern developments (the list is courtesy of Jialin Li):
- There is more consensus in Egalitarian parliaments (EPaxos)
- Just Say NO to Paxos Overhead: Replacing Consensus Overhead with Network Ordering
- Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control
- Harmonia: Near-Linear Scalability for Replicated Storage with In-Network Conflict Detection
- Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories