1. Week 01: Introduction and Basic Notions

According to famous quote by Leslie Lamport, “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” How do we make sure that the system does not fail in some unpredictable ways, and if it does fail we can still be sure that “nothing bad” has happened?

To answer these questions with the full mathematical rigour, in this module we will study the application of formal methods to specification and verification of correctness-critical distributed systems.

1.1. Terminology

Mandatory Reading

Please, refer to this manuscript by Google for the main terminology about distributed systems.

Highlights

You should get an idea of the following notions

  • messages, protocols, replicas (individual system nodes)
  • pragmatics of implementing computations in a distributed manner
  • failure modes (halting, failstop, omission failure)
  • Eight fallacies of distributed system design

1.2. What can be specified about Distributed Systems

When talking about “formally specifying” a distributed system, we usually mean proving one of the following properties about its global executions:

  • Safety Properties
    • Nothing bad happens ever
    • If is violated, this is done by a finite execution trace
    • Example: mutual exclusion (no more than one process holds a critical resource)
    • Proving safety almost always requires one to discover a system invariant, i.e., a property that holds no matter changes take place in the system.
  • Liveness Properties
    • Something good happens eventually
    • Cannot be violated by a finite computation. The intuition is that intuition we can always run longer and see what happens.
    • Example: no deadlock (some process always makes a progress)

Examples Safety of Liveness and Properties (Quiz)

Which of the following statements defines a safety property and which is liveness?

  1. Patrons are served in the order they arrive.
  2. Anything that can go wrong, will go wrong.
  3. No one expects the Spanish Inquisition.
  4. Two things are certain: death and taxes.
  5. As soon as one is born, one starts dying.

Formal Methods Toolset

  • Model checking (need a finite-state model)
  • Symbolic execution (abstracting away some concrete details, replacing them with logical facts)
  • Deductive verification (applying a set of logical inference rules)

Mandatory Reading

This paper by Amazon outlines the main domains in which formal verification has been used to facilitate the design and evolution of distributed systems engineering.

1.3. Synchrony vs Asynchrony

The way we reason about properties of a distributed system or attempt to develop a mental model or abstraction for the system more or less depends on the nature of distributed system — whether synchronous or asynchronous.

A synchronous distributed system comes with strong guarantees about properties and nature of the system. Because the system makes strong guarantees, it usually comes with strong assumptions and certain constraints.

Synchronous nature by itself is multi-faceted, and the following points will elaborate more on this:

  • Upper Bound on Message Delivery. There is a known upper bound on message transmission delay from one process to another process OR one machine/node to another machine/node. Messages are not expected to be delayed for arbitrary time periods between any given set of participating nodes.
  • Ordered Message Delivery. The communication channels between two machines are expected to deliver the messages in FIFO order. It means that the network will never deliver messages in an arbitrary or random order that can’t be predicted by the participating processes.
  • Notion of Globally Synchronized Clocks. Each node has a local clock, and the clocks of all nodes are always in sync with each other. This makes it trivial to establish a global real time ordering of events not only on a particular node, but also across the nodes.
  • Lock Step Based Execution. The participating processes execute in lock-step. An example will make it more clear. Consider a distributed system having a coordinator node that dispatches a message to other follower nodes, and each follower node is expected to process the message once the message is received. It cannot be the case that different follower nodes process the input message independently at different times and thus generate output state at different times. This is why we say processes execute in lock step synchrony a la lock step marching.

The main thing to remember about synchronous systems is that they allow us to make assumptions about time and order of events in a distributed system. This comes from the fact that clocks are in sync and there is a hard upper bound on message transmission delay between nodes.

The problem with synchronous distributed systems is that they are not really practical. Any software system based on strong assumptions tends to be less robust in real world settings and begins to break in practical/common workloads. For example, relying on the network that it is definitely going to deliver the message in a fixed amount of time is not really a practical assumption. In real world, software system is subjected to multiple kinds of failure.

Asynchronous Distributed System model is more suitable for real world scenarios since it does not make any strong assumptions about time and order of events in a distributed system.

  • Clock may not be accurate or out of sync. Clocks of different nodes in a distributed system can drift apart. Thus it is not at all trivial to reason about the global real time ordering of events across all the machines in the cluster. Machine local timestamps will no longer help here since the clocks are no longer assumed to be always in sync.
  • Messages can be delayed for arbitrary period of times. Unlike synchronous distributed system, there is no known upper limit on message transmission delay between nodes.

Asynchronous distributed system is tough to understand since it is not based on strong assumptions and does not really impose any constraints on time and ordering of events. It is also tough to design and implement such a system since the algorithms should tolerate different kinds of failures. Our algorithms can no longer be designed to handle only a subset of failure conditions by ruling out some failure scenarios using strong assumptions. The onus and challenge of developing robust distributed algorithms is more in asynchronous distributed system.

Choosing the right model for formulating a distributed system is a critical step for formal specification and verification, as it has to strike a balance between the complexity of the system description (hairy definitions are more difficult to validate) and its adequacy (if the formal description is too simple, it may not correspond to an actual implementation).

In the remainder of this lecture, we will review several foundational results that stem from the asynchronous nature of practical distributed systems.

1.4. Time, Causality, and Logical Clocks

Mandatory Reading

Please, refer to the seminal 1978 paper by Lamport for the details of the logical clock construction.

Highlights:

  • Physical time is a poor way to define the relative order of events taking place at different spatial parts of an asynchronous system.
  • In such systems, some events might be in a happens-before relation (aka being causal) or can be concurrent.
  • Happens-before relation defines causality.
  • The builds on the ide of logical clocks to restore the absolute order of events in the system.
  • Clock condition: how to build logical clocks that are consistent wrt. HB relation.
  • Idea: model the global time with “ticks” between event. Problem: not clear how many “ticks” to skip between the events. Solution: make messages carry timestamps and advance the counters. The paper proposes two Implementation Rules, which need to be satisfied in order for the Clock Condition to hold.
  • Logical clocks can help solving the mutual exclusion problem. Essentially, the lock is granted to the process who has made the earliest request, got acknowledgements from everyone, and anyone having holding the lock earlier released it (by means of sending the corresponding messages).
  • This allows to implement synchronisation: now we have total ordering of events in all processes. Also, this algorithm implements a simple replicated state machine. It’s not fault-tolerant.
  • The rest of the paper describes synchronisation of physical clocks.

Mandatory reading

Vector clocks improve on the idea of Lamport clocks.

  • Problem with Lamport clocks – cannot conclude causal relations. Vector clocks allow to have it in an opposite direction: V[a] < V[b] implies a happens before b. Thus, vector clocks give us a way of identifying which events are causally related.
  • Problems with vector clock: the size of the vectors grows.