10,000-foot view of Consensus in Distributed Systems

Siddheshwar Kumar
4 min read · Dec 30, 2019

Consensus is a fundamental problem in distributed systems: getting multiple servers to agree on a value. Once the servers reach a decision on a value, that decision is final. It is also referred to as non-Byzantine agreement or simply agreement. It's considered fundamental because many systems need it to function correctly (Kubernetes will not work properly if etcd, which relies on a consensus algorithm, fails). Consensus lets you get the reliability and performance of a distributed system without having to deal with the consequences of distribution (e.g. disagreement/divergence between nodes). This post gives a very high-level overview of consensus patterns.

Let’s look at some of the problems that consensus solves:

  • Atomic commit (deciding whether or not to commit a transaction to a database; used mostly in relational DBs and for distributed transactions across multiple microservices)
  • Leader/master election (for coordinating cluster state and driving replication)
  • Consensus-based replication (used by single-copy systems like etcd)
  • Atomic broadcast (which solves consensus, and vice versa)
  • Distributed locking (required for coordination; the purpose of a lock is to ensure that, among several nodes that might try to do the same piece of work, only one does it at a time; see the sketch just after this list)
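
To make the distributed-locking idea concrete, here is a minimal Go sketch (mine, not taken from any real lock service) in which several contending "nodes" race to take a lock via compare-and-swap, and only one wins the work. A real distributed lock (Chubby, etcd, ZooKeeper) would keep this state behind a consensus protocol and add leases and fencing tokens; all names below are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// owner holds the ID of the node currently holding the "lock"; 0 means free.
// A real lock service would keep this state behind a consensus protocol,
// not in local memory.
var owner int64

// tryAcquire attempts to take the lock with a compare-and-swap, mirroring
// the "only one node does the work" guarantee.
func tryAcquire(nodeID int64) bool {
	return atomic.CompareAndSwapInt64(&owner, 0, nodeID)
}

func main() {
	var wg sync.WaitGroup
	for id := int64(1); id <= 5; id++ {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			if tryAcquire(id) {
				fmt.Printf("node %d acquired the lock and does the work\n", id)
			} else {
				fmt.Printf("node %d lost the race and backs off\n", id)
			}
		}(id)
	}
	wg.Wait()
}
```
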
Consensus Patterns

Consensus algorithms vary in the types of faults they can handle or tolerate.

Non-Partition tolerant Approaches

This is the happy-path scenario where you assume all is well (no network split, an upper bound on latency, and nodes that won’t go rogue or get hacked). Because of these strong assumptions, these are considered weak consensus algorithms.

Consensus in this scenario is a relatively simple problem, and these algorithms are used mostly by relational DBs. Two-phase commit (2PC) is one such algorithm. 2PC strikes a decent balance between performance and fault tolerance, which is why it has been popular in relational databases. 2PC uses two phases to commit or abort a transaction:

  1. In the first phase (voting), the coordinator node sends the update to all the participants. Each participant processes the update, keeps the changes in a temporary area, and votes whether to commit or abort.
  2. In the second phase (decision), the coordinator decides the outcome and informs every participant. If all participants voted to commit, the update is taken from the temporary area and made permanent.
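
As an illustration, here is a minimal, hedged Go sketch of the coordinator’s side of 2PC, using in-memory stand-ins for participants. A real implementation would persist votes and decisions to stable storage and talk to participants over the network; the type and method names here are hypothetical.

```go
package main

import "fmt"

// participant is a hypothetical in-memory stand-in for a database node;
// a real participant would write its vote and staged data to stable storage.
type participant struct {
	name   string
	staged string
	data   string
}

// prepare is phase 1: stage the update and vote commit (true) or abort (false).
func (p *participant) prepare(update string) bool {
	p.staged = update
	return true // this sketch always votes yes
}

// commit is phase 2: promote the staged update to the permanent state.
func (p *participant) commit() { p.data = p.staged }

// abort is phase 2: throw the staged update away.
func (p *participant) abort() { p.staged = "" }

// runTwoPhaseCommit plays the coordinator: collect votes, then broadcast
// a single commit/abort decision to every participant.
func runTwoPhaseCommit(update string, ps []*participant) bool {
	for _, p := range ps { // phase 1 (voting)
		if !p.prepare(update) {
			for _, q := range ps { // any "no" vote forces a global abort
				q.abort()
			}
			return false
		}
	}
	for _, p := range ps { // phase 2 (decision)
		p.commit()
	}
	return true
}

func main() {
	ps := []*participant{{name: "a"}, {name: "b"}, {name: "c"}}
	fmt.Println("committed:", runTwoPhaseCommit("x=1", ps))
}
```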

2PC is prone to blocking, since a single node failure (participant or coordinator) blocks progress until that node has recovered. Recovery is often possible thanks to the second phase, during which other nodes are informed about the system state. Note that 2PC assumes that the data in stable storage at each node is never lost and that no node crashes forever. Data loss is still possible if the data in stable storage is corrupted in a crash.

The fundamental difficulty with 2PC is that, once the decision to commit has been made and communicated to replicas, the replicas act on the commit without checking whether every other replica got the message. This limitation of 2PC led to the more robust 3PC algorithm.

Check these links (2PC & 3PC) to read about them in more detail.

Partition & latency (non-Byzantine fault) tolerant Approaches

Partition-tolerant consensus algorithms are about maintaining single-copy consistency. By default, this is the class of algorithm we refer to when we say consensus algorithms. Paxos and Raft are the two most commonly used leader-based consensus protocols, in which a leader handles data updates and replication.

This class of algorithms relies on a majority vote (i.e. it requires a majority of nodes) rather than all of the nodes (as in 2PC). So a minority of nodes can be down, slow, or unreachable due to a network partition, and consensus can still be reached provided (N/2 + 1) of N nodes are up and accessible.
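
The quorum arithmetic is simple; the following Go snippet (illustrative only) shows the majority size and the number of failures a cluster of N nodes can tolerate.

```go
package main

import "fmt"

// quorum returns the minimum number of nodes (a strict majority) whose
// acknowledgement is needed before a value is considered decided.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d nodes: quorum = %d, tolerates %d failures\n",
			n, quorum(n), n-quorum(n))
	}
}
```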

The holy grail of this category of algorithms is Paxos. The Paxos algorithm was first described by Turing Award winner Leslie Lamport in 1990. It was incredibly hard to understand, so Lamport published a follow-up paper, Paxos Made Simple, in 2001. Google’s Chubby distributed lock service is one of the most popular Paxos implementations, and it’s widely used inside Google. The major problem with Paxos is (even now) that it’s very hard to understand. This led a group of researchers at Stanford to look for an alternative with understandability as one of the primary goals, and they came up with the Raft algorithm (In Search of an Understandable Consensus Algorithm).

To enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered.
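
As a rough illustration of the log-replication part, here is a highly simplified Go sketch in which a leader counts acknowledgements for a log entry and marks it committed once a majority of the cluster has stored it. Real Raft also tracks terms, heartbeats, election timeouts, and safety checks; the type and method names below are mine, not from the Raft paper.

```go
package main

import "fmt"

// leader is a hypothetical, stripped-down view of a Raft-style leader.
type leader struct {
	clusterSize int
	log         []string
	ackCounts   []int // ackCounts[i] = nodes (including the leader) storing entry i
	commitIndex int   // number of log entries known to be committed
}

// appendEntry is called by the leader for a new client command.
func (l *leader) appendEntry(cmd string) int {
	l.log = append(l.log, cmd)
	l.ackCounts = append(l.ackCounts, 1) // the leader itself has the entry
	return len(l.log) - 1
}

// recordAck is called when a follower confirms it has stored entry idx.
// Once a majority holds the entry, it becomes committed and can be applied.
func (l *leader) recordAck(idx int) {
	l.ackCounts[idx]++
	if l.ackCounts[idx] >= l.clusterSize/2+1 && idx+1 > l.commitIndex {
		l.commitIndex = idx + 1
	}
}

func main() {
	l := &leader{clusterSize: 5}
	idx := l.appendEntry("set x = 1")
	l.recordAck(idx) // follower 1
	l.recordAck(idx) // follower 2 -> majority of 5 reached
	fmt.Println("entries committed:", l.commitIndex)
}
```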

Byzantine-fault tolerant Approaches

This category of consensus algorithms covers the worst kinds of things that can happen to nodes in a cluster. One or more nodes could start cheating, either on their own or under the control of an external actor (like a hacker), or a bug (say, a null pointer exception) could cause the system to behave in an unpredictable manner.

Partition-tolerant consensus is already a difficult problem, and one best avoided if possible; if, on top of that, one or more nodes go rogue, things become incredibly difficult to understand and handle. Byzantine-fault tolerance is more complicated and also more expensive to implement, so these algorithms are rarely used in commercial systems.

Figure: Distributed data systems patterns.
