Replication: Distributed Data Systems Patterns

Siddheshwar Kumar
7 min read · Dec 10, 2019

Partitioning (another distributed data systems pattern) is about dividing the data set evenly and spreading it across multiple machines. What happens if a node drops out and we only have partitioning in place (i.e. a single copy of the data)? Read/write operations stop, because there is no data redundancy. The only real solution is to keep multiple copies, or replicas, of the data set.

Replication is about keeping a copy of the same data set on multiple machines. Each node that stores a copy is known as a replica. Replication helps to achieve high availability and durability. One of the most fundamental aspects of replication is whether it is synchronous (the leader waits for confirmation from all replicas before responding), asynchronous (it doesn't wait at all), or semi-synchronous (it waits only for a subset of replicas; see Kafka's in-sync replicas).
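
A minimal sketch of the difference, assuming a toy leader that ships each write to its followers and counts acknowledgements (the class and method names are made up for illustration, not any particular database's API):

```python
# Illustrative only: how many follower acknowledgements a write waits for.
# Real systems ship the change to followers in parallel; the sequential
# loop here just keeps the sketch short.

class Follower:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return True                      # acknowledge the change


class Leader:
    def __init__(self, followers, mode="semi_sync", min_acks=2):
        self.followers = followers       # replicas holding copies of the data
        self.mode = mode                 # "sync" | "async" | "semi_sync"
        self.min_acks = min_acks         # used only in semi_sync mode
        self.store = {}

    def write(self, key, value):
        self.store[key] = value          # apply locally first
        acks = sum(1 for f in self.followers if f.apply(key, value))
        if self.mode == "sync" and acks < len(self.followers):
            raise RuntimeError("a replica did not confirm the write")
        if self.mode == "semi_sync" and acks < self.min_acks:
            raise RuntimeError("too few in-sync replicas confirmed the write")
        return "ok"                      # async: respond without checking acks


leader = Leader([Follower(), Follower(), Follower()], mode="semi_sync", min_acks=2)
print(leader.write("user:1", "alice"))   # ok
```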

Let’s cover the commonly occurring replication patterns:

Leader-Based Replication

Used in: PostgreSQL (since version 9.0), MySQL, Oracle Data Guard, MongoDB, Couchbase, RethinkDB, Espresso, Kafka, RabbitMQ, Redis.

One of the replicas is designated as the leader (master or primary); the other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). Whenever the leader writes new data (after receiving a write request from a client), it also sends the data change to all of its followers as a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database by applying the changes in the same order.

When a client wants to read, it can query either the leader or any of the followers. However, writes are only accepted on the leader (followers are read-only from the client's point of view).
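
A toy sketch of these read and write paths, using an in-memory list as the replication log (all names here are illustrative):

```python
# Toy single-leader setup: the leader appends every change to an ordered
# log and streams it to the followers, which apply entries in the same
# order. Reads can be served by the leader or any follower; writes go
# only through the leader.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        _seq, key, value = entry
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.log = []                       # the replication log / change stream

    def write(self, key, value):
        entry = (len(self.log), key, value) # sequence number fixes the order
        self.log.append(entry)
        self.apply(entry)                   # update the leader's own copy
        for follower in self.followers:     # ship the change to every follower
            follower.apply(entry)


followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 42)
print(leader.read("x"), followers[0].read("x"))   # 42 42
```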

There is a single leader in this approach, and if the leader node is lost due to a crash or network partition, the cluster needs to promote one of the followers to be the new leader. Clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from it. This process is called failover. It can happen manually (by an administrator) or automatically (by the system). There is no foolproof way to determine that the leader has failed, so most systems simply use a timeout. The most up-to-date follower is usually chosen as the new leader, to minimize data loss. The system may be unavailable for writes from the moment of the actual failure until the remaining nodes detect it and failover completes.
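
A minimal sketch of this timeout-based detection, assuming followers track the leader's last heartbeat and their own applied log position (both names are hypothetical):

```python
import time

# Sketch of timeout-based failure detection: replicas track the last
# heartbeat received from the leader; once it is older than the timeout
# they assume the leader is dead and promote the most up-to-date follower
# (here, the one with the largest replicated log position).

LEADER_TIMEOUT_S = 10.0

def leader_failed(last_heartbeat_ts, now=None):
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > LEADER_TIMEOUT_S

def pick_new_leader(followers):
    # followers: list of (node_id, last_applied_log_position)
    return max(followers, key=lambda f: f[1])[0]

if leader_failed(last_heartbeat_ts=time.time() - 30):
    print(pick_new_leader([("f1", 118), ("f2", 125), ("f3", 121)]))   # f2
```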

Choosing a new leader is a consensus problem and is typically done through an election in which the new leader is chosen by a majority of the remaining replicas. If the old leader later comes back, it might still believe it is the leader (the split-brain problem), not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
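
One common way to make a stale leader step down is to tag every leadership change with an increasing epoch number and have replicas reject writes from older epochs (Raft's term numbers and Kafka's leader epochs follow this idea). A minimal, purely illustrative sketch:

```python
# Sketch of epoch fencing: every failover increments an epoch number, and
# replicas reject requests from any leader carrying an older epoch.

class ReplicaNode:
    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def accept_write(self, leader_epoch, key, value):
        if leader_epoch < self.current_epoch:
            return False                      # stale leader: refuse, it must step down
        self.current_epoch = leader_epoch     # recognize the newer leader
        self.data[key] = value
        return True


node = ReplicaNode()
print(node.accept_write(leader_epoch=2, key="x", value=1))   # True  (new leader)
print(node.accept_write(leader_epoch=1, key="x", value=9))   # False (old leader fenced off)
```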

Multi-Leader Replication

Used in: Calendar apps on personal devices (phone, laptop); real-time collaborative editing applications that let several people edit a document simultaneously (e.g. Google Docs, Etherpad, wiki pages)

Leader-based replication has one drawback: there is only one leader, and all writes must go through it. If clients can't connect to the leader, writes stop. This is why single-leader systems are not highly available for writes.

An obvious extension is to allow more than one node to accept writes. This is called multi-leader, master-master, or active/active replication. In a multi-leader configuration we can have a leader in each datacenter, so that the data system can tolerate the failure of an entire datacenter, or simply so that writes can be accepted closer to the user. Within each datacenter, regular leader-follower replication is used; between datacenters, each datacenter's leader replicates its changes to the leaders in the other datacenters.
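
A toy sketch of that topology, with one leader object per datacenter and an outbox queue standing in for asynchronous cross-datacenter replication (all names are illustrative, and local followers are omitted):

```python
from collections import deque

# Toy multi-leader topology: one leader per datacenter. A write is applied
# by the local leader and queued for asynchronous replication to the
# leaders of the other datacenters.

class DatacenterLeader:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []                  # leaders in the other datacenters
        self.outbox = deque()            # changes not yet shipped cross-DC

    def write(self, key, value):
        self.data[key] = value           # accepted locally right away
        self.outbox.append((key, value))

    def replicate_to_peers(self):        # would run in the background in reality
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.data[key] = value


us, eu = DatacenterLeader("us-east"), DatacenterLeader("eu-west")
us.peers, eu.peers = [eu], [us]
us.write("profile:7", "v1")              # eu-west is not yet aware of this write
us.replicate_to_peers()
print(eu.data["profile:7"])              # v1
```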

Write conflicts are the biggest problem of multi-leader replication.

In a multi-leader setup, two or more conflicting writes can both succeed, and the conflict is only detected asynchronously at some later point in time. By then, it may be too late to ask the user to resolve it.

Conflict detection can be made synchronous by waiting for the write to be replicated before telling the user that it was successful. However, doing so loses the main advantage of multi-leader replication: allowing each replica to accept writes independently. If we want synchronous conflict detection, we might as well use single-leader replication.

In the single-leader case, writes are applied in a sequential order: if there are several updates to the same field, the last write determines its final value. In a multi-leader configuration there is no defined ordering of writes, so it is not clear what the final value should be.
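
One widely used way to impose an ordering anyway is last write wins (LWW), which also comes up in the leaderless section below: tag every write with a timestamp (plus a node id as a tiebreaker) and keep the highest tag. A minimal sketch with made-up data:

```python
# Sketch of last-write-wins (LWW): every write carries a (timestamp, node_id)
# tag and the version with the highest tag wins. Simple, but the losing
# concurrent writes are silently discarded.

def lww_resolve(versions):
    # versions: list of (timestamp, node_id, value) written via different leaders
    return max(versions, key=lambda v: (v[0], v[1]))[2]

conflicting = [
    (1575964800.120, "leader-a", "title: Draft v1"),
    (1575964800.450, "leader-b", "title: Final"),
]
print(lww_resolve(conflicting))   # "title: Final" -- leader-a's update is lost
```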


Leaderless Replication

Used in: Dynamo, Riak, Cassandra, Voldemort (a.k.a. Dynamo-style databases)

In leader-based replication, if the leader is unavailable, writes don't work until failover completes. In leaderless replication, there is no failover.

This category of systems uses a quorum-based approach for replication, and consistency is achieved through object versioning. If there are n replicas, every write must be confirmed by w nodes to be considered successful, and every read must query at least r nodes. Normally, reads and writes are sent to all n replicas in parallel; the parameters w and r only determine how many successful responses we wait for. A common choice is to make n odd (typically 3 or 5) and set w = r = (n + 1)/2, rounded up. The values of n, w and r can be tuned to achieve the desired performance, durability, consistency, and availability SLAs.

w + r > n

Reads and writes that obey these values are called quorum reads and writes, and we expect them to return an up-to-date (consistent) value. We can think of r and w as the minimum number of votes required for a read or a write to be valid; in Dynamo-style databases they are configurable. Quorums are not necessarily majorities; all that matters is that the set of nodes a write goes to and the set a read comes from share at least one node. Meta-information that captures the version (or timestamp) of each write lets the reader identify the most recent value. In this model, the latency of a get (or put) operation is dictated by the slowest of the r (or w) replicas.
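
A small worked example of the quorum condition with n = 3, w = 2, r = 2, using a per-key version number as the meta-information (the layout of the replicas is invented for illustration):

```python
# Quorum sketch: with n = 3, w = 2, r = 2 we have w + r > n, so any r nodes
# we read from must overlap with the w nodes that accepted the latest write.
# Each stored value carries a version number so the reader can tell which
# response is the most recent.

N, W, R = 3, 2, 2
assert W + R > N                      # the quorum condition

replicas = [
    {"k": (2, "new")},                # version 2: received the latest write
    {"k": (2, "new")},                # version 2: received the latest write
    {"k": (1, "old")},                # version 1: one of the n - w that missed it
]

def quorum_read(key):
    # In a real system the R responses could come from any of the n replicas;
    # reading from the first R is just a simplification.
    responses = [rep[key] for rep in replicas[:R]]
    _version, value = max(responses)  # highest version wins
    return value

print(quorum_read("k"))               # "new"
```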

w + r ≤ n

The quorum condition is not satisfied. With n = 3, w = 2 and r = 1, there is still a possibility that the read is served by a node that did not receive the write. Reads and writes are still sent to all n nodes, but a smaller number of successful responses is required for the operation to succeed. This configuration can leave replicas diverged; on the upside, it allows lower latency and higher availability. Only when the number of reachable replicas falls below w or r does the database become unavailable for writing or reading, respectively.

The challenge with this approach is that it can lead to conflicting changes which must be detected and resolved; the design questions are who resolves the conflict and when. If resolution happens at write time (to keep reads simple), writes may be rejected whenever not all replicas can be reached. If the data store aims to be always writable (like Dynamo), conflict resolution has to happen at read time. There is then another choice of who resolves the conflict: the client application or the data store. If it is the client, it needs to merge the values, with or without input from its user. If it is the data store, a last-write-wins policy can be applied.
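
A minimal sketch of the client-side option, assuming the conflicting values are sets so that a union is an acceptable merge (the classic shopping-cart style example; tombstones for deletions are ignored):

```python
# Sketch of client-side conflict resolution at read time: the store hands
# back all conflicting versions ("siblings") and the application merges them.

def merge_siblings(siblings):
    merged = set()
    for sibling in siblings:
        merged |= sibling
    return merged

# Two carts written through different nodes during a partition:
siblings = [{"milk", "bread"}, {"milk", "eggs"}]
print(merge_siblings(siblings))   # {'milk', 'bread', 'eggs'} (order may vary)
```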

How Nodes Catch Up

The replication scheme should ensure that eventually all the data is copied to every replica. How do previously unavailable nodes and stale replicas catch up?

  • Read Repair: Clients can detect stale responses and write the newer value back to the stale replicas (see the sketch after this list). This approach works well for frequently read values.
  • Anti-entropy process: In addition, some databases have a background process that constantly looks for differences between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this process does not copy writes in any particular order, so there may be a significant delay before data is copied. Without it, rarely read values may be missing from some replicas and thus have reduced durability, because read repair only runs when a value is actually read by the application.
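
A minimal sketch of read repair, under the same versioning assumption as the quorum example above:

```python
# Read-repair sketch: after collecting responses, push the newest
# (version, value) back to any replica that returned an older one.

replicas = [
    {"k": (2, "new")},
    {"k": (1, "old")},    # stale replica
    {"k": (2, "new")},
]

def read_with_repair(key):
    responses = [(rep[key], rep) for rep in replicas]
    (latest_version, latest_value), _ = max(responses, key=lambda r: r[0])
    for (version, _), rep in responses:
        if version < latest_version:
            rep[key] = (latest_version, latest_value)   # repair in passing
    return latest_value

print(read_with_repair("k"))   # "new"
print(replicas[1]["k"])        # (2, 'new')  the stale copy has been repaired
```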

In a large cluster, it is likely that during a network partition a client can still connect to some nodes, just not to the nodes needed to assemble a quorum for a particular value. The tradeoffs in this case:

  • Return an error to all requests for which we cannot reach a quorum.
  • Accept the writes anyway and send them to some nodes that are reachable (a sloppy quorum). Once the network interruption is fixed, the writes are forwarded to the appropriate "home" nodes. This is called hinted handoff (sketched below).
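
A minimal sketch of hinted handoff, with invented node and method names:

```python
# Hinted-handoff sketch: a reachable stand-in node accepts the write and
# stores a hint naming the unreachable home node; once the home node is
# back, the hinted writes are forwarded to it and the hints are dropped.

class Node:
    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.data = {}
        self.hints = []                        # (home_node, key, value)

    def write_with_hint(self, home, key, value):
        self.data[key] = value                 # keep a local copy as well
        self.hints.append((home, key, value))

    def hand_off(self):
        remaining = []
        for home, key, value in self.hints:
            if home.reachable:
                home.data[key] = value         # deliver to the real home node
            else:
                remaining.append((home, key, value))
        self.hints = remaining


home, stand_in = Node("home"), Node("stand-in")
home.reachable = False
stand_in.write_with_hint(home, "k", "v")       # sloppy-quorum write during the outage
home.reachable = True
stand_in.hand_off()
print(home.data["k"])                          # "v"
```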

Sloppy quorums are optional in all common Dynamo implementations. In Riak, they are enabled by default, and in Cassandra and Voldemort, they are disabled by default.

Replication provides a context for many subproblems, such as leader election, failure detection, consensus, and atomic broadcast.

Thank you for reading! If you enjoyed it, please clap 👏 for it.
