Partitioning: Distributed Data Systems Patterns

Siddheshwar Kumar
5 min read · Dec 4, 2019

This post talks briefly about Partitioning, one of the patterns used in Distributed Data Systems (DDS). It is how a DDS applies the famous divide-and-conquer strategy to the problem of scale. Along with scale, partitioning also helps achieve fault tolerance and low latency. Partitioning is also referred to as Sharding.

There is a limit to the amount of data that can be put on a single node. And even if we could fit all the data on a single node, the cost of doing so may be very high, and a single node may not deliver the required performance (high throughput and low latency).

Partitioning is the process of dividing the data set into smaller, distinct and independent chunks and placing them over multiple nodes.

Partitioning

Let’s cover the techniques which are used to divide the data set:

1. Range Partitioning

Assign a continuous range of keys (from some minimum to some maximum) to each partition. Within each partition, keys can be kept in sorted order (as in SSTables and LSM-Trees), which makes range scans efficient. In practice, though, uniformly distributing keys across ranges is hard, and certain access patterns lead to hot spots (for example, timestamp keys concentrate all recent writes on the partition covering the latest time range).

This strategy is used by Bigtable, its open-source equivalent HBase, RethinkDB, and MongoDB.
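To make this concrete, here is a minimal sketch (not taken from any of the systems above) of how a router could map a key to a range partition. The partition boundaries, node names, and helper functions are all illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// rangePartition holds an inclusive lower bound; a key belongs to the
// partition with the largest lower bound that is <= the key.
type rangePartition struct {
	lowerBound string // inclusive start of the key range
	node       string // node that owns this range
}

// partitions must be kept sorted by lowerBound.
var partitions = []rangePartition{
	{lowerBound: "a", node: "node-1"},
	{lowerBound: "h", node: "node-2"},
	{lowerBound: "p", node: "node-3"},
}

// lookup finds the partition responsible for a key via binary search.
func lookup(key string) rangePartition {
	// Index of the first partition whose lowerBound is > key, minus one.
	i := sort.Search(len(partitions), func(i int) bool {
		return partitions[i].lowerBound > key
	})
	if i == 0 {
		return partitions[0]
	}
	return partitions[i-1]
}

func main() {
	fmt.Println(lookup("honda"))  // falls in the "h".."p" range -> node-2
	fmt.Println(lookup("toyota")) // falls in the "p".. range    -> node-3
}
```

Because keys that sort next to each other land on the same partition, a range scan such as "all keys between honda and hyundai" touches a single node.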

2. Hash Partitioning

The hash partitioning technique shards data based on the hash value of the key (rather than the key itself, as in the previous case). With a typical 32-bit hash function, the hash of a key (of any data type) falls in the range 0 to 2³² − 1.

The problem that hash-based partitioning tries to solve is distributing key-value pairs across servers so that they can be found again without any directory or lookup table. The obvious approach to achieve this is called mod-N hashing.

Given N servers, we hash the key with the hash function and take the resulting integer modulo N.

server := serverList[hash(key) % N]

The approach is quite simple and intuitive but has a drawback: if we change the number of servers (add or remove one), almost all keys have to be re-mapped. Ideally, adding or removing one node should move only about 1/N of the keys. Think of the effect this has on a huge data set, say 50 billion keys. Any change in the number of servers would cause a massive exchange of items across nodes, which would render the cluster almost non-functional.
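A quick way to see the problem is to count how many keys change servers when N changes. The snippet below is a rough illustration using an FNV hash; the hash choice and key format are arbitrary, not taken from any particular system.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey returns a 32-bit hash of the key (FNV-1a, purely for illustration).
func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

func main() {
	const oldN, newN = 4, 5
	total, moved := 100000, 0

	// Count how many keys map to a different server after adding one node.
	for i := 0; i < total; i++ {
		key := fmt.Sprintf("key-%d", i)
		if hashKey(key)%oldN != hashKey(key)%newN {
			moved++
		}
	}
	// With mod-N hashing roughly (N-1)/N of the keys move, not 1/N.
	fmt.Printf("moved %d of %d keys (%.0f%%)\n", moved, total, 100*float64(moved)/float64(total))
}
```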

This is where Consistent Hashing comes into play. It was introduced in 1997 and later popularized by Amazon's Dynamo.

Consistent Hashing is a distributed hashing technique that operates independently of the number of servers or objects in a distributed hash table. In short, don't map keys directly to a fixed list of servers.

This approach is used by DynamoDB, Cassandra, and Riak, as well as by systems like load balancers, where load needs to be distributed over multiple servers.

The basic idea is that each server node is mapped to a point on a circle using a hash function. To look up the server for a given key, we hash the key, find that point on the circle, and then scan clockwise until we reach the first point that belongs to a server. In practice, each server appears at multiple points on the circle; these extra points are called virtual nodes, or vnodes, and they help reduce the load variance among servers. This is the popular ring-based consistent hashing algorithm.
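Below is a minimal sketch of ring-based consistent hashing with virtual nodes, assuming a simple 32-bit hash; real systems layer replication and more careful data structures on top of this. The server names and vnode count are made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps hash points on a circle to server names.
type ring struct {
	points []uint32          // sorted hash points (vnodes) on the circle
	owner  map[uint32]string // hash point -> server
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places `vnodes` points per server on the circle.
func newRing(servers []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, s := range servers {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup hashes the key and walks clockwise to the first server point.
func (r *ring) lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) { // wrap around the circle
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"server-a", "server-b", "server-c"}, 100)
	fmt.Println(r.lookup("user:42"))
}
```

When a server is added or removed, only the keys that fall between its points and the previous points on the circle change ownership, which is roughly the 1/N movement we wanted.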

Please refer to this link if you want to read this in more detail.

3. Secondary Index Partitioning

The approaches discussed so far rely on a key-value data model. If records are only ever accessed via their primary key, we can determine the partition from the key and use it to route read and write requests to the partition responsible for that key. This section covers secondary indexes, which are used to search for occurrences of a particular value across multiple partitions; they don't return a unique record/document but rather a list of results.

Assume we have a data store that holds metadata about cars, with fields like color, model, etc. Now, what if we need to return all cars that are either red or black? This can be answered efficiently only if a secondary index exists on the color attribute/field, so secondary indexes need to be created beforehand (before the queries are run).

An index is basically a data structure that stores extra information to speed up searches. In a distributed database, the index data can be stored either locally (on the same node as the data it covers) or globally (partitioned across all nodes by the indexed value). Each approach has its own pros and cons.
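As a rough illustration of the local (document-partitioned) variant, the sketch below keeps a color index inside each partition and answers a query by scatter/gather across all partitions. The types and field names are invented for this example.

```go
package main

import "fmt"

// car is the record stored in the primary key-value store.
type car struct {
	id    string
	color string
}

// partition holds its own data plus a local secondary index on color.
type partition struct {
	data       map[string]car      // primary index: id -> car
	colorIndex map[string][]string // local secondary index: color -> ids
}

func newPartition() *partition {
	return &partition{data: map[string]car{}, colorIndex: map[string][]string{}}
}

func (p *partition) put(c car) {
	p.data[c.id] = c
	p.colorIndex[c.color] = append(p.colorIndex[c.color], c.id)
}

// queryByColor is a scatter/gather: every partition must be asked, because a
// local index only knows about the records stored on its own partition.
func queryByColor(parts []*partition, color string) []car {
	var out []car
	for _, p := range parts {
		for _, id := range p.colorIndex[color] {
			out = append(out, p.data[id])
		}
	}
	return out
}

func main() {
	p1, p2 := newPartition(), newPartition()
	p1.put(car{id: "1", color: "red"})
	p2.put(car{id: "2", color: "red"})
	p2.put(car{id: "3", color: "black"})
	fmt.Println(queryByColor([]*partition{p1, p2}, "red"))
}
```

A global index would instead partition the index itself by the indexed value, which makes reads cheaper (only one index partition is consulted) but makes writes touch multiple partitions.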

Secondary indexes are the bread and butter of relational databases, and they are common in document DBs too. Many key-value stores such as HBase avoided secondary indexes because of the added implementation complexity, but some, like Riak and Couchbase, have added them because they are very useful for data modeling. And secondary indexes are the sole reason why systems like Elasticsearch exist.

Rebalancing Partitions

The process of moving load from one node in the cluster to another is called rebalancing. It is required no matter which partitioning scheme is used. During rebalancing, reads and writes should continue to work as expected, and the network and disk I/O caused by the move should be kept to a minimum.

How does this work?

There is a simple solution: create many more partitions than there are nodes and assign several partitions to each node (Couchbase creates 1024 partitions, known as vBuckets). Now if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. Only entire partitions are moved between nodes. The number of partitions doesn't change, nor does the assignment of keys to partitions; the only thing that changes is the assignment of partitions to nodes.
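Here is a toy sketch of that scheme, assuming a fixed number of partitions and a simple "steal from the most loaded node" policy. The policy and numbers are illustrative, not what any specific database does.

```go
package main

import "fmt"

// assignment maps each partition id to the node that currently owns it.
type assignment map[int]string

// countByNode tallies how many partitions each node owns.
func countByNode(a assignment, nodes []string) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		counts[n] = 0
	}
	for _, n := range a {
		counts[n]++
	}
	return counts
}

// addNode moves whole partitions from the most loaded nodes to the new node
// until the new node holds roughly its fair share. Keys never change
// partition; only partition ownership changes.
func addNode(a assignment, nodes []string, newNode string) []string {
	nodes = append(nodes, newNode)
	fairShare := len(a) / len(nodes)
	for countByNode(a, nodes)[newNode] < fairShare {
		counts := countByNode(a, nodes)
		// Find the most loaded node and steal one of its partitions.
		busiest := nodes[0]
		for _, n := range nodes {
			if counts[n] > counts[busiest] {
				busiest = n
			}
		}
		for pid, owner := range a {
			if owner == busiest {
				a[pid] = newNode
				break
			}
		}
	}
	return nodes
}

func main() {
	// 12 partitions spread over 3 nodes, 4 each.
	a := assignment{}
	nodes := []string{"n1", "n2", "n3"}
	for pid := 0; pid < 12; pid++ {
		a[pid] = nodes[pid%3]
	}
	nodes = addNode(a, nodes, "n4")
	fmt.Println(countByNode(a, nodes)) // roughly 3 partitions per node
}
```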


Thank you for reading! If you enjoyed it, please clap 👏 for it.
