Distributed Data Systems Patterns
This is a series of posts that tries to uncover the common patterns in Distributed Data Systems (referred to in these posts as DDS). The objective is to balance theory and practicality and cover the major aspects in the quickest and crispest way possible. DDS refers to a wide range of systems like Cassandra, DynamoDB, HBase, Kafka, Redis, Couchbase, Zookeeper, Elasticsearch, and even clustered relational DBs. In short, it covers any system which deals with data (irrespective of the data model, storage approach, or any other constraints) in a distributed environment. These systems fall under the categories of databases, caches, key-value stores, queues/messaging/streaming systems, etc.
The topic is huge and daunting to fully understand even after reading half a dozen books; on top of that, it is an active area of research, so it keeps evolving. These posts focus on the fundamentals and core patterns of distributed data systems.
Distributed Data Systems Top Down
Mikito Takada: Distributed programming is the art of solving the same problem that you can solve on a single computer using multiple computers.
The larger the number of nodes/computers in a DDS, the higher the probability that some of them fail. A larger number of components also affects performance, as the need for communication among nodes increases. So, clearly, going distributed comes with a cost!
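To get a feel for this, here is a tiny back-of-the-envelope sketch; the 99.9% per-node availability figure is purely illustrative, and failures are assumed to be independent:

```python
# Probability that at least one node is down, assuming independent failures
# and a hypothetical per-node availability of 99.9% (illustrative numbers).
per_node_availability = 0.999

for n in (1, 10, 100, 1000):
    p_all_up = per_node_availability ** n
    print(f"{n:>4} nodes -> P(at least one node down) = {1 - p_all_up:.3f}")
```

Even with fairly reliable individual nodes, a large enough cluster is almost always running with something broken somewhere.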
Let’s quickly cover reliability and performance before resuming with patterns.
In spite of all the distributed system constraints (discussed in more detail here), we want to design systems that are scalable (able to handle increased load gracefully), performant (low latency / short response times, high throughput, low resource utilization), and available (i.e. uptime/total_time should be close to 1). Latency/response time is usually reported in percentiles like P95 (95th percentile), P99 (99th percentile), etc., while availability targets are usually expressed as a percentage of uptime. If the 95th percentile response time over the last 5 minutes is 1.5 seconds, that means 95 out of 100 requests took less than 1.5 seconds, and 5 out of 100 requests took 1.5 seconds or more.
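As a minimal sketch of how such a percentile is computed, sort the observed response times and index into them; the latency samples below are generated at random purely for illustration:

```python
# Minimal sketch: compute latency percentiles from a list of response times.
# The samples below are generated at random purely for illustration.
import random

def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

latencies_ms = [abs(random.gauss(200, 80)) for _ in range(10_000)]

print(f"P95: {percentile(latencies_ms, 95):.0f} ms")
print(f"P99: {percentile(latencies_ms, 99):.0f} ms")
```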
It’s important to keep in mind that availability is a much wider concept than uptime (which is easier to measure). A system might be up but unreachable or slow for its users due to network issues. Can we call such a system available? Certainly NOT! The true way to measure availability is from the perspective of the users, which is why it’s difficult to measure. One commonly used practice is to measure fault tolerance (i.e. the ability of a system to behave in a well-defined manner once faults occur). To be able to design a fault-tolerant system, we need to identify the possible faults in the system (one or more nodes crashing, a node running out of memory, latency in data replication, etc.) so that we can monitor how the system behaves when these faults happen.
A reliable system should not let partial failures (faults) result in total failure. Apart from these indicators, for a data system, durability and consistency are equally important.
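Here is a minimal sketch of measuring availability from the user’s perspective: a hypothetical `probe()` stands in for a real end-to-end, user-facing request, and availability is simply the fraction of probes that succeed:

```python
# Minimal sketch: measure availability from the user's perspective by issuing
# user-facing requests and recording how many succeed. `probe()` is a
# hypothetical stand-in for a real end-to-end request against the system.
import random

def probe() -> bool:
    """Simulated user-facing request; ~1% of calls fail (illustrative only)."""
    return random.random() > 0.01

results = [probe() for _ in range(10_000)]  # in practice, probe on a schedule
availability = sum(results) / len(results)
print(f"Observed availability: {availability:.3%}")
```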
The irony is that we can’t have all the good things discussed above; one comes at the cost of the other. So, it gets a bit tricky!
As a system designer, we should always ask our stakeholders about what’s important to them and then design accordingly. For a cache system, durability might not be important. Similarly, an online system might be happy to be eventually consistent to reduce latency.
Now, back to patterns.
The diagram below covers the major patterns of DDS. I have tried to put all the major aspects of DDS into a single diagram. There is a lot going on in it, so if you find it confusing, just follow along with the rest of the posts and I am hopeful it will make sense :)
High-level patterns of DDS
These patterns are the secret sauce for transforming a bunch of unreliable components into a reliable system. The patterns in the above diagram are categorized into four main components. So, instead of attempting to decode the diagram in one go, I have divided it into multiple parts (did someone say divide and conquer? :D). If we understand each part in isolation, we understand the whole. The links below go over these components in my preferred order, but you can follow your own order as well.
- Consequences of Distribution: This post talks about the limitations that get imposed once you decide to go distributed. A separate post covers the CAP and PACELC theorems.
- Storage: Persistence is a core aspect of data systems. The diagram covers the two major approaches used by most systems. I didn’t blog about this topic as I wanted to focus on the core distribution patterns (but I may do it later).
- Distribution Mechanism (Partitioning & Replication): This is the most important aspect of DDS; I refer to the two together as PR (a small partitioning sketch follows this list).
- Consensus: Consensus is the basis of distributed coordination services, locking protocols, and databases.
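As a taste of the partitioning side of PR, below is a minimal consistent-hashing sketch in the spirit of the Dynamo paper listed in the references. The hash function, virtual-node count, and node names are illustrative choices, not what any particular system uses:

```python
# Minimal consistent-hashing sketch: keys and nodes are hashed onto a ring,
# and each key is owned by the first node clockwise from its hash.
# Hash function, virtual-node count, and node names are illustrative only.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # -> one of node-a / node-b / node-c
```

The appeal of this scheme is that adding or removing a node only remaps the keys whose hashes fall on the affected portions of the ring, instead of reshuffling every key across the cluster.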
References:
- Dynamo Paper: A highly valuable paper that popularized the concept of consistent hashing in distributed DBs.
- Distributed Systems for Fun and Profit
- Notes on Distributed Systems for Young Bloods
- Awesome distributed systems: A curated list of good papers/links.
- Consistent Hashing
- Designing Data-Intensive Applications: One of the most interesting books on the topic.
Thank you for reading! If you enjoyed it, please clap 👏 for it.