Consequences of Distribution in Data Systems

Siddheshwar Kumar
3 min read · Dec 3, 2019


Distribution basically means having more than one node. The nodes could be sitting side by side on the same rack in a data center, or in two data centers as far apart as the USA and India. These nodes serve requests from clients, and they also talk to each other for replication and for maintaining cluster health (like gossip in Cassandra). This post talks about the hard limitations of going distributed.

The author of Distributed Systems for Fun and Profit points out two physical limitations of distributed systems:

  1. Information travels at the speed of light, i.e. distance matters. Traveling at the speed of light, one could go around the Earth about 7.5 times in one second (a quick back-of-envelope calculation follows this list).
  2. Independent things fail independently. An increase in the number of nodes increases the probability of failure in the system.

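To put numbers on the first point: light circles the Earth's roughly 40,075 km circumference about 7.5 times per second, and light in optical fiber travels at only about two-thirds of that speed. A minimal sketch in Java; the ~13,000 km USA-India distance is my own rough assumption for illustration.

    public class SpeedOfLight {
        public static void main(String[] args) {
            double c = 299_792;                  // speed of light in vacuum, km/s
            double earthCircumference = 40_075;  // km
            System.out.printf("Trips around Earth per second: %.1f%n",
                    c / earthCircumference);     // ~7.5

            double fiberSpeed = c * 2 / 3;       // light in fiber, ~200,000 km/s
            double usaToIndiaKm = 13_000;        // rough great-circle distance (assumption)
            double oneWayMs = usaToIndiaKm / fiberSpeed * 1000;
            System.out.printf("Best-case one-way latency: %.0f ms, round trip: %.0f ms%n",
                    oneWayMs, 2 * oneWayMs);     // ~65 ms one way, ~130 ms round trip
        }
    }
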
The above two are facts of life that can't easily be wished away. It might sound pessimistic to say that anything that can go wrong will go wrong in a distributed system, but that's how it is!

Also, the failures will usually be partial rather than total. Partial failures are NOT easy to reason about and handle because they are nondeterministic. It's safe to assume that a Distributed Data System (DDS) with 1,000 nodes will always have something broken.

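As a quick sanity check on that claim, here is the arithmetic in Java; the 0.1% daily per-node failure probability is purely an illustrative assumption.

    public class AlwaysSomethingBroken {
        public static void main(String[] args) {
            // Assumption for illustration: each node independently has a 0.1%
            // chance of failing on any given day.
            double pNodeFailure = 0.001;
            int nodes = 1000;

            // P(at least one node is broken) = 1 - P(all nodes are fine)
            double pSomethingBroken = 1 - Math.pow(1 - pNodeFailure, nodes);
            System.out.printf("P(something is broken today): %.0f%%%n",
                    pSomethingBroken * 100); // ~63%
        }
    }
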
To build such a system, we need to fully understand a wide range of problems:

  • Unreliable Networks: Network packets may be lost or delayed for an unbounded time, switches/routers might be misconfigured, and undersea cables might get damaged. This uncertainty about the network makes it difficult to tell whether a node is working or not. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The line between right and wrong often gets blurred (and democracy wins); see the quorum sketch after this list.
  • Unreliable Clocks: A DDS needs to be concerned about time because the order of concurrent operations decides the (final) value of an object. It's impossible for all nodes to share ONE clock (or to read the same time from their own clocks), so the natural state of a distributed system is a partial order: neither the network nor the independent nodes make any guarantees about relative order, but each node observes its own local order. It is possible to synchronize clocks to some degree through NTP (Network Time Protocol) servers, but NTP's synchronization accuracy is itself limited by the network round-trip time (RTT). So, at best, we can only achieve partial ordering; the clock-skew sketch after this list shows one consequence.
  • Process Pause: A process can pause in a VM, the OS, or a runtime environment like the JVM (a GC pause). GC pauses are usually quite short, but stop-the-world GC pauses have been known to last several minutes. Even so-called concurrent garbage collectors like the HotSpot JVM's CMS cannot run fully in parallel with the application code; they need to stop the world from time to time. A paused application thread can have serious consequences, as the lease sketch at the end of this section shows.
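
To make the first bullet concrete, here is a minimal quorum sketch in Java. The peer names, the vote format, and the simple majority rule are simplifications of mine; real failure detectors (such as Cassandra's phi-accrual detector) are far more sophisticated.

    import java.util.Map;

    // Minimal sketch: each peer votes on whether it has heard a recent
    // heartbeat from the suspect node. A strict majority of "dead" votes
    // means the cluster treats the node as dead, even if the node itself
    // is alive but cut off by a network partition.
    public class QuorumFailureDetector {

        static boolean consideredDead(Map<String, Boolean> heartbeatSeen) {
            long deadVotes = heartbeatSeen.values().stream()
                    .filter(seen -> !seen)
                    .count();
            return deadVotes > heartbeatSeen.size() / 2;
        }

        public static void main(String[] args) {
            // Hypothetical peers: true = recent heartbeat seen, false = suspected dead.
            Map<String, Boolean> votes = Map.of(
                    "node-a", false,
                    "node-b", false,
                    "node-c", true);
            System.out.println("Declared dead: " + consideredDead(votes)); // true
        }
    }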

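The clocks bullet bites hardest when conflicting writes are resolved by wall-clock timestamps (last write wins): a node with a fast clock can silently win over a later write from a node with a slow clock. A minimal sketch, with the skew values invented for illustration:

    public class ClockSkewLww {
        public static void main(String[] args) {
            long realTime = 1_000_000L;    // "true" time, which no node can observe directly

            // Assumed skews: node A's clock runs 50 ms fast, node B's 30 ms slow.
            long nodeAClock = realTime + 50;
            long nodeBClock = realTime - 30;

            long writeA = nodeAClock;      // A writes now: timestamp 1,000,050
            long writeB = nodeBClock + 10; // B writes 10 ms later: timestamp 999,980

            // Last-write-wins by timestamp keeps A's value and silently drops B's,
            // even though B's write actually happened later in real time.
            String winner = writeA > writeB ? "A" : "B";
            System.out.println("LWW keeps the write from node " + winner
                    + "; node B's later write is lost");
        }
    }
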
These constraints make it difficult to implement a distributed system that acts as a single system. Any DDS has to embrace these limitations and still serve us well; a DDS is about building a reliable system from unreliable components.
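
To round off the process-pause point above: a classic failure mode is a lease that expires in the middle of a pause. The node checks that its lease is valid, a long GC pause hits, and by the time it resumes another node may already hold the lease. The 100 ms lease and the simulated pause below are my own illustrative numbers.

    public class LeaseDuringGcPause {
        public static void main(String[] args) throws InterruptedException {
            // Assume this node holds a lease that is valid for another 100 ms.
            long leaseExpiry = System.currentTimeMillis() + 100;

            if (System.currentTimeMillis() < leaseExpiry) { // the check looks safe...
                // ...but a stop-the-world GC pause can strike between the check
                // and the action. Thread.sleep stands in for that pause here.
                Thread.sleep(200);

                // The lease has expired and another node may already own it,
                // yet this node is about to act as if it were still the holder.
                boolean stillValid = System.currentTimeMillis() < leaseExpiry;
                System.out.println("Lease still valid when we act: " + stillValid); // false
            }
        }
    }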

Consequences of Distribution

Distributed systems build abstract models to help navigate this uncertain world. While designing a distributed data system, the constraints discussed above need to be kept in mind: we can't have all the good things at the same time, which leads to tradeoffs such as Availability vs. Consistency and Latency vs. Consistency. I have discussed some of these models in a separate post here.

References:

Mikito Takada, Distributed Systems for Fun and Profit: http://book.mixu.net/distsys/

Thank you for reading! If you enjoyed it, please clap 👏 for it.


Written by Siddheshwar Kumar

Principal Software Engineer; I enjoy building highly scalable systems. Before moving to this platform, I used to blog at http://geekrai.blogspot.com/
