Building Resiliency in Microservices

Siddheshwar Kumar
10 min read · Mar 13, 2023

Gone are the days when applications used to be two or three tiers (web, app, and DB). Now we have tens (and in some cases even hundreds) of interlinked services, most with their own databases. All these components (or subsystems) stack together to serve end users. A fault in one of the services sitting five network calls away from the user could cascade all the way up and result in a failure. This is where resiliency becomes important. Let’s start with the literal meaning of resilience.

Resilience: the capacity to withstand or to recover quickly from difficulties.

A resilient microservice is able to withstand some of the faults in itself or its dependencies. So resiliency is a measure of how many faults (or for how long) the service can tolerate before becoming unusable.

“Ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.” — Wikipedia

Below are two examples of resilient services.

  • What if the service gets double the load it is expected to serve? A resilient service might slow down, and might start failing fast if it can’t handle requests within the allowed time, but it should recover once the load comes back to a normal level.
  • What happens if one of the nodes (out of 10) goes down? Each remaining instance now gets roughly 11% more traffic. Does the whole system collapse, or can it withstand the extra load while you add the node back to the cluster?

Let’s consider the architecture below, which is part of an e-commerce platform that provides a list of articles to be displayed in the app. The app calls the Catalog Service to fetch articles, and the Catalog Service in turn relies on three dependencies: it fetches article details (color, size, price, etc.) from the Article Metadata Service, checks whether an article is available via the Stock Service, and sorts/personalizes the articles for a given customer using the Personalization Service.

Let’s cover some of the patterns that can help achieve resiliency in services and, in turn, in the platform/application.

Timeouts

A timeout is a mechanism to stop waiting for a response from a service when you think it won’t come. It helps with fault isolation, as a dependency’s problem shouldn’t become the caller’s problem. Timeouts are not only relevant for network calls: resource pools that block threads must also have a timeout, so that the calling thread unblocks if the resource is not available.

Timeouts can easily get overlooked, but they are very important for microservices. Waiting too long can slow down the whole system (and can lead to cascading failures), while timing out too quickly can fail an otherwise successful request.

If a page takes too long to load, the user might give up hope and refresh it, causing an additional inbound request. Every external API call should have a fixed timeout, and it should be validated and updated as the system’s behavior changes (based on latency metrics).
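
As a rough illustration, here is a minimal Python sketch (using the requests library) of the Catalog Service calling the Article Metadata Service with an explicit timeout; the URL and the timeout values are assumptions, not values from a real system:

```python
import requests

# Hypothetical internal endpoint of the Article Metadata Service
ARTICLE_METADATA_URL = "http://article-metadata.internal/articles"

def fetch_article_metadata(article_ids):
    """Fetch article details with explicit connect/read timeouts."""
    try:
        # (connect timeout, read timeout) in seconds -- tune from latency metrics
        resp = requests.get(
            ARTICLE_METADATA_URL,
            params={"ids": ",".join(article_ids)},
            timeout=(0.2, 1.0),
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout as exc:
        # Stop waiting: the dependency's slowness must not become the caller's problem
        raise RuntimeError("Article Metadata Service timed out") from exc
```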

Timeouts can avert cascading failures. Also consider delayed retries, as an immediate retry after a network issue might not help.

If the Catalog Service waits too long for even one of its dependencies, the caller ends up waiting longer for a response. This leads to a poor user experience (and hence an unhappy user).

Retries

You gave up waiting for a response on a page and refreshed it; that’s basically a manual retry. The same thing can be done programmatically by the HTTP client in the browser, or by a microservice.

Some issues with downstream calls are temporary. Packets can get lost, or the gateway can see an odd spike in load that causes the timeout. Retrying the call often works in these cases.

The obvious question is: when should we retry? Retries should be driven by HTTP status codes. For a 404 (Not Found), a retry is not going to help at all. But temporary errors like 408 (Request Timeout), 502 (Bad Gateway), 503 (Service Unavailable), or 504 (Gateway Timeout) are worth retrying. Sometimes the downstream doesn’t report the specific 5xx error and always returns 500; it still makes sense to retry in that case (though a more specific status code would be desirable).

If the initial timeout or error was caused by a stressed downstream service, bombarding it with additional requests is not going to help. In that case, it’s preferable to retry after some delay.

It’s a good idea to have a timeout budget for each operation and to cap the number of retries based on it. If the budget is already exhausted by the second retry (out of a total of three), it’s better to stop. In our example, it makes sense for the Catalog Service to retry the call to the Article Metadata Service, but if the Personalization Service call fails, it can be ignored (falling back to a default sort or no sorting).
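
A minimal sketch of such a retry policy in Python is shown below; the retryable status set, attempt count, backoff delays, and budget are illustrative assumptions that you would tune for your own system:

```python
import random
import time

import requests

# Temporary errors that are usually worth retrying (a 404 or 400 is not)
RETRYABLE_STATUS = {408, 500, 502, 503, 504}

def get_with_retries(url, *, attempts=3, per_try_timeout=1.0, budget_seconds=3.0):
    """Retry transient failures with a small delay, within an overall time budget."""
    deadline = time.monotonic() + budget_seconds
    last_error = None
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=per_try_timeout)
            if resp.status_code not in RETRYABLE_STATUS:
                return resp  # success, or a non-retryable error such as 404
            last_error = RuntimeError(f"retryable status {resp.status_code}")
        except requests.RequestException as exc:  # connection error, timeout, etc.
            last_error = exc
        # Delayed retry (exponential backoff with jitter); stop if the budget is spent
        delay = (2 ** attempt) * 0.1 + random.uniform(0, 0.1)
        if time.monotonic() + delay >= deadline:
            break
        time.sleep(delay)
    raise last_error
```

The Catalog Service could use something like this for the Article Metadata call, while skipping retries entirely for the Personalization call.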

Bulkheads

In a ship, bulkheads are partitions that, when sealed, divide the hull into separate watertight compartments. With the hatches closed, a bulkhead prevents water from moving from one section to another, localizing the damage (and preventing the whole ship from sinking). We can use the same technique in software: build in redundancy and isolation to limit the damage, such as keeping multiple copies of data (as distributed databases do), running multiple independent instances of a service so that a hardware failure in one doesn’t impact the others, or limiting CPU for a service Pod so that one rogue process doesn’t take up all the CPU on the host (as Kubernetes does).

A service can partition its available threads into different groups and allocate them to specific tasks, for example by creating a separate connection pool for each downstream dependency. Bulkheads can prevent a fault in one component from leading to a total crash. Circuit breakers can be used alongside them as well.

Timeouts and circuit breakers help us free up resources when they are becoming constrained; bulkheads can ensure they don’t become constrained in the first place, for example by creating a dedicated thread pool in the Catalog Service for calls to the Personalization Service.

They can also give us the ability to reject requests in certain conditions so that resources don’t become even more saturated (aka load shedding). Sometimes rejecting a request is the best way to stop an important system from becoming overwhelmed and turning into a bottleneck for multiple upstream services.
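
As a sketch of both ideas (assuming the Catalog Service runs these calls on threads; the pool sizes are made-up numbers), each downstream dependency gets its own bounded pool, and calls are rejected outright once that pool is saturated:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Bulkhead:
    """Bounded concurrency per downstream; excess calls are rejected (load shedding)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def submit(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            # The compartment is full: shed load instead of queueing without bound
            raise RuntimeError(f"{self.name} bulkhead is full, rejecting call")

        def wrapped():
            try:
                return fn(*args, **kwargs)
            finally:
                self._slots.release()

        return self._pool.submit(wrapped)

# One bulkhead per dependency: a slow Personalization Service can only exhaust
# its own threads, leaving the Article Metadata and Stock pools healthy.
article_metadata_bulkhead = Bulkhead("article-metadata", max_concurrent=20)
stock_bulkhead = Bulkhead("stock", max_concurrent=20)
personalization_bulkhead = Bulkhead("personalization", max_concurrent=5)
```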

Circuit Breakers

In our homes, electrical circuit breakers (and fuses) exist to protect our devices from spikes in power. When a spike occurs, the circuit breaker trips and power stops flowing, protecting expensive home appliances. The same technique is applied to software by wrapping expensive network operations with a component that can short-circuit calls when the downstream is not healthy.

With a circuit breaker in place, after a certain number of requests to the downstream resource have failed (due to errors or timeouts), the circuit breaker trips (i.e. OPEN) and all further calls that go through it fail fast. After a certain period of time, the service lets a few requests through (HALF-OPEN) to see if things have improved, and if it gets enough healthy responses, it resets the circuit breaker (i.e. CLOSED).

How you implement a circuit breaker depends on what a failed request means in your case. Usually it’s a good idea to count timeouts and a subset of the 5xx HTTP return codes. The tricky part is getting the configuration right: just as with timeouts, we shouldn’t pick a very short or a very long interval. It’s good to start with a reasonable value and then calibrate.
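
A hand-rolled sketch of this state machine in Python follows (the thresholds are arbitrary; in practice you would more likely reach for a library such as resilience4j on the JVM or pybreaker in Python):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN after repeated failures,
    OPEN -> HALF-OPEN after a cool-down, back to CLOSED on a healthy response."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit is OPEN, failing fast")
            self.state = "HALF-OPEN"  # cool-down over, let a trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```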

While the circuit breaker is OPEN, we have options for dealing with incoming requests. They can be queued and retried later if they are part of an asynchronous job, but for a synchronous request it makes sense to fail fast if no fallback is available. This could mean propagating the error up the call chain, or a more subtle degradation of functionality. The Catalog Service can fall back to a default sorting approach when the circuit breaker for the Personalization Service is OPEN.
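
Continuing the sketch above, the Catalog Service could wrap the Personalization call in the breaker and fall back to a default sort (personalize is a hypothetical function here):

```python
def get_sorted_articles(articles, customer_id, personalization_breaker):
    """Degrade gracefully: use default sorting when personalization is unavailable."""
    try:
        return personalization_breaker.call(personalize, articles, customer_id)
    except Exception:
        # Fallback when the breaker is OPEN (or the call fails): default price sort
        return sorted(articles, key=lambda a: a["price"])
```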

Circuit breakers can act as a mechanism to seal a bulkhead: they protect the consumer from the downstream problem, and they also protect the downstream service from additional calls that could make things worse through cascading failures. It’s good practice to mandate circuit breakers for all synchronous downstream calls.

The circuit breaker allows us to fail fast and avoid wasting valuable time (and resources).

Fail Fast

Timeouts are useful when we need to protect a service from a downstream failure. Fail fast is useful when we need to report early why we won’t be able to process a transaction/request. Fail fast applies to incoming requests, whereas timeouts apply primarily to outbound requests.

If a service can determine in advance that an operation will fail, it’s always better to fail fast. This way, the caller doesn’t tie up any of its capacity waiting and can get on with other work. The service handler can quickly validate the incoming request parameters and check the state of any relevant circuit breaker. Also make sure you report the right reason for the failure; a generic error may cause an upstream system to trip a circuit breaker just because a user entered the wrong input in a form/text field (i.e. a 4xx shouldn’t be reported as a 5xx error).
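
A sketch of such a handler (request is assumed to be a plain dict and build_response a hypothetical helper that assembles the article list):

```python
def handle_get_articles(request, personalization_breaker):
    """Fail fast: validate input and check breaker state before doing any real work."""
    customer_id = request.get("customer_id")
    if not customer_id:
        # Client error: report it as a 400, never as a 5xx
        return {"status": 400, "error": "customer_id is required"}

    if personalization_breaker.state == "OPEN":
        # We already know personalization will fail, so skip it immediately
        # and serve a default ordering instead of tying up a thread waiting.
        return build_response(customer_id, personalized=False)

    return build_response(customer_id, personalized=True)
```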

Back Pressure

Every service has a limit on the number of requests it can handle; beyond that, queuing kicks in at the socket, OS, or DB level. The queue has to be bounded, or it can consume all available memory (and latency grows with the length of the queue). We also need to decide what to do once the queue is full.

Block the producer/client? Accept the new item and drop the oldest one? Blocking the client is a way to apply back pressure, forcing the client to throttle down. The back pressure pattern works best with asynchronous calls. Back pressure also helps manage load only when the number of consumers is finite; otherwise there won’t be any systemic effect.
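
A small sketch with Python’s standard bounded queue shows both choices; the queue size and timeout are assumptions:

```python
import queue

# Bounded work queue: once it is full, producers either wait (back pressure)
# or get an immediate rejection they can surface as a 429/503.
work_queue = queue.Queue(maxsize=100)

def enqueue_job(job, *, block_producer=True):
    try:
        # Blocking with a timeout throttles the producer; block=False sheds load instead.
        work_queue.put(job, block=block_producer, timeout=0.5 if block_producer else None)
        return True
    except queue.Full:
        return False  # caller decides: reject, drop the oldest item, or retry later
```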

Rate Limit

Excessive use of an API by a few clients (or even one) can significantly limit the usefulness of the service for other clients. Simply adding more firepower (CPU, memory, bandwidth, or even scaling horizontally) is not economically viable. In this situation, rate limiting is used to cap the number of requests that can be made to a service over a period of time. If the service doesn’t enforce the agreed rate-limit contract, performance will suffer and the SLO/SLA may be breached. Rate limiting is also instrumental in mitigating (Distributed) Denial of Service attacks.

To enforce this, usage of the API needs to be monitored, which is possible if the service can identify the user/client (through authentication). This can also help with billing. If a client is expected to make 100 requests per second, the service should enforce that limit (by rejecting requests once the allowed limit is exceeded; the client, in turn, can apply throttling on its side).

Load testing can provide insight into the limits, breaking points, and observable behavior. The most appropriate place to implement rate limiting is usually the API gateway. To enforce it, we first choose the attribute used to identify a client, such as IP address, geolocation, or a client/customer ID. Once that is finalized, the service needs to pick a limiting strategy; the most common ones are fixed window, sliding window, token bucket, and leaky bucket.

The service should return HTTP code 429 (Too Many Requests) to signal to the client that it is applying rate limiting.
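
As one example of these strategies, here is a minimal (single-process, not thread-safe) token-bucket sketch in Python; the rate and burst capacity are illustrative, and a real gateway would keep this state in a shared store:

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, allows bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should respond with HTTP 429

# One bucket per client ID; 100 requests/second with a burst of 100 is just an example contract
buckets = {}

def check_rate_limit(client_id):
    bucket = buckets.setdefault(client_id, TokenBucket(rate=100, capacity=100))
    return bucket.allow()
```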

Proper HTTP status codes

Looking at this section’s title, you might be wondering how HTTP status codes can influence the resiliency of a service. Let’s take an example from our system: what if the Catalog Service blindly retries three times for every 5xx status code from the Stock Service? This happens if the Stock Service always returns 500, or if it returns appropriate 5xx error codes but the Catalog Service ignores them and uses a catch-all 5xx handler. Those retries waste valuable resources that could have been used for something important, and they hurt the resiliency of both services (Catalog and Stock).

So proper signaling and handling between two services is a MUST. Many of these problems happen due to poor handling and understanding of HTTP status codes. Some of the important and often-ignored status codes (complete list ref) are:

  • 201 -> Created
  • 202 -> Accepted
  • 401 -> Unauthorized
  • 408 -> Request Timeout
  • 424 -> Failed Dependency
  • 429 -> Too Many Requests (rate limiting)
  • 502 -> Bad Gateway
  • 503 -> Service Unavailable (usually overload)
  • 504 -> Gateway Timeout

As the service builder/designer, it’s your job to return status codes appropriately and also to ensure that your consumers understand them and don’t abuse your service.
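
On the provider side, a small sketch of what “returning the right code” might look like for the Stock Service (warehouse_client is a hypothetical dependency, and the mapping of exceptions to codes is an assumption):

```python
def get_stock(article_id, warehouse_client):
    """Map failure causes to specific status codes instead of a catch-all 500,
    so callers can decide whether a retry makes sense."""
    if not str(article_id).isdigit():
        return {"status": 400, "error": "invalid article id"}          # client error: don't retry
    try:
        stock = warehouse_client.lookup(article_id, timeout=0.5)
    except TimeoutError:
        return {"status": 504, "error": "warehouse lookup timed out"}  # transient: retry may help
    except ConnectionError:
        return {"status": 503, "error": "warehouse is unavailable"}    # transient: retry later
    return {"status": 200, "stock": stock}
```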

Is that all?

The resiliency patterns covered in this post are NOT a complete list; we can do a lot more: auto-scale on load spikes, pre-scale when a sudden surge is expected due to a planned (marketing) event, use caching where it helps, and implement proper observability to detect problems and trigger alerts so the team can act as early as possible. This was just an attempt to make you conscious of these patterns and improve the resiliency of your services.

Happy Learning! If you enjoyed it, please clap 👏 for it.

