How to approach System Design Interviews

Siddheshwar Kumar
May 4, 2022

And here comes yet another post on this topic :)

Even if you are well prepared, system design interviews can be challenging because there is a lot to cover in 40–45 minutes. So the obvious question is: can we use a generic approach to tackle any system design interview? Here is my attempt to bell the cat :D

I am discussing here a five-point model, RECAB (read it as /ˈriːkab/, similar to recap).

RECAB: Requirement, Estimation, Components, Architecture, and Bonus.

Requirement: understand the problem (5 min)

This is the first and one of the most important parts of the whole interview, as this is your chance to ask the right questions to understand the problem properly and to identify its scope. Interviewers sometimes intentionally hide certain details for you to discover by asking the right questions. This should roughly take 5 minutes.

  1. Identify the functional requirements of the system; most likely these will be stated clearly by the interviewer.
  2. Identify the non-functional requirements like SLAs/SLOs, which include latency, availability/consistency tradeoffs, number of users, concurrent requests, etc. Some of these may also be part of the functional requirements.
  3. Out-of-scope items are as important as the above two, as they help you focus on what is relevant. These could include user registration/login, Single Sign-On, SSL/HTTPS, security, or spikes in load.

It’s a good idea to jot down a couple of points under each of the three headings above, and refer to them throughout the discussion to make sure you are always focusing on what is important.

Estimation: capacity estimation and constraints (10–15 min)

This is where you can show off your estimation skills. One or more of the three areas below need to be considered:

  1. Load

Load is the number of requests the system is supposed to handle in a given time period (minute/hour/month, etc.). It is usually expressed as RPS (Requests Per Second) or QPS (Queries Per Second).

Keep in mind that (usually) stateless HTTP servers/services are not the bottleneck; the back-end servers/services which hold state (i.e. DBs) are going to be the bottleneck.

Let’s say we want to support a QPS of 1 million. The intent is to come up with the number of nodes/cluster configuration that can handle this traffic/load. If the question asks you to design a system for a given load, focus on the data/back-end part and not the front-end, which just acts as a delegator/aggregator/proxy.

Number of queries = 10^6 (1 million)
Let's start with a single node with 4 cores.
CPU time available for each query
=> 4 core-seconds / 10^6 queries
=> 4 μs

(Of course, it’s impossible to handle 1 million QPS on a single machine; this is just to get the numbers.)

So 4 μs per query is highly impractical (unless you have a supercomputer :D). Even in-memory systems (like Redis, DynamoDB, etc.) at best guarantee sub-millisecond (less than a millisecond) latency per query. This gives us the clue that we can add more machines (horizontal scaling) to get the job done.

So for this use case, to afford ~4 ms per query we need 1000 machines (and for 40 ms, 10k machines are required). We arrived at the numbers indirectly: if one machine gives each query a 4 μs budget, then 1000 machines let each query afford 4 ms :)

For 4 ms latency we need 1000 machines.
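This back-of-envelope arithmetic can be sketched as a small helper. The numbers (4 cores per machine, fully usable CPU) are illustrative assumptions, not benchmarks:

```python
import math

def cpu_time_per_query_us(qps: int, cores: int = 4) -> float:
    """CPU time available per query on one machine, in microseconds."""
    return cores / qps * 1_000_000

def machines_needed(qps: int, budget_ms: float, cores: int = 4) -> int:
    """Machines required so each query can spend `budget_ms` of CPU time."""
    core_seconds_per_second = qps * (budget_ms / 1000)  # total CPU demand
    return math.ceil(core_seconds_per_second / cores)

print(cpu_time_per_query_us(1_000_000))  # 4.0 μs on a single 4-core node
print(machines_needed(1_000_000, 4.0))   # 1000 machines for a 4 ms budget
```

Plugging in a 40 ms budget gives the 10k-machine figure mentioned above.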

2. Data Storage

In the above case, we considered only the load on the system and completely ignored how much data the system is supposed to store/serve.

Usually the focus of the problem would be either load or data.

Let’s say the system is returning 1 million feeds per second (news, Twitter newsfeed, Instagram newsfeed, Quora newsfeed); in this case we need a data store which supports this throughput. Assume the system needs to store 100 GB (0.1 TB) of data. This might not be given to you directly: you might instead be given the number of users and a period like 5 years of feed storage, from which you can derive how much memory is required.

If there is no throughput expectation then it becomes easier; we can decide the number of nodes purely by cost analysis. Sometimes the question will just mention a “high performance system”, so it is a good idea to clarify and ask for a number. Even if QPS/throughput is not mentioned in the problem, we can calculate it.

If we have 32 GB RAM machines: 100 GB / 32 GB
~ 4 machines (including ~20% overhead due to fragmentation)
=> Practically, we would also keep replica copies
=> 12 machines (with 3 copies of the data)
=> or 24 machines with 16 GB RAM (lower-capacity machines are cheaper)
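The same estimate as a reusable sketch; the 20% fragmentation overhead and 3 replicas are the assumptions from above:

```python
import math

def machines_for_storage(data_gb: float, ram_gb: float,
                         overhead: float = 0.2, replicas: int = 3) -> int:
    """Nodes needed to hold `data_gb` in RAM, allowing for fragmentation
    overhead and `replicas` full copies of the data. Rough estimate only."""
    usable = ram_gb * (1 - overhead)        # RAM lost to fragmentation etc.
    per_copy = math.ceil(data_gb / usable)  # nodes for one copy of the data
    return per_copy * replicas

print(machines_for_storage(100, 32))  # 12 machines (4 per copy, 3 copies)
print(machines_for_storage(100, 16))  # 24 machines with smaller nodes
```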

3. Bandwidth

Assume that the service is getting 2TB of data every day (like facebook chat / whatsapp messages).

Incoming data / sec = 2 * 10^6 MB / 10^5 seconds    (~0.1 million seconds in a day)
= 20 MB/s (MBps)
= 160 Mbps

Also, if it’s a chat application then the same amount of data needs to go out to the receiving clients as well (ignoring multiple clients per user and group messages).

Some important numbers/points-

* 1000 Bytes => 1 KB; 1000 MB => 1 GB.
* Number of seconds in a day: ~0.1 million.
* A century is about 3.14 billion seconds; i.e. a billion seconds ≈ 30 years.
* 2^32 = 2^2 * 2^30 ≈ 4 billion = 4 * 10^9.
Representing 4 billion integers (4 bytes each) = 4 * 10^9 * 4 bytes = 16 GB.
* Average size of a chat message/tweet: 100 bytes.
* Java strings are stored in UTF-16 encoding, which uses 2 bytes per char (Java 9+ can store Latin-1 strings more compactly).
* If a CPU runs at 2.4 GHz, it executes 2.4 billion cycles per second (2.4 * 10^9 cycles).
Each instruction takes a certain number of cycles, so the clock speed determines how many instructions the processor can execute per second.
* Don't pay too much attention to clock speed; it is just one aspect of performance. RAM, graphics card, HDD/SSD, cooling, and L1/L2 caches are equally important.

Another pragmatic strategy to arrive at numbers-

There is a brute-force strategy to decide the number of nodes required for a service/component: run the application/service on a single host/VM/container and measure its performance with a load test. If a single node achieves a throughput of 100 QPS, then to achieve 1000 QPS we need 10 nodes.

Component: bottom-up view

Now it’s finally time to pick one or two core components/features and design them. If the problem says to store tweets/messages, then you need to go through storage and retrieval. But it’s a good idea to always check with the interviewer before proceeding. In this section, you can choose to go deeper into the DB or some other core component. Since the focus here is on a strategy which works for any problem, I am discussing a more generic thought process:

  1. Database

Even if the kind of DB the system requires is obvious to you, it is a good idea to go through the rationale and then suggest the specific DB.

  • Do we need a relational DB, or can a NoSQL store do the job? Talk about the requirement and reason from it.
  • If NoSQL is OK, then what kind of store? Key-value, document, or columnar.
  • Do we need an in-memory DB/cache for low-latency reads/writes?
  • Do we need a highly available DB? (If yes, NoSQL DBs which favor availability over consistency can be used.) This is a good point to bring in the consistency-vs-availability tradeoff.
  • Is eventual consistency OK? Talk here about what it means to be eventually consistent and the risk of diverged replicas.
  • If you are asked to achieve strong consistency, discuss CP systems and consensus algorithms.
  • Is there any possibility of reconciliation if the replicas diverge?
  • What should the partitioning approach be? (Range-based, hash-based, or consistent hashing.)
  • What should the replication strategy be? Sync/async, master-slave replication, or leaderless replication.
  • Do we need reliable persistence or durability? (If yes, then a write should be acknowledged only once it reaches the file system, not just a cache, on at least one node; even better, on at least two nodes.)
  • Good to ask: is a write considered successful once replicated, or once stored in the file system? Talk in detail about which DB you prefer and why; if you suggest Cassandra, explain the reasons and benefits in detail.
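To make the partitioning discussion concrete, here is a minimal consistent-hash ring sketch with virtual nodes. It is a toy illustration of the idea, not any particular database’s implementation; the node names and vnode count are arbitrary assumptions:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes. Adding or removing
    a node only remaps the keys that fall on its ring segments."""
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def get(self, key: str) -> str:
        """Route a key to the first vnode clockwise from its hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["db1", "db2", "db3"])
print(ring.get("user:42"))  # deterministic node assignment for this key
```

The same key always maps to the same node, and adding a fourth node would move only roughly a quarter of the keys.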

2. RESTful API design

Focus here on quickly identifying the resources and the possible operations on them. I would not suggest spending a lot of time in this section: once you have identified the business objects and operations, you can quickly suggest operations on those objects and move ahead.

  • Just use relative URIs like GET /api/{resources}/{resource}
  • PUT vs POST (use PUT if you have the resource id)
  • You can talk about idempotency (POST is usually not idempotent; GET is; PUT and DELETE can be made idempotent).
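The PUT/POST distinction can be illustrated with a toy in-memory store (no web framework; this only sketches the semantics, and the resource shapes are made up):

```python
import uuid

class ResourceStore:
    """Toy store illustrating idempotent PUT vs non-idempotent POST."""
    def __init__(self):
        self.items = {}

    def post(self, body: dict) -> str:
        """POST: the server assigns the id, so repeating it creates duplicates."""
        rid = str(uuid.uuid4())
        self.items[rid] = body
        return rid

    def put(self, rid: str, body: dict) -> None:
        """PUT: the client supplies the id, so repeating it just overwrites."""
        self.items[rid] = body

store = ResourceStore()
store.post({"name": "a"})
store.post({"name": "a"})               # two distinct resources created
store.put("42", {"name": "b"})
store.put("42", {"name": "b"})          # still only one resource "42"
print(len(store.items))                 # 3
```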

Architecture: connecting the boxes (10–15 mins)

Time to draw big, beautiful boxes and connect them with arrows! Make sure to keep communicating your thoughts while drawing the architecture, as it might be tough to capture all the details in the diagram. We don’t need to draw every detail, but it is very important to communicate assumptions explicitly to the interviewer. Depending on how much time you have, you can decide how much detail to go into.

  • Asynchronous components (like image/video processing): use a queue/streaming platform like Kafka
  • Cache / low-latency read/write: use Redis, Couchbase, or Memcached
  • File storage or block server (for images/videos): AWS S3, HDFS
  • Highly available and fault-tolerant persistence: Cassandra, Dynamo
  • Highly consistent storage: etcd
  • Storage which can help with membership management: ZooKeeper, etcd
  • Search / full-text search: Elasticsearch / Apache Solr
  • Reactive communication between client and server: use WebSockets instead of plain vanilla HTTP
  • Load balancers: AWS ALB for achieving high availability and distributing traffic/requests

Here is a general-purpose architecture diagram.

Bonus:

If everything has gone great so far (or even otherwise), the interviewer might ask some follow-up questions at the end.

These topics can go along with any component of the RECAB model. I have put them under Bonus as they don’t directly fit into the sections discussed so far, but they can very well come up alongside the above components. Some of them might also fall into the out-of-scope category of Requirements.

  1. Rate Limiting

Rate limiting means preventing the frequency of an operation from exceeding some constraint. In large-scale systems, rate limiting is commonly used to protect underlying services and resources; Istio, Apigee, Akamai, etc. are some of the services/platforms which support it. Rate limiting also protects a service/application against DDoS (Distributed Denial of Service) attacks.
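One common algorithm behind rate limiters is the token bucket; here is a minimal single-process sketch (the rate and capacity are arbitrary, and production limiters are distributed and thread-safe):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow bursts up to `capacity`, refill
    at `rate` tokens per second. A sketch, not production code."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
# 20 back-to-back requests: the initial burst of ~10 is allowed, the rest denied.
print(sum(bucket.allow() for _ in range(20)))
```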

2. SLO/SLAs

Clients and services engage in a Service Level Agreement (SLA), a formally negotiated contract in which they agree on several system-related characteristics, most prominently the client’s expected request rate distribution for a particular API and the expected service latency under those conditions. SLAs are expressed and measured at percentiles of the distribution (e.g. the 99.9th or 99.0th).

An example of a simple SLO is — a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
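An SLO like this can be checked against latency samples with a simple percentile calculation (the sample data below is made up for illustration):

```python
def meets_slo(latencies_ms, threshold_ms: float, percentile: float) -> bool:
    """True if at least `percentile` fraction of requests finished
    within `threshold_ms`."""
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms) >= percentile

# 999 fast requests and 1 slow one: exactly at a 99.9% / 300 ms target.
samples = [100] * 999 + [900]
print(meets_slo(samples, threshold_ms=300, percentile=0.999))  # True
```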

3. Concurrency

Concurrency is particularly important in a distributed environment, especially when the problem deals with data or shared state. Below are some strategies to handle it.

Pessimistic Locking/Mutual Exclusion: Provide mutual exclusion (ME) on the contended resource while mutation takes place. This is usually implemented with a locking strategy (threading primitives in the programming language). It is quite expensive, as it can consume many more CPU cycles than the business-logic transaction itself. Threads waiting to enter the critical section (in Java, via the synchronized keyword) effectively form a queue, and this queuing effect (Little's law) makes latency unpredictable and ultimately restricts throughput.

Optimistic Concurrency Control: The optimistic strategy involves taking a copy of the data, modifying it, then writing back the changes if the data has not been mutated in the meantime. If a change has happened in the meantime, you repeat the process until successful. Retries increase with contention and therefore cause a queuing effect just like ME. The ability to change data atomically is made possible by CAS instructions offered by the hardware.

Single Writer: Design the system so that any item of data/resource is only ever mutated by a single writer/thread. It is OK if multiple threads read the same data; CPUs can broadcast read-only copies of data to other cores via the cache-coherency subsystem. This scales very well (it is used in the Disruptor library and Redis).
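The optimistic read-modify-retry loop can be sketched with a version number standing in for a hardware CAS. This is a single-process illustration of the pattern, not how any particular database implements it:

```python
class VersionedCell:
    """Optimistic concurrency via compare-and-swap on a version number."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def cas(self, expected_version: int, new_value) -> bool:
        """Apply the write only if nobody else wrote since we read."""
        if self.version != expected_version:
            return False  # conflict: caller must re-read and retry
        self.value, self.version = new_value, self.version + 1
        return True

def optimistic_increment(cell: VersionedCell):
    while True:  # retry loop: re-read and try again on conflict
        value, version = cell.read()
        if cell.cas(version, value + 1):
            return

cell = VersionedCell(0)
for _ in range(100):
    optimistic_increment(cell)
print(cell.value)  # 100
```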

4. Push Notification

If there are mobile clients, then notifications are an important component. Users should be notified of email/chat messages and other important events.

For push notifications, each user can opt in from their device (or a web browser) to get notified whenever there is a new message or event. Each manufacturer maintains a set of servers that handles pushing these notifications to the user. To support push notifications, a notification server needs to be built which takes messages for offline users and sends them to the manufacturer's push notification server, which then delivers them to the user's device.

Cloud providers like AWS provide services (like SNS) to send push notifications or messages to mobile devices.

5. Data Deduplication

Data deduplication is a technique for eliminating duplicate copies of data to improve storage utilization. It can also be applied to network transfers to reduce the number of bytes that must be sent. For each new incoming chunk, we calculate its hash and compare it with the hashes of the existing chunks to see if the same chunk is already present in our storage.

We can implement deduplication in two ways: post-process deduplication or in-line deduplication.
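The in-line variant can be sketched in a few lines: hash each incoming chunk and store it only if the hash is new. The chunk contents below are made-up examples:

```python
import hashlib

def dedup_chunks(chunks):
    """In-line deduplication sketch: store a chunk only if its hash is new.
    Returns (stored_chunks, bytes_saved)."""
    seen, stored, saved = set(), [], 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in seen:
            saved += len(chunk)   # duplicate: skip storing, count bytes saved
        else:
            seen.add(digest)
            stored.append(chunk)
    return stored, saved

stored, saved = dedup_chunks([b"aaaa", b"bbbb", b"aaaa", b"aaaa"])
print(len(stored), saved)  # 2 8 -> two unique chunks, 8 duplicate bytes saved
```

Post-process deduplication would run the same hashing pass over data already written to storage instead of on the ingest path.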

6. Operations/deployment of system

In this section we can talk about everything required to run the system in a production/live environment. This could include the CI/CD pipeline, traffic switching, scheduled/on-demand scaling, dashboards monitoring the golden signals (latency, errors, load, etc.), post-mortems of incidents, and distributed tracing/observability.

Final thought:

The order presented in the RECAB model is my favorite way to approach the problem. Does it mean this is THE way? Of course not!

It’s important to be flexible and check with the interviewer before proceeding with each section and adjust as per the situation or interviewer’s feedback or (subtle) hints.

Happy learning!


Siddheshwar Kumar

Principal Software Engineer; enjoy building highly scalable systems. Before moving to this platform I used to blog on http://geekrai.blogspot.com/