It doesn’t matter if you’re just starting out, if you’ve been coding for years, if you're an SDE 2, or even if you're not a programmer at all and just work in tech:
System Design + Architecture is one of the most essential skills that you can learn this year.
That might sound too good to be true, but in today’s guide, we’re going to break down each of these points, as well as why you shouldn’t wait until the last minute to learn this, so let’s dive in…
The name kind of gives it away but System Design is ‘the ability to plan and devise a strategy for creating a system or application’ (i.e. how to design a functional system from scratch).
Sometimes also referred to as 'System Design + Architecture', probably because of the similarities with construction.
Just like how an architect would chat with their customers and then help them to design their dream house and all the elements that would be put into it, a System Design Engineer would do the same but with a Company's application or website.
System Design is a pretty large topic that covers everything from understanding the end goal of what you and your company want to achieve, figuring out how to make that happen, and then breaking that down into each part of a larger plan so that the system can be built.
This could cover all elements, such as:
It’s an incredibly important skill to have, but it’s more commonly seen as something that you cram at the last minute when applying for a Senior Programming role, since you need to know it as part of the technical interview process.
This is such a shame and a waste of a valuable skill set that you could be using right now.
I 100% believe that you should learn this today, regardless of your current skill level or even if you’re not yet applying for a Senior position.
Because in today's world, almost everything is built on distributed systems and will have multiple teams working on many moving parts. Even a simple rideshare app might have:
All of which need to work together smoothly for an efficient user experience.
Not only that, but they need to continue to work even when conditions change, such as growing traffic loads, because no one wants to crash their app or platform when a new product or feature launches, right?
Good system design will help save a company money in infrastructure costs, reduce downtime caused by server crashes and overload, improve the end user experience, and reduce the overall engineering brain drain between teams.
Seriously, how well an Engineer can plan and execute on this is a true mark of experience and seniority.
But like I’ve said already, you really shouldn’t wait until you’re applying for a Senior role to learn this…
So let’s take a deeper look at each of the 5 reasons why you should learn System Design today, regardless of your current level, position or career path goals.
Here’s a clear indication of how important it is to learn System Design: every single FAANG company now runs a System Design interview for all hires, even at entry level.
That means that if you’re coming in for a front-end or even a bottom-rung role, you still need to know this stuff.
Because FAANG is all about scale and skill set.
Not only do they want the very best people who understand the most important concepts, but they also want people who can build tech that can handle large traffic sources.
Only by understanding how your particular project works together as a whole, will you then know how to build in a way that won’t break.
The good news of course is that if you’re applying for FAANG at an entry level, you can still expect higher salaries than anywhere else.
If Google is willing to pay an entry-level salary of $188k for beginner coders who understand System Design, then clearly this is an important topic to know, right?
Maybe you’re not looking at a FAANG role, but you are looking for a Senior position in either your current or another company?
Well, because of the growing need to understand and build with distributed systems, almost every tech company will now have some form of System Design interview.
The interviewer may ask you to design anything from Facebook's News Feed to a rideshare app like Uber, or perhaps a URL shortener like Bitly.
During the interview, they also want you to cover everything that's involved with that system, such as infrastructure, networking, data shape, storage methods, distribution, redundancy, load control, and the list goes on and on.
Unfortunately though, many Engineers only come across System Design for the first time when they hear they need it for FAANG interviews, and then they frantically try to speed-run learning it, which is EXTREMELY difficult!
System Design is vast with numerous topics that are all complex and deep, requiring you to understand how to build not just your current project, but ANY system and all its possible pieces, so leaving it to the last minute to learn this is a really bad idea.
Most Engineers haven’t even been exposed to half these things let alone have the experience to know how to make the right decisions.
But because these large companies operate at a massive scale with very complex systems, they need Senior Engineers who can contribute to them, and that’s why it's such a key interview topic.
Seriously, start early and learn it now.
What if you’re not applying for FAANG and you’re not going to apply for Senior roles for a few more years? Heck, what if you’re just learning to code now for your very first programming job?
Even in these examples, it's still worth learning System Design today because it’ll help you to understand the larger, broader concepts that tie everything together.
Employers like team members who show initiative and want to learn more about how their work fits into everything. It's a sign of not only an inquisitive mind but also a drive to improve, which are all 5-star attributes.
This can lead to pay rises and you being more valuable to your company, which is never a bad thing.
With multiple tech layoffs happening, creating more value and building on your expertise is always a great idea when you want to either keep your job, or get hired elsewhere.
One of the most underrated benefits to learning System Design is simply understanding how everything works so that you can communicate it with other people and be part of important conversations and decisions.
Trust me on this. It’s far easier to get manager buy-in when the C-suite understands the reasons behind why it needs to happen.
“Solution A is more expensive up front than Solution B because of XYZ issue, but Solution A saves 70% of costs later on, while also providing a more efficient user experience, lowering bounce rate and churn.”
Boom! That’s money in the bank and easy approval for whatever you need to work on.
Seriously, there’s so much value in understanding and being able to communicate these larger systems, that I highly recommend learning System Design even if you’re not a Developer but work closely with dev teams.
You’ll be able to understand everything that’s happening, where your budget is going, timeframes, limitations, and more.
OK so the final benefit is one we’ve hinted at multiple times throughout this post but I want to give you a detailed example so that you can truly see how it can affect your work, decision making, and overall understanding of how systems work together.
Let’s look at a basic situation that might happen at work, so you can see the value in learning this today, and how the obvious answer might not be the right one…
Imagine you’re the Lead Engineer working for a growing e-commerce company with a basic system that looks like this:
You have users using your frontend application which sends requests to your backend servers.
Since you’re a new company, you don’t have that much traffic and one single instance of your backend server is enough for you to handle the traffic you currently receive.
Better yet, you know that your users are averaging a good response time of 100 milliseconds per request.
Everything is nice and fast and your users are happy, but then things start to slow…
Well, as you approach the holiday season, you start to see an increase in user traffic along with way more requests to your lone server. In fact, traffic has grown so much that response times have ballooned from 100 milliseconds to an average of 2 seconds per request!
Not great. This is a bad user experience, and many of your users are leaving your website before completing a purchase thanks to that slow performance.
Well, the reason more traffic is slowing down your server is because of the load on your current available resources. Most likely the initial cause is the amount of RAM on the machine that powers your server.
Simple enough so far right? If this is something you already work on, then you probably have a few ideas of how to fix this, but let's walk through 2 simple options.
At this point, there are two things you can do. You can either scale vertically (add more resources, such as RAM, to your existing server) or scale horizontally (add more servers and spread the traffic across them).
So which is the best option?
Well, at first glance, going horizontal and adding more servers requires you to make things a bit more complicated since you’d need to change your architecture and spend more on new hardware, whereas adding more RAM is quick and easy to do.
Because you're currently under time pressure to get this resolved during this Holiday season, scaling vertically seems like the obvious solution.
Not only that but you predict traffic to drop back down to normal numbers after the holiday, and so you might not want to take on the cost right away for more hardware just yet.
In this instance, you decide to vertically scale first by adding more RAM to the machine that runs your server, as you can always add more servers later on… but is it the best choice?
It seems like the right idea because the solution worked. After you add more RAM, you see your response time go down to 200 milliseconds per request, and so you breathe a sigh of relief and go home.
For the next week everything seems stable even as your traffic is steadily growing, and you can see that vertically scaling is working!
In a few days though, you know that Black Friday (a massive shopping day in North America, where all your users are) is happening, and you’re not sure if that RAM upgrade is enough.
So in preparation for this spike in traffic, you go ahead and add even more RAM to your servers to help handle it.
Black Friday sales begin at midnight on the dot and you stay up to watch the logs and make sure things go smoothly, but then midnight strikes and almost immediately you see a massive rush in requests to your server, more than any you’ve ever seen before. In fact, your server is getting more requests than in the past month combined!
You hope that you’ve added enough RAM to weather the storm but then you notice even more traffic continues to pile in. (Maybe an influencer announced how much they loved your product with their audience and now your traffic is through the roof.)
Uh oh!... It’s now 1 am and your server logs start telling you that request latency is increasing more and more, so much so that your site has now slowed down to 2 seconds per request.
The response time isn’t great but you’re too deep in the weeds to change anything now and it does seem the amount of requests has stabilized.
Shoppers are getting a bad user experience and you’re definitely losing sales, but they’re starting to check out, and you hope that their excitement for the deals pushes them through and keeps them waiting, as you continue to monitor the situation.
Then it happens. The moment you’ve been dreading, and you start to see a few requests fail.
A few more minutes later and almost half your requests are failing now! Latency is going through the roof, with response times exceeding 10 seconds until the inevitable happens and your server completely crashes…
All you can do is keep rebooting the system and hope your boss doesn’t blame you for losing all those sales.
On paper, adding more RAM should have solved the traffic load issue, but it failed.
Obviously, you don’t want this to happen again, so you do a post-mortem to see what the cause was. You start off by looking at what was going on around the time that requests started to fail.
After a bit of analysis, you notice that your server was receiving requests to pay from users that had already finished shopping...
On further inspection, you notice that you currently use a third-party payment provider to handle purchases. That's fine normally, but the issue is that it holds a longer-lived connection between the user making the request, your server, and the third-party service.
This means that during the time the payment request is being completed, those resources on your server are locked up and can’t be used.
And so the combination of the increased Black Friday traffic pushing your server to 100% capacity, and the first wave of users all trying to check out around the same time locking up even more resources, finally caused it to crash.
Sure, the initial solution of vertical scaling did help with your traffic load to some extent, even if response times climbed to 2 seconds once your server hit max capacity. And of course, server crashes do happen; there is always a non-zero chance that something fails. But it wasn’t the best solution here.
You could look at perhaps changing service providers for your payment system but that’s just a band-aid solution.
In reality, the system suffered from having a single point of failure (a single, lone server), and this is why understanding System Design is so important.
In this example, it wasn't just a load issue, but more the fact that when something DOES fail in your system, the entire offering goes down with it, meaning that the system is not resilient to failure.
Clearly, there's a problem with the architecture and design that we need to look at, so let’s look at some solutions.
We mentioned earlier the option of horizontally scaling instead of vertically, but decided against it.
However, if we had more System Design experience and a deeper understanding of the system, we could have predicted that the vertical option wouldn’t work, and we would probably have decided on horizontal scaling right away.
Sure, it takes more time to set up and a little more effort, but horizontal scaling would have both helped to solve your latency issue without the need for more RAM, and it would have also eliminated the single point of failure issue that came with just vertically scaling. (And you could even scale vertically after the fact if you needed to).
However, to scale horizontally you would need time to change up your architecture, so let’s explore what that would look like, and how it could solve this problem.
In this example, we’ve increased the number of servers to three.
What's important to note is that:
The key thing however is that they are isolated from each other, meaning they are not aware of each other nor do they interact with each other.
They work independently and simply receive requests and send back responses to wherever the request comes from.
However, we need to be able to properly receive requests from users and spread them out appropriately amongst the servers.
In fact, our frontend applications have no idea which servers to hit, so we need to centralize the requests to one entry point to our backend, and let the backend handle the rest.
We can do this through the use of a load balancer (also called a reverse proxy), which is essentially a server that accepts incoming requests and applies some logic to route them to our actual servers.
Adding one would update our architecture to the following:
This allows all our frontend applications to send requests to our backend system with the load balancer sitting as the single entry point. The load balancer can then distribute the load across the three servers!
Now if any server gets overloaded and goes down, we have two more servers still running, and so our entire system stays up.
This is a simple enough solution so far, but now we need to think about HOW we want to balance the load, and so we need to consider some load-balancing strategies. Again, this is why understanding System Design is so important, because even this solution might not be perfect.
Because how we design and set that load balancing up can affect if the system works well, or if it still fails, so let’s look at a few options…
Random selection is exactly as it sounds. For every request the load balancer receives, it chooses any of the servers available at random, and then sends it to one of them.
In theory, each server has an equal chance to be chosen, which means each server has an equal chance to receive the request. If we have three servers, each server has a 33% chance to receive the request.
The problem, of course, is that this distribution is only true after A LOT of requests. Over enough requests it averages out, so that each server should eventually receive about 33% of the total load.
However, if you were to look at any specific moment in time where there is a smaller amount of requests, the load will be unevenly balanced.
A snapshot of our request routing at almost any given time will look unbalanced, like the following:
This may seem counterintuitive but it becomes really easy to understand when you think about flipping a coin and counting the number of heads and tails.
We know that flipping a coin yields a 50% chance of landing on heads or tails, but that doesn’t mean that by flipping the coin 10 times, you’ll alternate between heads and tails equally.
You can easily end up with heads 4 times in a row before another tails instance. You could end up with 7 heads and 3 tails, or 4 heads and 6 tails, or even 9 heads and 1 tail. These are all possible outcomes from 10 coin flips.
However, we know that if we flipped the coin way more times... let's say 10 million times and then looked at the results of ALL the coin flips, it will be much closer to an equal amount of heads and tails (~5 million occurrences each).
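This coin-flip intuition is easy to check with a quick simulation. Here's a minimal Python sketch (the server names, seed, and request counts are made up for illustration):

```python
import random
from collections import Counter

def random_select(servers, num_requests, seed=42):
    """Route each request to a uniformly random server and tally the load."""
    rng = random.Random(seed)
    counts = Counter({s: 0 for s in servers})
    for _ in range(num_requests):
        counts[rng.choice(servers)] += 1
    return counts

servers = ["server_1", "server_2", "server_3"]

# A small snapshot is usually lopsided...
print(random_select(servers, 10))

# ...but over many requests each server converges toward ~33% of the load.
big = random_select(servers, 1_000_000)
for server, count in big.items():
    print(server, round(count / 1_000_000, 3))
```

Run it a few times with different seeds and you'll see the 10-request snapshot swing wildly while the million-request totals barely move.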
Obviously, this approach will not work for us with load balancing, as we saw that at any given time we will most likely have a heavily unbalanced distribution of requests, and our servers cannot handle unbalanced distribution because this could lead them to crash.
When a server has more requests than it can handle, it has a few options:
One option is to queue the requests, meaning that the requests will be handled in the order they are received. All requests will eventually get a response, but only after all other requests in front of them have been handled.
Another option is for our server to shed the excess requests it can’t handle, meaning those users either get no response or receive a failed request. Both options are bad, but shedding is particularly bad since requests are dropped entirely.
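To make those two behaviors concrete, here's a toy Python sketch (the class, queue limit, and request names are invented for illustration): the server queues what it can and sheds the rest.

```python
from collections import deque

class Server:
    """A toy server with a bounded queue: excess requests are shed (dropped)."""

    def __init__(self, max_queue=3):
        self.queue = deque()      # requests waiting to be handled, in arrival order
        self.max_queue = max_queue
        self.shed = []            # requests we dropped; these users see a failure

    def receive(self, request):
        if len(self.queue) < self.max_queue:
            self.queue.append(request)   # queued: will eventually get a response
            return "queued"
        self.shed.append(request)        # shed: the user gets a failed request
        return "shed"

server = Server(max_queue=3)
results = [server.receive(f"req_{i}") for i in range(5)]
print(results)  # the first 3 are queued, the last 2 are shed
```

Queued users at least get a (slow) response; shed users get nothing, which is why dropping requests is the worse outcome.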
We need to avoid overloading some servers while others are available and idle, which we can do with better load-balancing strategies, such as this next option.
Round robin simply takes the servers we have and routes each request to the next server in order. Round robin will guarantee that each server receives its equal share of requests.
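As a quick sketch (the server names are invented for illustration), round robin can be as simple as cycling through the server list:

```python
from itertools import cycle

def round_robin_router(servers):
    """Return a routing function that sends each request to the next server in order."""
    order = cycle(servers)
    def route(request):
        return next(order)
    return route

route = round_robin_router(["server_1", "server_2", "server_3"])
assignments = [route(f"req_{i}") for i in range(6)]
print(assignments)  # strict rotation: each server gets exactly 2 of the 6 requests
```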
While round robin is much better than random selection at balancing the load, it is only perfectly balanced when all requests take the exact same time to complete.
Experience tells us that in real life, request times are not all uniform. This is especially true when you have some requests which take the server longer to finish.
In the previous example, our bottleneck was where our payment system was requiring multiple trips to an external 3rd party payment service, which also had a timing delay.
The thing is, this previous flaw can still currently break our system, even with multiple servers.
Let me explain:
Imagine a scenario where, by chance, every third request happens to be a more time-consuming payment request. With three servers, this means that server 3 will receive all of the heavier requests while the other two servers receive the quicker-to-process ones.
Round robin will still continue to pass server 3 these requests even though it may be locked up processing the previous requests, while server 1 and server 2 are idle.
This is due to round robin not taking into account each server's current workload.
Why does this matter?
Well, in our original Black Friday example which crashed our server, it was due to the single server receiving these larger payment requests when it was at capacity that caused it to crash. In this scenario, the exact same thing would happen to server 3 so we definitely need to account for a lack of uniformity amongst requests!
If we don’t solve this issue, eventually server 3 will crash followed by another, and then the final one.
Sure it would last longer than a lone server, but why not fix this upfront during the planning and system design right!?
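We can see this skew with a tiny simulation (the costs are invented units, assuming every third request is a heavy payment request and round robin assigns requests strictly in order):

```python
def simulate_round_robin(num_requests, num_servers=3, heavy_cost=20):
    """Tally total work per server when every third request is heavy.
    Round robin ignores current load and just rotates through the servers."""
    load = [0] * num_servers
    for i in range(num_requests):
        cost = heavy_cost if (i + 1) % 3 == 0 else 1   # every 3rd request is a payment
        load[i % num_servers] += cost                  # round robin: rotate in order
    return load

print(simulate_round_robin(9))  # the last server ends up with all the heavy work
```

With three servers and nine requests, the first two servers each do 3 units of work while the third does 60, exactly the lock-up scenario described above.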
So here are two other (yet similar) strategies we could use:
Both of these methods would be useful here because both strategies prioritize routing requests to the server instance with the least amount of load.
They work like this:
Join shortest queue is exactly as it sounds, in that it routes a new request to the server with the lowest queue length.
Least connections on the other hand has to do with routing new requests to the server with the lowest number of active connections open with clients (i.e. front end applications).
For the sake of clarity, an active connection is an open connection between a server and a client (i.e. a front-end application) for requests and responses. You need one when multiple requests from a specific client session have to keep going to the same server, because persisting the session on the server is important, such as in a live chat!
In the case of least connections, the load balancer will route requests to the server with the least active connections since maintaining these connections actively consumes resources from the server.
In our case though, we can instead use join shortest queue, setting it up so that our servers tell our load balancer their current load/queue size.
Why use this?
Because the size of the queue is not necessarily JUST the number of requests it’s processing, as we know not all requests are equal.
If one server is locked up with one MASSIVE request, we can account for that by having the server assign itself a queue size of 1.
Server 2 may have already completed all of its small requests, so it can assign itself a queue size of 0.
In this case, our load balancer picks the server with the lowest queue size, which is server 2.
We can also combine join shortest queue with round robin if we notice that servers have the same queue size, which can further equally balance the load!
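Here's a minimal sketch of join shortest queue with a round-robin tie-break (the class and its bookkeeping are invented for illustration; in a real system the servers would report their queue sizes to the load balancer):

```python
class JoinShortestQueueBalancer:
    """Route each request to the server reporting the smallest queue size.
    Ties are broken round-robin style by rotating the scan's starting index."""

    def __init__(self, num_servers):
        self.queues = [0] * num_servers  # reported queue size per server
        self.start = 0                   # rotating tie-break position

    def route(self):
        n = len(self.queues)
        # Scan from a rotating start so servers with equal queues share requests fairly.
        offset = min(range(n), key=lambda i: self.queues[(self.start + i) % n])
        chosen = (self.start + offset) % n
        self.start = (chosen + 1) % n
        self.queues[chosen] += 1
        return chosen

    def complete(self, server):
        """Called when a server finishes a request and its queue shrinks."""
        self.queues[server] -= 1

lb = JoinShortestQueueBalancer(3)
lb.queues = [1, 0, 0]   # the server at index 0 is stuck on one massive request
print(lb.route())       # picks index 1, the shortest queue
```

With `complete()` decrementing the count whenever a request finishes, the balancer always has a rough view of each server's current load, which is exactly what round robin was missing.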
Makes sense right? Now our servers can handle traffic at scale, and won't fail if one of them is under load.
There are numerous other load-balancing strategies, but for the problem we encountered, we now have a much better solution than before: one that should survive next year's holiday season, as well as continued growth as the company expands and more influencers and affiliates send bursts of traffic our way.
Can you see how a fundamental knowledge of System Design helped us to both understand the system's needs, and how to best solve them?
What seemed like the obvious solution at first failed when it was needed most, because we were only looking at the immediate issue, and not at how it fits into the larger system.
Even the second option would still have failed if we hadn't balanced the load properly. That's why it's so important to learn System Design, regardless of where you are in your career right now, because it has so many benefits.
Now of course, there’s much more to System Design + Architecture than this.
In this example, we simply looked at vertical vs horizontal scaling and load balancing, but there are many other topics to learn in System Design that impact how we can solve real-world business problems we encounter with the solutions we create.
If you haven’t started learning this topic yet, then I highly recommend just diving in, and starting to understand the bigger picture.
You can check out my new course on System Design here. It covers everything you need to know to understand, analyze, and design systems.
Even better news? I'm going to be updating this with new content and scenarios over time as well! And... Part 2 is also already in the works where I'll cover specifically how to prepare and answer System Design interview questions.
Once you've taken Part 1, you'll definitely be able to stand out amongst your peers by asking your Seniors thoughtful questions about how and why they make the decisions they do regarding your company’s systems & architecture.
The best time to start is now! Don’t wait a month before your coding interview to learn this. Get a head start on it now. You'll start understanding your current role better, and when the time comes, you'll also be more than ready to crush that future interview.
By the way, you can also check out all the ZTM courses available here.