When your app crashes during a traffic spike, the first instinct is usually to scale up and add more servers.
Problem solved, right?
Well, not always. Sure, scaling works, but sometimes it doesn’t kick in fast enough, or worse, it just drives up your cloud bill without actually solving the root problem.
Because the real issue isn’t always how much traffic you're getting. Sometimes, it's how that traffic arrives. Too many requests hitting all at once. No prioritization. No protection for critical services. No breathing room for your backend to keep up.
That’s where traffic shaping comes in.
In this guide, I’ll walk you through what traffic shaping actually is, how it works under the hood, and why it’s one of the most important tools for modern DevOps teams — especially in the cloud.
I’ll also break down some of the core traffic shaping strategies, how AWS handles them, and the mistakes that trip people up when they try to implement shaping for the first time.
Sidenote: If you find that you’re struggling with the questions in this guide (or want to build a few impressive projects for your portfolio), then check out my AWS Certified Cloud Practitioner course, my AWS Certified Solutions Architect Bootcamp, or my portfolio project on How to build an end-to-end web app with AWS.
All these courses (and more) are included in a single ZTM membership!
With that out of the way, let’s get into this guide.
Traffic shaping is the process of regulating how traffic gets through to your system and when, so that your system can handle it properly.
For example
Think of it like a nightclub. You don’t want everyone rushing the door at once because you’ll overwhelm security, the bar, the bathrooms, everything.
So instead, you let people in gradually. Maybe you hold back the line, maybe you prioritize VIPs, maybe you check IDs more slowly during a rush. That’s traffic shaping. You’re not stopping people from coming in, you’re just smoothing out the flow so things inside don’t fall apart.
Now apply that to an app or a backend service.
Say your system normally handles a few hundred requests per second, but suddenly, a traffic spike hits. Maybe your site got featured somewhere, or maybe a batch of users all log in at once.
Without traffic shaping, every one of those requests slams into your infrastructure at full speed. Your servers try to keep up. CPU spikes. Memory fills. Services start failing. Some users get through. Others get errors. Monitoring lights up. And now you’re firefighting.
Like I said in the intro, a lot of people reach for scaling up with more servers as the answer, and sometimes that works. However, if you’re not cloud based then scaling takes time, and if the spike is too fast or too large, your system might break before the extra capacity kicks in.
And sure, if you are cloud based then servers can be added faster and automatically, but scaling isn’t free. A fast surge in traffic can trigger extra compute, bandwidth, or API costs that stack up quickly.
And I know what you’re thinking:
“So what? More traffic is a good thing right?”
Yes, traffic is great, but it can be handled better so you’re not paying as much to serve it.
And it’s not just about cost. Some of that traffic might not even be real users. Bots, scrapers, or even DDoS attacks can flood your system with junk requests.
Without any kind of shaping, your backend treats those requests like any other, meaning that you’re wasting resources and pushing out legitimate users in the process.
This is why traffic shaping matters, especially in DevOps and cloud-native environments, because it gives you a way to slow things down before they break your system. You can prioritize certain types of traffic, limit how fast requests come in, or hold back low-priority services so critical ones stay responsive.
It’s not about being restrictive. It’s about being prepared.
Shaping helps you stay ahead of problems instead of reacting to them. And in a world where traffic patterns can change instantly — and costs can change with them — that kind of control is essential.
So now that you know what traffic shaping is, let’s look at the strategies used to implement it. We’ll walk through the most common ones, and more importantly, why a real system would use them and what happens when they’re pushed too hard.
The token bucket is one of the most widely used shaping models because it’s built to handle the kind of traffic that’s usually low and steady, but sometimes surges.
Here’s how it works
Imagine a bucket that fills with tokens at a fixed rate. Maybe 100 tokens per second.
Every time a request comes in, it needs to take a token to be processed. This means that if the bucket has tokens, then the request gets through instantly. However, if the bucket has run out of tokens, then the request has to wait for more tokens.
It’s kind of like a ticket booth at a cinema. You queue up, get your ticket, and go in to watch the film. But once the screening is sold out, you have to wait for tickets to the next showing before you can move on.
If traffic exceeds the refill rate for too long, the bucket empties and requests are delayed or rejected. It’s not perfect, but you do avoid sudden system crashes because you're enforcing limits based on time, not just volume.
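To make the mechanics concrete, here’s a minimal Python sketch of a token bucket (illustrative only; the rate and capacity numbers are made up, and in practice AWS implements this for you, as we’ll see later):

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: the bucket refills at `rate`
    tokens per second and holds at most `capacity` tokens, which
    is the burst size the system will tolerate."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill based on elapsed time, capped at the bucket's capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1            # request takes a token and proceeds
            return True
        return False                    # bucket empty: throttle the request

bucket = TokenBucket(rate=1, capacity=10)
results = [bucket.allow() for _ in range(11)]
# the first 10 requests burst through; the 11th has to wait for a refill
```

Notice that a quiet period refills the bucket, so short spikes get absorbed while sustained overload gets throttled.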
Where it shows up:
Why use this model?
The key idea here is burst tolerance so that the system doesn’t get overwhelmed. It’s perfect for systems where short traffic spikes are normal but sustained overloads are not.
Leaky bucket is stricter than token bucket because it’s not built for flexibility. It’s built for consistency.
How it works
Imagine a bucket again, but this time there are no tokens or movie tickets. Instead, there’s a limit on how many requests can be processed at once.
Kind of like a toll booth on a highway, or a bucket with a tiny hole at the bottom. You can pour in a lot of traffic, but only a fixed amount can flow out through the narrow opening.
This means that any extra requests sit in the funnel, waiting their turn. But if the funnel fills up and more traffic keeps arriving, the overflow spills out, and those excess requests get dropped, often returning a 429 Too Many Requests error.
Why use this model if it can lose requests?
The leaky bucket method is designed to protect the system's core functionality, even if it means rejecting some users.
It’s kind of like a concert ticketing site when everyone hits “buy” at the same time.
The system doesn't try to serve everyone instantly because it would crash and lose all the sales. So instead, it slows down, lets requests through one by one, and blocks the rest. That way, the service stays up, even if some people have to refresh and try again.
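Here’s a minimal Python sketch of the same idea (the capacity of 3 is purely illustrative): requests queue up in the funnel, drain at a fixed pace, and overflow gets rejected.

```python
from collections import deque

class LeakyBucket:
    """Minimal leaky-bucket sketch: a fixed-size funnel that drains
    one request at a time; overflow is rejected outright."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def submit(self, request):
        if len(self.queue) >= self.capacity:
            return False                 # funnel full: reject with a 429
        self.queue.append(request)
        return True                      # queued, will drain in turn

    def leak(self):
        # called on a fixed interval: process exactly one queued request
        return self.queue.popleft() if self.queue else None

bucket = LeakyBucket(capacity=3)
accepted = [bucket.submit(i) for i in range(5)]
# [True, True, True, False, False]: the funnel held 3, the rest spilled
bucket.leak()   # one request drains, freeing a slot for the next arrival
```

The constant drain rate is the point: no matter how chaotic the inflow, the backend only ever sees a steady, predictable stream.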
Where it shows up:
Not all traffic is equal, and some requests are more important than others. Certain workloads are critical while others are optional.
The goal of weighted fair queuing (WFQ) is to guarantee responsiveness where it matters most.
How it works
Incoming traffic is separated into different queues by request type, and each queue is assigned a “weight” that defines how often it gets served.
The high-priority traffic continues to flow smoothly, while the lower-priority traffic slows down or backs up.
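A simplified Python sketch (weighted round robin rather than true packet-level WFQ, and the queue names and weights here are invented): each queue gets serving slots in proportion to its weight.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """Serve named queues in proportion to their weights: in each
    scheduling round, a queue with weight w gets up to w turns."""
    order = []
    while any(queues.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if queues[name]:
                    order.append(queues[name].popleft())
    return order

queues = {
    "checkout": deque(["c1", "c2", "c3"]),    # critical traffic
    "analytics": deque(["a1", "a2", "a3"]),   # best-effort traffic
}
order = weighted_round_robin(queues, {"checkout": 2, "analytics": 1})
# checkout gets two serving slots per round to analytics' one:
# ['c1', 'c2', 'a1', 'c3', 'a2', 'a3']
```

Under load, the analytics queue backs up first while checkout keeps moving, which is exactly the behavior you want.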
Where it shows up:
RED (Random Early Detection) is all about avoiding congestion before it becomes a problem. Instead of waiting for a system to overload, it watches the queue length and starts dropping or delaying packets early, before things hit the limit.
It’s like seeing traffic building on a freeway and slowing cars down before they create a jam.
Why use this model?
Because in some systems, if you wait until you’re overwhelmed to act, it's already too late. RED spreads out the pain in small doses to avoid full-on collapse.
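The core decision can be sketched in a few lines of Python (a simplification of real RED, which uses a moving average of the queue length; the threshold numbers here are made up):

```python
import random

def red_should_drop(queue_len, min_th=20, max_th=60, max_p=0.1):
    """Random Early Detection, simplified: below min_th always accept,
    above max_th always drop, and in between drop with a probability
    that rises linearly toward max_p."""
    if queue_len < min_th:
        return False                     # queue is short: accept everything
    if queue_len >= max_th:
        return True                      # queue is saturated: drop everything
    p = max_p * (queue_len - min_th) / (max_th - min_th)
    return random.random() < p           # spread the pain in small doses
```

Because a few requests are dropped early and at random, well-behaved clients back off gradually instead of everyone hitting a hard wall at the same moment.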
Where it shows up:
Can you combine these strategies? Yes. In fact, they usually are combined, because most production systems mix shaping strategies to get the best of each.
Shaping is rarely just one switch. It's a set of tools, and real systems pick the ones that match their traffic patterns and tolerance for delay, loss, or inconsistency.
So now that you know what these methods are and how they work, let’s look at how to set them up.
OK so the good news up front.
If you're building on AWS, you’re not going to be coding token buckets or queuing algorithms from scratch, because AWS does a lot of the heavy lifting for you. All you have to do is configure the settings in a way that matches your traffic patterns and scaling needs.
The trick is just knowing what to set up and where. So let’s walk through how shaping actually shows up in AWS services, starting from the edge of your application and working inward.
(I’ll also share some tips that help with traffic spikes. They’re not strictly traffic shaping, but they’re still important.)
So let’s break them down.
API Gateway is one of the most common places developers first encounter traffic shaping in AWS, especially if you’re exposing REST or HTTP APIs to the public. Its throttling is a direct implementation of the token bucket model.
Here’s how it works: each stage or method gets a steady-state rate limit and a burst limit. Requests draw from a token bucket that refills at the rate limit, and the burst limit is the bucket’s capacity. Once the bucket runs dry, extra requests are rejected with a 429 Too Many Requests error.

It’s a great way to allow some flexibility without letting your API get hammered indefinitely.
You can find this setting in the API Gateway settings under Stage → Throttle Settings. You can set limits globally or per method, and even apply different quotas per API key if you're managing usage plans.
WAF helps shape traffic by detecting and acting on patterns and blocking malicious requests. While not a RED (Random Early Detection) algorithm in a strict sense, its intent is similar: to reduce the chance of system overload by acting early on aggressive traffic patterns.
It does this by allowing you to set rate-based rules, which block or throttle clients that exceed a request threshold within a rolling time window.
Instead of letting all traffic in and seeing what breaks, WAF starts filtering aggressive behavior before it hits your infrastructure.
Where to configure it
WAF rules are managed in the AWS WAF console.
You attach them to API Gateway, CloudFront, or ALB resources, and you can also combine rate-based rules with IP blocks, header inspection, or geo-matching to get even more granular.
Load balancers don’t do shaping directly in the way API Gateway does, but they help smooth out traffic by handling bursts and routing intelligently.
The two key features are:
These aren’t shaping mechanisms in the algorithmic sense, but they achieve similar outcomes by pacing how traffic is distributed across your system.
Where to configure it:
I know I said scaling doesn’t solve all issues and that we need traffic shaping. However, we shouldn’t skip scaling either, because the two work better together.
If shaping slows things down and protects your backend, Auto Scaling adds capacity so you don’t have to throttle forever. The key is tuning it to react fast enough without over-provisioning and burning budget.
Two features worth noting:
Where to configure it
You can set this up inside Auto Scaling Group settings, or using CloudWatch alarms tied to your shaping metrics (e.g., throttled requests, queue depth, response time).
AWS Shield (especially Shield Advanced) is built to protect against malicious DDoS attacks and large-scale floods that are designed to overwhelm your system completely.
Shield also works with WAF and other edge services to block or absorb that kind of traffic.
So although Shield doesn’t shape traffic in the burst-handling sense, it does ensure that bad traffic never makes it far enough to need shaping in the first place.
Where to configure it
Shield Standard is automatic, but Shield Advanced is managed via the AWS Shield console and requires activation per resource.
So which service should you use? It depends on where you’re shaping and what you’re protecting.
The key to remember is that no one service handles it all.
But together, these tools let you create a layered approach to shaping from the first packet that hits your endpoint, all the way through to how your backend scales and recovers.
Once you start implementing traffic shaping — especially in cloud environments — it’s easy to think you’re covered just because you’ve set a rate limit or added a WAF rule.
But shaping is a control system, and like any control system, it’s easy to misconfigure. Sometimes it works too aggressively. Other times, it doesn’t kick in soon enough.
Let’s walk through the most common mistakes people make and how to avoid them.
This is probably the biggest one.
Many teams pick arbitrary rate limits because they need something in place, but without real usage data, those numbers are either too low (blocking real users) or too high (not protecting anything).
For example
You might set an API Gateway rate limit at 1,000 requests per second, but if 80% of your traffic happens in 5-minute surges, you’ll throttle users unnecessarily. Or worse, your backend might still get overwhelmed because the shaping wasn’t aggressive enough during the actual surge.
How to avoid it
Use CloudWatch, X-Ray, or whatever metrics system you have to track your real request rates, burst windows, and throttle or error responses.
Then shape around your real patterns, not just theoretical limits.
It’s tempting to add rate limiting at API Gateway and assume you’re safe. But shaping only works if it’s aligned across your system.
For example
If API Gateway lets 1,000 requests/sec through but your backend database can only handle 200 writes/sec, you’re going to bottleneck fast, even if shaping at the front door looks fine.
Or maybe you shaped traffic at the load balancer, but didn’t consider that your worker pool or function concurrency limit can’t handle the load once requests get inside.
How to avoid it
Make sure shaping limits match downstream capacity. Know where your bottlenecks are. If your backend needs a leaky bucket pattern, your frontend shouldn’t behave like a firehose.
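To see why the mismatch bites, here’s a quick back-of-the-envelope check using the hypothetical numbers from the example above (1,000 requests/sec admitted, 200 writes/sec drained):

```python
# Hypothetical numbers: what the front door admits vs. what the
# backend can actually drain.
inbound_rps = 1000
backend_rps = 200

backlog_growth = inbound_rps - backend_rps   # backlog grows by 800 req/sec
seconds_to_100k = 100_000 / backlog_growth   # ~2 minutes to a 100k backlog
print(backlog_growth, seconds_to_100k)
```

In other words, a front-door limit that outruns the backend by 5x buys you only a couple of minutes before the queue is effectively unrecoverable.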
It’s easy to go too far the other way and be so cautious that you block real users during normal usage spikes.
This happens a lot with:
How to avoid it
Test shaping policies under simulated user flows, and use synthetic load testing tools or replay production traffic in staging to see what breaks. You can always start strict and loosen it gradually once you understand the impact.
Auto Scaling is powerful, but it’s reactive in that it only kicks in after load increases. If your system starts to suffer before scaling catches up, your users will still feel the pain.
Worse, if you're shaping poorly, Auto Scaling can get triggered too often, leading to capacity that flaps up and down and a cloud bill that climbs with it.
How to avoid it
Set up all the shaping methods we’ve talked about in this guide to buy time for scaling to work. Think of shaping as your first line of defense, not a replacement for capacity.
You can’t fix what you can’t see. The problem, of course, is that most teams enable shaping but don’t log what it’s actually doing. As a result, users start complaining about timeouts or broken features, and you have no idea that poorly configured shaping was the cause.
How to avoid it
Always log shaping events, especially when requests are dropped or delayed. Monitor 429 responses, WAF blocks, and queue overflow events. Build alerts for unusual patterns so you’re not flying blind.
Just because you drop a request doesn’t mean it’s gone. Many clients automatically retry when they see a 429 or a timeout. So if you’re not careful, this can make the traffic surge worse, because you get a flood of retries that stack on top of the original load.
How to avoid it
Make sure clients spread their retries out instead of hammering the server in lockstep, and consider returning a Retry-After header so well-behaved clients know when to try again.
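The standard fix is exponential backoff with jitter: each retry waits longer than the last, with a random component so clients don’t all retry at the same instant. A minimal Python sketch (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with 'full jitter': the delay before retry
    number `attempt` is a random value up to min(cap, base * 2**attempt),
    so retries spread out instead of stacking into a synchronized surge."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# delays grow (on average) with each failed attempt, but never in lockstep
delays = [round(backoff_delay(n), 2) for n in range(5)]
```

Pairing this on the client side with a Retry-After header on the server side keeps retry traffic from amplifying the original spike.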
Traffic shaping isn’t a “set it and forget it” thing. As your system grows, user behavior changes, and new features roll out, your shaping strategy needs to evolve too.
The limits that worked for your MVP won’t hold up under production load a year later. And the edge cases that weren’t a problem before, such as DDoS attempts or client-side bugs, might suddenly break everything.
How to avoid it
Treat shaping limits like infrastructure maintenance and revisit them periodically. Bake shaping validation into performance tests and chaos engineering drills. Keep it part of your review process and not just your launch checklist.
So as you can see, traffic shaping helps you stay ahead of outages, cut unnecessary costs, and keep critical services running smoothly, even when traffic surges.
It’s trickier to set up manually, which is one more reason to consider migrating to the cloud if you haven’t already. AWS gives you the tools — you just need to configure them based on how your system actually behaves.
So take the next step: check your traffic patterns, set some shaping rules, and test how your system responds. The sooner you shape your traffic, the less likely you are to be caught off guard.
Try it now, before your next spike makes the decision for you.
Don’t forget - if you want to learn more about how to best work with AWS, then check out my AWS Certified Cloud Practitioner course, my AWS Certified Solutions Architect Bootcamp or my portfolio project on How to build an end-to-end web app with AWS.
Remember - all these courses (and more) are included in a single ZTM membership.
Plus, once you join, you'll have the opportunity to ask questions in our private Discord community from me, other students and other working tech professionals, as well as access to every other course in our library!