Hold Up! The Importance of Throttling in Service Oriented Architecture


When I first started at Qualtrics, the engineering organization was in the midst of rearchitecting our product from a monolithic application to a microservice oriented architecture. Along the way we learned a lot about designing, maintaining, and operating services. Throttling has perhaps been the most memorable principle I learned.

Now, throttling isn't all that exciting. It's not memorable for me because it was exciting. It is memorable for me because of how many times we had to relearn its importance and fix the same mistakes over... and over. Seriously, I can't even count how many times we had an incident because an internal client wasn't handling throttles properly. Even when they were fixed, they were often only fixed for one endpoint and we would still have incidents because the client was still not handling throttles on other endpoints.

What is Throttling?

Throttling is simply an action your service can take to tell a client that it can't service their request at this time due to high load, or because the client has exceeded their allotted request quota.

In HTTP services a throttle is communicated by returning a 429 Too Many Requests status code. This tells the client that either they are sending too many requests, or that the server has too many requests to service in general. A Retry-After header can also be returned by the server so the client knows when it is safe to retry the request.
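To make that concrete, here is a minimal sketch of what a throttled response could contain. The function name `throttle_response` and the plain-text body are my own choices for illustration, not something from our services; only the 429 status and the Retry-After header come from HTTP itself.

```python
from http import HTTPStatus
from typing import Dict, Tuple


def throttle_response(retry_after_seconds: int) -> Tuple[int, Dict[str, str], bytes]:
    """Build a (status, headers, body) triple for a throttled request.

    Hypothetical helper: a real service would hand these values to its
    web framework instead of returning a tuple.
    """
    status = HTTPStatus.TOO_MANY_REQUESTS  # 429
    headers = {
        # Retry-After given as a delay in whole seconds.
        "Retry-After": str(retry_after_seconds),
        "Content-Type": "text/plain; charset=utf-8",
    }
    body = f"{status.value} {status.phrase}\n".encode("utf-8")
    return status.value, headers, body
```

Calling `throttle_response(5)` yields a 429 whose Retry-After header asks the client to wait five seconds before trying again.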

Sounds simple enough, right? So what's the deal? The deal was that we first had to learn to protect our service with throttling, and then we had to learn that each client must respect those throttles. And we had to learn this multiple times.

Enemies Within

Your most abusive customers are often internal. This was certainly the case for us. It isn't intentional; it usually comes from ignorance or negligence.

A top cause of incidents in the early days of managing a service oriented architecture was what amounted to an internal DDoS attack from another service within our own data center. Sometimes the cause was general high load that our service was not prepared for, and sometimes it was erroneous or abusive access patterns.

Story Time:

We had an internal client who integrated with our service to provide automated importing of customers' data. To test their integration they set up a periodic job that would export a dataset, then reimport it, and then check that all the data was imported.

They wanted to export and import only the original set of data, but instead they exported the entire set of data, every time. This meant that if their starting dataset was 10 records, their first request would import 10 records. After 5 requests, the dataset would have grown to 320 records, and the 6th request would export and import all those records again. After 20 requests the dataset would have 10,485,760 records, which were exported from, and then reimported to, our service. This exponential growth quickly knocked our service over.
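If you want to check the arithmetic, the growth is just repeated doubling. The sketch below assumes each export-and-reimport round exactly doubles the dataset; the function name `records_after` is mine and has nothing to do with the client's actual job.

```python
def records_after(initial: int, rounds: int) -> int:
    """Dataset size after `rounds` export-and-reimport cycles.

    Assumes every cycle re-imports the full dataset, doubling its size.
    """
    if initial < 0 or rounds < 0:
        raise ValueError("initial and rounds must be non-negative")
    return initial * 2 ** rounds
```

With 10 starting records, `records_after(10, 5)` gives the 320 records mentioned above, and every further round doubles the amount of data our service had to export and import.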

We had a request size limit, but that wasn't enough. The client already split their operation into multiple requests to respect the size limit. What we also needed was a limit on the number of requests per second to truly protect ourselves.

Defensive Limits

The responsibility to ensure that our service is not being abused falls not to the clients of the service, but to us as the developers. We often don't have direct control over clients, so it's our responsibility to ensure proper limits are enforced for each client. You need to code and design as if your internal clients were actually external clients that might try to take advantage of you. They will use your API in ways you did not expect. It's your job to make sure that doesn't take your service down.
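One common way to enforce a requests-per-second limit per client is a token bucket. The sketch below is not how our service was implemented; the class names, the in-memory dictionary, and the injectable `clock` parameter are all assumptions made to keep the example small and testable.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        # Refill based on elapsed time, never exceeding capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should answer with a 429


class ClientRateLimiter:
    """Keeps one bucket per client so one noisy client cannot starve the rest."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

A request handler would call `limiter.allow(client_id)` first and return a 429 whenever it comes back `False`. In a multi-instance service the buckets would need to live in shared storage, which this sketch leaves out.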

Improve Resiliency with Retries

Software solution resilience refers to the ability of a solution to absorb the impact of a problem in one or more parts of a system, while continuing to provide an acceptable service level to the business. --Samir Nassar, IBM

While throttles can help protect your service, making throttles a part of your design throughout your system makes for a more resilient system.

Here's a real life example of how we were able to improve the resiliency of our system with throttling. We have a data pipeline that pulls data from DynamoDB, transforms that data, and then loads that data into a Dashboard for our customers. Pretty standard ETL stuff. Below is a diagram of what this typically looks like.


Nothing in this system was designed to handle throttles. DynamoDB issues throttles when you exceed your provisioned throughput. When we started to reach our provisioned limits we started dropping data loading jobs because the Raw Data Service would get throttled by DynamoDB and it in turn would return a 500 to the Data Transform Service.

[Figure: DynamoDB throttle in the data pipeline (DDB_Throttle.png)]

Because our DynamoDB table was autoscaled we usually only had to slow down our requests for a short time until the table scaled up to handle more throughput. To fix this we bubbled the throttle up through the Raw Data Service as a 429 status code. But now we also had to refactor the Data Transform Service to handle 429 status codes, otherwise it would just drop the job as before. We refactored the Data Transform Service to handle 429 status codes by retrying those requests after a short wait.
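The first half of that fix, bubbling the throttle up instead of turning it into a 500, could look roughly like the sketch below. Everything here is hypothetical: `StorageThrottled` stands in for whatever error your storage client raises when throughput is exceeded (with DynamoDB that would correspond to a ProvisionedThroughputExceededException), and `handle_read`, `fetch`, and the one-second Retry-After are names and values I picked for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class StorageThrottled(Exception):
    """Hypothetical error: the backing store rejected the call for exceeding throughput."""


def handle_read(fetch: Callable[[str], bytes],
                key: str) -> Tuple[int, Dict[str, str], Optional[bytes]]:
    """Return (status, headers, body) for a read, preserving throttles as 429s."""
    try:
        return 200, {}, fetch(key)
    except StorageThrottled:
        # Pass the throttle upstream so the caller knows to slow down and retry.
        # The 1-second hint is an assumed value, not a recommendation.
        return 429, {"Retry-After": "1"}, None
    except Exception:
        # Anything else is still a genuine server error.
        return 500, {}, None
```

The important design choice is that the throttle keeps its meaning as it crosses the service boundary: the caller can tell "slow down" apart from "something broke".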

After these refactors, our pipeline became more resilient. We could start a data loading job and we knew that if our DynamoDB table needed time to scale up to the demand, our service could handle it without dropping the job.

So which scenario sounds better?

  1. I have some data going through a data pipeline on its way to be populated in a dashboard. At some point in the pipeline there is a service that is not able to handle a request because of high load. The request fails or times out and the data is not able to move forward in the pipeline and is ultimately unavailable to the end user. This may require manual remediation, if it is even remediated at all.
  2. Same scenario but this time that service is issuing 429s when it is receiving too many requests. Now instead of entering a failed state due to a timeout or 500, we instead receive a 429 and then wait and retry our request. While the data ingestion is delayed, it does complete and doesn't require manual remediation.

Scenario 2 is usually the better customer experience. I would much rather my data make it into a dashboard with a delay than not have it at all, or have to wait for an engineer to fix it.

But how do we know when to retry? If we retry right away we will likely receive another 429. Below are two of the most common approaches to timing a retry.

Retry using "Retry-After"

A Retry-After header can be returned by the server. This header tells the client exactly when they can retry again. Its advantage is that it lets the server better control the traffic it is receiving. But not all servers return this header with a 429, or have a good way to calculate that for you. So how long should you wait then?
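On the client side, honoring the header mostly means parsing it. HTTP allows the value to be either a number of seconds or an HTTP date, so a client has to handle both. Here is a rough sketch using only the Python standard library; the function name `retry_after_seconds` and the choice to return `None` for a missing or unparseable header are my own assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value into a wait time in seconds.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to its own backoff strategy.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    # A date in the past means we can retry immediately.
    return max(0.0, (when - now).total_seconds())
```

The `now` parameter exists only so the function can be exercised with a fixed clock; normal callers would leave it out.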

Retry with exponential backoff

Exponential backoff, the backbone of network congestion control, can be used at the application level to make our system more resilient.

We could simply retry after a fixed period of time, and in some cases this might be acceptable. The disadvantage of this approach is that it can start to congest our server during high load as more and more requests start to retry at the same interval. A better solution would be to wait longer and longer between retries.

One way to do this is to wait twice as long as the previous retry, and to keep increasing our backoff exponentially. Generally we will cap this backoff to a maximum value, and add some jitter to decrease the odds of us sending retries at the same time as other requests. This approach can help our server perform better and recover faster because it doesn't have to dedicate as many resources to answering requests it can't currently service, since our requests are waiting longer and longer to retry.
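Here is a small sketch of that delay calculation. It uses one common jitter variant, where the wait is picked uniformly between zero and the capped exponential value. The function name `backoff_delay` and the default `base` and `cap` values are assumptions for illustration, not numbers we tuned.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt (base, 2*base, 4*base, ...) until
    it reaches `cap`; the actual delay is a random fraction of that ceiling
    so that many clients do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

With the defaults, the ceiling goes 0.5s, 1s, 2s, 4s and so on, and stops growing at 30s. The `rng` parameter is only there so the jitter can be replaced with a fixed value in tests.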

Retrying 500s

I've talked about retrying 429s here, but you can retry several different kinds of replies. I would argue that 500 Internal Server Error status codes are another one you should consider retrying with exponential backoff. A 500 often means that the server is temporarily in an error state and retrying after a short wait is often successful. This isn't always the case but it is something you will want to consider doing to make your system more resilient.
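Putting it together, a retry loop that treats both 429 and 500 as retryable might look like the sketch below. The names `call_with_retries`, `send`, and `RETRYABLE_STATUSES` are hypothetical, the default delays are arbitrary, and the jitter is left out to keep it short. It also assumes the request is safe to repeat, which is worth checking before retrying anything that writes data.

```python
import time
from typing import Callable, Tuple

# Assumption: which statuses are worth retrying depends on the endpoint.
RETRYABLE_STATUSES = frozenset({429, 500})


def call_with_retries(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 4,
    delay_for: Callable[[int], float] = lambda attempt: min(30.0, 0.5 * 2 ** attempt),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable that performs one request and returns
    (status_code, body). The last response is returned either way, so the
    caller can still see a final 429 or 500.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay_for(attempt))  # wait longer before each new attempt
    return status, body
```

Capping the number of attempts matters as much as the backoff itself: a client that retries forever is just a slower version of the abusive client from the story above.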


An important part of having a resilient system is making sure your components can respond to one another appropriately. If you have a server under high load it is usually better to have a client slow down its requests instead of the system entering a failed state. To improve your resiliency, consider designing throttling into your system and retry any failures that would be reasonable, such as a 500 status code.