Integration Patterns: Retries and Dead-Letter Queues
Retries and Dead-Letter Queues allow you to make multi-step processes reliable in the face of failure and avoid the Grumpy Customer Problem, but implementing them properly involves a significant amount of toil. Thankfully, LittleHorse helps keep your engineers happy by simplifying this process.
This is the fourth part in a five-part blog series on useful Integration Patterns. This blog series will help you build real-time, responsive applications and microservices that produce predictable results and prevent the Grumpy Customer Problem.
- Saga Transactions
- The Transactional Outbox Pattern
- Queuing and Backpressure
- [This Post] Retries and Dead-Letter Queues
- [Coming soon] Callbacks and External Events
Review-Processing Application
In the previous post, we discussed an application which accepts reviews of a product, and first uses a third-party AI service to detect offensive content before finalizing and accepting the review. The slow response times (and flakiness) of the AI service make asynchronous processing a natural fit for this use-case: once the review is saved into a database or queue, you can immediately respond to the client and analyze the review using the AI service later.
We saw that we can successfully make our API responsive and performant for our users either by using a traditional message queue or event log (such as AWS SQS or Apache Kafka) or by using an orchestrator. However, using an orchestrator has significant advantages in terms of debugging, monitoring, and long-term maintainability.
We haven't yet addressed what happens when the flakey external service fails. In order to avoid the Grumpy Customer Problem, we must periodically retry the request (and subsequent processing) until the service responds successfully.
With Normal Queues
When we discussed the classic message queue architecture in the last post, we left off with a consumer that polled records from a queue, then made the external API call and either rejected or approved the review. In pseudo-Java, naive code might look like:
for (Review reviewToProcess : queue.iterAllEvents()) {
    // Call the flakey third-party AI service to analyze the review.
    ReviewAnalysisResponse reviewAnalysis = thirdPartyService.analyzeReview(reviewToProcess);
    if (reviewAnalysis.containsOffensiveContent()) {
        rejectReview(reviewAnalysis);
    } else {
        acceptReview(reviewAnalysis);
    }
    // Only acknowledge the message once processing has completed.
    queue.acknowledge(reviewToProcess);
}
But what about when the call to analyzeReview() fails (because the service is flakey)? There are a lot of questions we need to consider.
Do we let the exception bubble up, crash our pod, restart it, and then poll the message again (since we never called acknowledge())? This might work if outages are very short and intermittent, but not for sustained downtime of the flakey service. Additionally, what if there's a "poison pill" message that, for whichever reason, cannot be successfully processed (e.g. a deserialization error)?
Furthermore, if we start spamming the system with retried requests, won't we end up making a bad problem worse? Lastly, when do we "give up" on an individual message, and what happens when we do?
In order to harden our app, we will do two things:
- Introduce retries with exponential backoff for failed messages.
- Send reviews to a dead-letter queue when they are fully exhausted.
Exponential backoff is a retry strategy in which the delay between retry attempts increases exponentially, often with random jitter, to reduce contention and prevent synchronized retries. Read more here.
Retries with Backoff
A wise man once said, "if at first you don't succeed, try, try again!" Processing messages from a queue is no different. However, if we naively retry every message immediately upon failure, we risk exacerbating the unavailability of our flakey service by overloading it with many retry requests at once. Exponential Backoff solves this problem by increasing the delay between retry attempts exponentially. This reduces the risk of accidentally DDOS'ing the flakey service. By adding a delay between attempts, we can safely wait until the flakey service comes back to life, and then resume processing.
The first step towards implementing retries with exponential backoff is catching failures and enqueuing them with appropriate delay. Once we catch a failure, our consumer process needs to:
- Calculate the amount of time to wait for the next retry.
- Schedule a message to be delivered at that time.
Some queues have native support for retries; however, implementing exponential backoff is still quite a challenge.
In order to calculate the amount of time to wait for the next retry, we follow the Exponential Backoff formula:
delay = base * multiplier^n
Here, n represents the attempt number. Therefore, for a given Review, we need to know how many times we have already attempted to process it, which requires adding some metadata to the message. This is not the end of the world, but this piece of extra complexity adds one more opening for bugs to appear.
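As a rough illustration (not tied to any particular queue), the delay calculation with a cap and jitter might look like the sketch below. The BackoffCalculator class and its constants are made up for this example, and the attempt number is exactly the metadata we would need to carry on each message.
// A minimal sketch of the backoff calculation; class name, base interval,
// multiplier, and cap are assumptions for illustration only.
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class BackoffCalculator {
    private static final Duration BASE = Duration.ofSeconds(5);
    private static final double MULTIPLIER = 2.0;
    private static final Duration MAX_DELAY = Duration.ofMinutes(15);

    // delay = base * multiplier^n, capped, with random jitter added on top.
    public Duration nextDelay(int attemptNumber) {
        double exponential = BASE.toMillis() * Math.pow(MULTIPLIER, attemptNumber);
        long cappedMs = (long) Math.min(exponential, MAX_DELAY.toMillis());
        long jitterMs = ThreadLocalRandom.current().nextLong(cappedMs / 4 + 1);
        return Duration.ofMillis(cappedMs + jitterMs);
    }
}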
Once we have calculated the next appropriate delivery time for the retried message, we must schedule it for delivery. Some queues, such as AWS SQS, support this natively with "delay queues." However, other systems such as Apache Kafka and AWS Kinesis do not support delayed messages. For these cases, we would have to persist the messages ourselves and use a cronjob to re-publish the messages to the queue at a later date.
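For the native route, and assuming the AWS SDK for Java v2 plus a queue URL and serialized payload supplied by the caller, re-publishing a message with a delay on SQS might look roughly like this (SQS caps per-message delays at 15 minutes, so longer waits still require persisting the message elsewhere):
// A hedged sketch of re-publishing a failed review with a delay, assuming the
// AWS SDK for Java v2; the attempt count would also need to ride along as
// message metadata (not shown here).
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class RetryPublisher {
    private final SqsClient sqs = SqsClient.create();

    public void scheduleRetry(String queueUrl, String reviewJson, int delaySeconds) {
        // SQS supports per-message delays of up to 900 seconds (15 minutes).
        SendMessageRequest request = SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(reviewJson)
                .delaySeconds(Math.min(delaySeconds, 900))
                .build();
        sqs.sendMessage(request);
    }
}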
Introducing the DLQ
Even with retries and exponential backoff, there might be some Reviews that we quite simply cannot process (or the third-party service might be in especially dire straits one day). For these messages, we will use the Dead-Letter Queue pattern.
A Dead Letter Queue (DLQ) is a secondary queue used to store messages that cannot be processed successfully after a predefined number of retry attempts or due to specific errors. It helps isolate problematic messages for later inspection and debugging without disrupting the processing of valid messages.
Once we have failed a certain number of times to process a Review, we can publish a message to a failed-reviews queue (or topic). DLQs can be used purely for debugging and manual intervention purposes; however, applications can also consume from a DLQ and take automated actions on the messages in the queue.
For example, our review-processing application might have an additional consumer which notifies the customer (or the on-call team, or both) that the review wasn't processed.
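Sticking with the pseudo-Java style of the consumer above, a sketch of this could look like the following. The attempt-count metadata, failedReviewsQueue, MAX_ATTEMPTS, scheduleRetry, and notifyCustomer names are hypothetical placeholders for this example, not a real API.
// Give-up logic in the main consumer: route exhausted messages to the DLQ.
void handleFailedAttempt(Review reviewToProcess) {
    if (reviewToProcess.getAttemptCount() >= MAX_ATTEMPTS) {
        // Retries exhausted: park the review on the failed-reviews queue.
        failedReviewsQueue.publish(reviewToProcess);
    } else {
        // Otherwise, schedule another attempt with exponential backoff.
        scheduleRetry(reviewToProcess);
    }
}

// A separate consumer takes automated action on dead-lettered reviews,
// e.g. notifying the customer and/or the on-call team.
for (Review failedReview : failedReviewsQueue.iterAllEvents()) {
    notifyCustomer(failedReview.getUserId(), "We were unable to process your review.");
    failedReviewsQueue.acknowledge(failedReview);
}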
Retrying With an Orchestrator
In the previous section, we saw how retries with exponential backoff allow us to more reliably process our reviews in the face of a flakey third-party service. However, there was a lot of manual toil and additional infrastructure to set up (potentially involving a database and a cronjob, depending on whether our queue service supports delayed messages).
Adding Retries
As you might have guessed, this is much easier to accomplish with a durable workflow orchestrator. As we discussed in the last post, we can model the entire end-to-end process (analyze the review, and either save or reject it) in a single WfSpec, as shown below:
public void reviewProcessorWf(WorkflowThread wf) {
    WfRunVariable userId = wf.declareStr("user-id").searchable();
    WfRunVariable review = wf.declareStr("review");
    WfRunVariable approved = wf.declareBool("approved");

    // Call the flakey API with exponential backoff.
    var reviewResult = wf.execute("analyze-review", review)
            .withRetries(10)
            .withExponentialBackoff(ExponentialBackoffRetryPolicy.newBuilder()
                    .setBaseIntervalMs(5000)
                    .setMultiplier(2.0F)
                    .build());

    approved.assign(reviewResult.jsonPath("$.okay"));

    wf.doIfElse(approved.isEqualTo(true), handler -> {
        handler.execute("notify-customer-approved", userId);
    }, handler -> {
        handler.execute("notify-customer-rejected", userId);
    });
}
With just this block, we defined our retry policy and don't need to worry about calculating the delay, keeping track of how many times a message has been retried, or persisting attempts:
var reviewResult = wf.execute("analyze-review", review)
        .withRetries(10)
        .withExponentialBackoff(ExponentialBackoffRetryPolicy.newBuilder()
                .setBaseIntervalMs(5000)
                .setMultiplier(2.0F)
                .build());
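For completeness, registering this WfSpec with a LittleHorse server follows the same pattern as the LittleHorse quickstarts; the class name and client bootstrapping below are assumptions for this sketch rather than part of the workflow definition above.
// A hedged sketch of registering the WfSpec, modeled on the quickstarts.
import io.littlehorse.sdk.common.config.LHConfig;
import io.littlehorse.sdk.wfsdk.Workflow;
import io.littlehorse.sdk.wfsdk.WorkflowThread;

public class RegisterReviewWorkflow {
    public void reviewProcessorWf(WorkflowThread wf) {
        // ... the WfSpec definition shown above ...
    }

    public static void main(String[] args) {
        LHConfig config = new LHConfig(); // reads LHC_* connection settings from the environment
        RegisterReviewWorkflow definition = new RegisterReviewWorkflow();
        Workflow workflow = Workflow.newWorkflow("review-processor", definition::reviewProcessorWf);
        // Compiles the WfSpec (including the retry policy) and registers it with the server.
        workflow.registerWfSpec(config.getBlockingStub());
    }
}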
Processing the DLQ
Since the workflow orchestrator automatically persists the state of all workflows for debugging purposes, you don't even need a DLQ! You can quite simply search for WfRuns with the ERROR status in the dashboard.
If you want to manually rescue a failed WfRun after the outage is resolved, you can do so with the lhctl rescue command.
In case we must process failed reviews (for example, to notify the customer that their review could not be processed), we can do so using LittleHorse's Failure Handling feature, as follows:
wf.handleError(reviewResult, handler -> {
    handler.execute("notify-customer-review-failed", userId);
    handler.fail("review-not-processed"); // fail the workflow
});
Wrapping Up
Retries and DLQs are almost always necessary if you require reliable message processing. While they can be tedious to implement, LittleHorse allows you to enjoy their benefits without the frustrating (and error-prone) toil.
Get Involved!
Stay tuned for the next post, which will cover how you can write an application that must wait for something external to happen! In the meantime:
- Try out our Quickstarts
- Join us on Slack
- Give us a star on GitHub!