Bite-Sized Serverless


Async Lambda Function Retries with Backoff and Jitter

Lambda - Advanced (300)
Asynchronous Lambda Functions process events without providing direct feedback to the caller. To maximize the chance of successful processing, the Lambda Function can retry the event in case of failure. However, the default failure handler only supports up to two retries and no jitter. In this post we'll implement a custom retry handler with exponential backoff, full jitter, and delays up to 12 hours.
The principles and benefits of exponential backoff and jitter are well described in Marc Brooker's articles Exponential Backoff And Jitter and Timeouts, retries, and backoff with jitter. I strongly suggest you check out those posts, they are great reads.
This article is backed by a CDK project, available on GitHub and as a Zip file at the bottom of this page.

Asynchronous Lambda Functions

Lambda Functions have two invocation types: synchronous (RequestResponse) and asynchronous (Event). When the RequestResponse method is used, the caller executes the Lambda Function directly and waits for it to complete. When the Event method is used, the caller's event is put onto a queue and a success response is returned immediately. The messages on the queue are then sent to the Lambda Function asynchronously.
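To make the distinction concrete, here is a small boto3 sketch of both invocation types. The function name and payload are placeholders, not part of the article's project.

import json

import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"s3_bucket": "my_bucket", "s3_object_key": "demo.png"})

# Synchronous: the caller waits for the function to finish and receives its result.
sync_response = lambda_client.invoke(
    FunctionName="my-function",  # placeholder function name
    InvocationType="RequestResponse",
    Payload=payload,
)

# Asynchronous: the event is queued and Lambda returns a 202 Accepted immediately.
async_response = lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="Event",
    Payload=payload,
)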
Because processing a message can fail, asynchronous Lambda Functions can be configured to retry failed events up to two times. When retries are enabled, a failed message is returned to the queue and reprocessed after one minute. If it fails again, it is retried after an additional two minutes. If it fails a third time, the message is moved to a failure destination.
In some cases, two retries are not enough. An asynchronous Lambda Function might call an external system that is unavailable for a significant amount of time. In this scenario, you want the event to be retried for minutes, maybe even hours after the initial reception. The solution described in this article offers exactly that: retries up to 24 hours, with exponential backoff and jitter, and delays between tries of up to 12 hours.

Message Visibility vs Delay

This solution uses the SQS Visibility Timeout feature to implement exponential backoffs. Visibility timeouts allow us to keep a message on a queue for up to 12 hours before it is made available for processing. Visibility timeouts have much in common with SQS Delays, but there are two distinct differences:
  1. A delay can be set when a message is put onto an SQS Queue, while visibility timeouts can only be set when a message is read from an SQS Queue. (We'll leave default queue delays and timeouts out of scope as they are not relevant to our use case).
  2. A delay can be set to up to 900 seconds (15 minutes), while a visibility timeout can be set to up to 43200 seconds (12 hours).
As the name implies, exponential backoff doubles the wait time between every retry. If the base value (the initial delay) is one second, the retries take place 1, 2, 4, 8, 16, 32 seconds after the previous attempt, and so on. Our solution supports retries for a period of up to 24 hours. Continuing the exponential pattern, these retries should take place 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768 seconds after the previous attempt. This picture changes a bit with jitter, but more about that later. Because SQS only allows delays of up to 900 seconds, the higher values in this range cannot be used. If we capped the backoff at 900 seconds, the retries would take place 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 900, 900, 900, 900, 900, 900 seconds apart, and so on up to 24 hours, which isn't quite exponential. Values up to 32768 seconds easily fit within the 12-hour visibility timeout range, making visibility timeouts the preferred solution.
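To see the difference in numbers, the following sketch prints the retry schedule when the backoff is capped at the delay maximum (900 seconds) versus the visibility timeout maximum (43,200 seconds). The constants and variable names are illustrative.

BASE_BACKOFF = 1  # seconds
MAX_DELAY = 900  # SQS delay cap (15 minutes)
MAX_VISIBILITY_TIMEOUT = 43200  # SQS visibility timeout cap (12 hours)

for retry in range(1, 17):
    exponential = BASE_BACKOFF * 2 ** (retry - 1)
    print(
        f"retry {retry:>2}: "
        f"capped at delay max: {min(exponential, MAX_DELAY):>5}s, "
        f"capped at visibility max: {min(exponential, MAX_VISIBILITY_TIMEOUT):>5}s"
    )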

The retry queue

The first component required for our exponential backoff solution is a failure destination for the async Lambda Function. This destination is configured for 0 retries (we don't want to use the native retry functionality) and an SQS Queue as its target.
# The first queue, used to manage retries
retry_queue = sqs.Queue(scope=self, id="RetryQueue")

# The example async Lambda Function. This function will always fail, so
# messages are always sent to the failure queue.
async_lambda_function = LambdaFunction(
    scope=self,
    construct_id="AsyncLambda",
    code=lambda_.Code.from_asset("lambda_functions/async_lambda"),
)

# Set the retry queue as the failure destination for the Lambda Function.
async_lambda_function.function.configure_async_invoke(
    retry_attempts=0,
    on_failure=lambda_destinations.SqsDestination(queue=retry_queue),
)

The retry handler

A second Lambda Function retrieves the messages as soon as they arrive on the retry queue. This is achieved through a Lambda Event Source Mapping, which continuously polls the queue and synchronously forwards any messages it finds to the Lambda Function. More details about Event Source Mappings can be found in the Bite "Filter DynamoDB Event Streams Sent to Lambda".
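As a rough sketch, wiring the Retry Handler to the retry queue in CDK could look like the snippet below. It assumes a retry_handler instance of the same LambdaFunction construct used earlier; the batch size is an assumption, and the 1-second batch window matches the configuration mentioned later in this article.

from aws_cdk import Duration
from aws_cdk import aws_lambda_event_sources as lambda_event_sources

# Poll the retry queue and forward batches of messages to the Retry Handler.
retry_handler.function.add_event_source(
    lambda_event_sources.SqsEventSource(
        retry_queue,
        batch_size=10,
        max_batching_window=Duration.seconds(1),
        report_batch_item_failures=True,  # enable partial batch failure reporting
    )
)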
When the message is first received by the Retry Handler, no delay or backoff has been applied. The message should thus be returned to the queue with a visibility timeout matching the exponential backoff: 1 second for the first retry attempt, 2 seconds for the second attempt, 4 seconds for the third attempt, and so on.
# Example 1: base_backoff = 1, event_retry_count = 1
# visibility_timeout = 1 * 2 ^ (1-1) = 1 * 2^0 = 1 * 1 = 1 second
visibility_timeout = event_base_backoff * 2 ** (event_retry_count - 1)

# Change the visibility of the SQS message
sqs_client.change_message_visibility(
    QueueUrl=retry_queue_url,
    ReceiptHandle=record["receiptHandle"],
    VisibilityTimeout=visibility_timeout,
)
When the message visibility has been changed, the event should be returned to the queue. To achieve this, the Lambda Function uses the Event Source Mapping Partial Batch Failure Reporting feature. When a message is reported as a failure, it is returned to the queue for reprocessing at a later time (as determined by the visibility timeout above).
returned_message_ids = []

for sqs_record in event["Records"]:
    try:
        handle_record(sqs_record)
    except InitialReceiveError:
        returned_message_ids.append(sqs_record["messageId"])

# Report messages for which the timeout has changed as failures,
# so that they are put back onto the queue.
return {
    "batchItemFailures": [
        {"itemIdentifier": identifier} for identifier in returned_message_ids
    ]
}
When the event is received by the Retry Handler a second time, we know the visibility timeout (the backoff) has expired, so the event can be sent to the asynchronous Lambda Function again.
# Retrieve the ApproximateReceiveCount from the SQS Retry Queue
sqs_approximate_receive_count = int(record["attributes"]["ApproximateReceiveCount"])

# If ApproximateReceiveCount is equal or lower than 1 (defensive), this is the
# first time we fetched it from SQS, and it should be put back with a
# visibility timeout
first_receive_from_sqs = sqs_approximate_receive_count <= 1

if first_receive_from_sqs:
    # Return the message to the queue with the required backoff
    return_sqs_message_with_backoff(record, event_retry_count)
    raise InitialReceiveError()
else:
    # Update the payload and call the Lambda Function again
    retry_lambda_execution(lambda_function_payload, original_event_timestamp)
We use the ApproximateReceiveCount to determine if the visibility timeout has been applied and expired. The fact that the receive count is approximate is not a problem. It will be correct most of the time, but when it's too low (1 instead of 2), the event will simply be backed off an additional time. When it's too high (2+ instead of 1), the event will immediately be retried, which is no disaster either.
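The snippets in this article don't show retry_lambda_execution itself. A minimal sketch of what it might do, assuming it wraps the original payload in the retry envelope described in the next section and re-invokes the async function, could look like this. The environment variable name is an assumption, and event_retry_count is assumed to be computed earlier in the handler, as in the previous snippet.

import json
import os

import boto3

lambda_client = boto3.client("lambda")

def retry_lambda_execution(lambda_function_payload, original_event_timestamp):
    # Unwrap an existing envelope so envelopes don't get nested on later retries.
    original_payload = lambda_function_payload.get(
        "_original_payload", lambda_function_payload
    )

    # Wrap the payload in a fresh retry envelope. event_retry_count is assumed
    # to be determined earlier in the handler (see the previous snippet).
    envelope = {
        "_retry_metadata": {
            "attempt": event_retry_count,
            "initial_timestamp": original_event_timestamp,
        },
        "_original_payload": original_payload,
    }

    # Re-invoke the asynchronous Lambda Function with the Event invocation type.
    lambda_client.invoke(
        FunctionName=os.environ["ASYNC_FUNCTION_NAME"],  # assumed configuration
        InvocationType="Event",
        Payload=json.dumps(envelope),
    )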
The interaction between the Retry Queue and Retry Handler is described in the following flowchart.

Keeping track of retries

The Retry Handler needs to know how many retries have been attempted to determine the duration of the backoff. There is no obvious way for the Retry Queue or Retry Handler to determine the retry attempt count, so we need a custom solution: the retry envelope. This envelope wraps the original event and adds additional retry information. If the original event looks like this:
{
    "s3_bucket": "my_bucket",
    "s3_object_key": "demo.png"
}
Then the payload after the first retry, put into a retry envelope, looks like this:
{
    "_retry_metadata": {
        "attempt": 2,
        "initial_timestamp": 1643667670
    },
    "_original_payload": {
        "s3_bucket": "my_bucket",
        "s3_object_key": "demo.png"
    }
}
Both the asynchronous Lambda Function and Retry Handler need a bit of additional logic to unwrap the envelope if it is present. In the async function:
def event_handler(event, _context):
    original_payload = event
    if "_retry_metadata" in event:
        original_payload = event["_original_payload"]
And in the Retry Handler:
1if "_retry_metadata" not in lambda_function_payload: 2 # If `_retry_metadata` is not present, this is the first time the 3 # Lambda Function failed, and we're currently processing the first 4 # retry. 5 event_retry_count = 1 6 original_event_timestamp = int( 7 datetime.timestamp( 8 datetime.strptime(sqs_body["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ") 9 ) 10 ) 11else: 12 # If `_retry_metadata` is present, this is the second or later time the 13 # Lambda Function failed. We can get the retry count from the metadata. 14 event_retry_count = lambda_function_payload["_retry_metadata"]["attempt"] + 1 15 original_event_timestamp = lambda_function_payload["_retry_metadata"][ 16 "initial_timestamp" 17 ]
The values in the retry envelope allow us to determine whether MAX_AGE or MAX_RETRIES have been exceeded, and they allow us to determine the backoff before the next retry. This completes the implementation of the 24-hour retry solution. Our flowchart, updated with these additional steps, looks like this:

Caveats

Jitter and processing delays

The backoff mechanism implemented in this solution applies what is known as "full jitter".
visibility_timeout = event_base_backoff * 2 ** (event_retry_count - 1)

# Add jitter by selecting a random value between the base backoff (generally 1)
# and the visibility timeout generated above.
timeout_with_jitter = round(random.uniform(event_base_backoff, visibility_timeout))
Jitter adds a random component to the exponential backoff. Instead of applying the exact exponential backoff value, it applies a random value between the base backoff (1 second) and the exponential value. The delay for the first round will be 1 second, for the second round either 1 or 2 seconds, for the third round 1, 2, 3, or 4 seconds, and so on. In almost every case, the applied delay will be lower than the exponential backoff value. The random factor distributes the retries over time, which reduces the load on the target system while never exceeding the exponential backoff delay.
On the other hand, our solution introduces several delays, including cold starts and processing times. The most significant delays are introduced by the two Queue-to-Lambda-Function integrations, as marked by (A) and (B) in the diagram below.
The Event Source Mapping polls the queue for a configured amount of time (B). We configured the batch window at 1 second, but the documentation states:
If you're using a batch window and your SQS queue contains very low traffic, Lambda might wait for up to 20 seconds before invoking your function. This is true even if you set a batch window lower than 20 seconds.
In effect, the low-range backoffs like 1, 2 and 4 seconds might take a lot longer, especially in low-volume scenarios. Benchmarks showed that in low-volume use cases, the actual delay between retries was about 16 seconds longer than the visibility timeout. The medium and high backoff ranges don't suffer from this problem too much, because the additional delays are generally compensated for by the jitter timeout reduction.
However, if you're looking for evenly distributed backoffs or reliable low-latency backoffs, this is not the solution for you. Instead, take a look at Step Functions.

In-Flight Messages per Standard Queue

The maximum number of in-flight messages (messages that have been received from the queue but not yet deleted, so they are temporarily invisible) is 120,000 per standard queue. This limit cannot be increased. If your use case has higher volume requirements, consider using parallel queues, or use Step Functions, which supports 1,000,000 open executions by default; that limit can be increased into the millions.

Multi-day executions

The implementation in this article is limited to retries up to 24 hours. It can be increased up to 14 days, which is the SQS maximum message retention period. Keep in mind that the maximum retry delay is still 12 hours, which means the delay between retries after day one will generally be 12 hours. If you require retries for a longer period than 14 days, look at Step Functions instead, which can run up to one year.
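If you do extend the retry window beyond 24 hours while staying within SQS limits, the retry queue's message retention period has to cover the whole window. A minimal CDK sketch, with the retention set to the SQS maximum as an example value:

from aws_cdk import Duration
from aws_cdk import aws_sqs as sqs

# Keep messages on the retry queue long enough to cover the full retry window.
retry_queue = sqs.Queue(
    scope=self,
    id="RetryQueue",
    retention_period=Duration.days(14),  # SQS maximum retention
)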

Conclusion

This Bite describes a reliable, cost-effective way to implement retries of up to 24 hours. The chart above shows many retries in the first minute, followed by a fairly even spread over the following 24 hours. The solution is not very precise, but that shouldn't be an issue for a retry mechanism. If you need more precision or delays longer than 12 hours, look at Step Functions instead.
The retry envelope adds a bit of complexity to the source async function. As a result, the exponential backoff solution is not a drop-in replacement for the basic retry mechanism. Depending on your use case, this might be a small price to pay for a significant reliability improvement.
Overall, retries with exponential backoff and jitter can significantly improve the reliability of your asynchronous Lambda Functions. SQS provides the foundation for a reliable retry solution, but you should be aware of its limits and how they apply to your use cases.

CDK Project

The services and code described in this Bite are available as a Python AWS Cloud Development Kit (CDK) project. Within the project, execute cdk synth to generate the CloudFormation templates, then deploy them to your AWS account with cdk deploy. For your convenience, ready-to-use CloudFormation templates are also available in the cdk.out folder. For further instructions on how to use the CDK, see Getting started with the AWS CDK.

Click the Download button below for a Zip file containing the project.