Introduction: Why Messages Fail and Why You Need a Safety Net
In any distributed system built on message queues, failure is not a question of if but when. A downstream service goes down, a message payload is malformed, a Lambda function hits a timeout — and suddenly messages start bouncing. Without a dead-letter queue (DLQ), those messages silently disappear after exhausting their receive attempts. You lose data, you lose visibility, and you lose sleep.
Amazon SQS Dead-Letter Queues solve this by acting as a parking lot for messages that can’t be processed successfully. Instead of vanishing, failed messages land in a separate queue where you can inspect them, fix the underlying issue, and replay them back to the source queue — all without data loss.
This article is for intermediate AWS developers and DevOps engineers who already understand the basics of Amazon SQS and want to implement a robust failure-handling pattern. We’ll cover configuration, monitoring, replay strategies, and the real-world gotchas that documentation glosses over.
Prerequisites
- An AWS account with permissions to create and manage SQS queues, CloudWatch alarms, and IAM policies
- AWS CLI v2 installed and configured (`aws configure`)
- Basic familiarity with Amazon SQS concepts: queues, messages, visibility timeout, receive count
- Python 3.9+ with `boto3` installed (for code examples)
Understanding the Dead-Letter Queue Lifecycle
Before writing any code, let’s be precise about how DLQs work in SQS. A dead-letter queue is just a standard SQS queue (or FIFO queue, if the source is FIFO) that you designate as the target for messages that fail processing. The mechanism is governed by a redrive policy attached to the source queue.
Here’s the lifecycle of a failing message:
- Step 1: A consumer receives a message from the source queue. The message becomes invisible for the duration of the `VisibilityTimeout`.
- Step 2: The consumer fails to process the message and does not delete it. After the visibility timeout expires, the message reappears in the queue.
- Step 3: SQS increments the message's `ApproximateReceiveCount` attribute on each receive.
- Step 4: Steps 1-3 repeat until `ApproximateReceiveCount` exceeds the `maxReceiveCount` defined in the redrive policy.
- Step 5: SQS moves the message to the dead-letter queue. The message retains its original body, attributes, and message ID.
Key detail: The message’s retention period in the DLQ is based on the DLQ’s MessageRetentionPeriod, but the clock started when the message was originally sent to the source queue. If your source queue has a 4-day retention and the message spent 3 days bouncing, it only has 1 day left in the DLQ — unless your DLQ has a longer retention period. Always set your DLQ’s retention period to 14 days (the maximum) to give yourself enough time to investigate.
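To watch this lifecycle play out, here is a minimal consumer sketch. It reuses the example queue URL from this article, and the handler fails on purpose so the message is never deleted and eventually lands in the DLQ:
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing'  # example URL

def process(body):
    # Deliberately fail so the message is never deleted
    raise ValueError("simulated processing failure")

while True:
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # Long polling
        AttributeNames=['ApproximateReceiveCount']
    )
    for message in response.get('Messages', []):
        count = message['Attributes']['ApproximateReceiveCount']
        print(f"Receive attempt {count} for message {message['MessageId']}")
        try:
            process(message['Body'])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])
        except ValueError:
            # Do not delete: the message reappears after the visibility timeout,
            # and once its receive count exceeds maxReceiveCount, SQS moves it to the DLQ
            pass
Each pass through the loop prints a higher receive count until the redrive policy kicks in.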
Step-by-Step: Creating the Source Queue and Dead-Letter Queue
Step 1: Create the Dead-Letter Queue
You must create the DLQ before the source queue (or at least before attaching the redrive policy). The DLQ must be the same type as the source — standard DLQ for standard queues, FIFO DLQ for FIFO queues, and both must be in the same AWS account and region.
# Create the dead-letter queue with 14-day retention
aws sqs create-queue \
--queue-name orders-processing-dlq \
--attributes '{
"MessageRetentionPeriod": "1209600",
"VisibilityTimeout": "30"
}'
# Note the QueueUrl in the output — you'll need the ARN next
# Get the DLQ ARN
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing-dlq \
--attribute-names QueueArn
Step 2: Create the Source Queue with a Redrive Policy
# Create the source queue with a redrive policy pointing to the DLQ
aws sqs create-queue \
--queue-name orders-processing \
--attributes '{
"MessageRetentionPeriod": "345600",
"VisibilityTimeout": "60",
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:orders-processing-dlq\",\"maxReceiveCount\":\"3\"}"
}'
The RedrivePolicy has two fields:
| Parameter | Description | Recommended Value |
|---|---|---|
| `deadLetterTargetArn` | The ARN of the DLQ | Your DLQ's ARN |
| `maxReceiveCount` | Number of times a message can be received before being sent to the DLQ | 3-5 for most workloads |
Choosing maxReceiveCount: Setting this to 1 means any single failure sends the message to the DLQ — too aggressive for transient errors. Setting it to 10+ means a poison pill message will clog your queue for an extended period. For most workloads, 3-5 is the sweet spot.
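If the queues already exist, the same redrive policy can be attached or updated after the fact with boto3. A brief sketch using the example queue names from above:
import boto3
import json

sqs = boto3.client('sqs', region_name='us-east-1')

source_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing'
dlq_arn = 'arn:aws:sqs:us-east-1:123456789012:orders-processing-dlq'

# Attach (or update) the redrive policy on the existing source queue
sqs.set_queue_attributes(
    QueueUrl=source_url,
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': '3'
        })
    }
)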
Step 3: Verify the Configuration
# Verify the redrive policy is set
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing \
--attribute-names RedrivePolicy
# Verify which source queues are feeding into your DLQ
aws sqs list-dead-letter-source-queues \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing-dlq
Monitoring Your DLQ with CloudWatch Alarms
A DLQ is useless if nobody knows messages are landing in it. The most important metric to monitor is ApproximateNumberOfMessagesVisible on the DLQ. If this goes above zero, something is wrong and you need to investigate.
# Create a CloudWatch alarm that triggers when any message appears in the DLQ
aws cloudwatch put-metric-alarm \
--alarm-name "orders-dlq-messages-alarm" \
--alarm-description "Alert when messages appear in orders DLQ" \
--namespace "AWS/SQS" \
--metric-name "ApproximateNumberOfMessagesVisible" \
--dimensions Name=QueueName,Value=orders-processing-dlq \
--statistic Sum \
--period 60 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--treat-missing-data notBreaching
Why treat-missing-data notBreaching? SQS only publishes metrics when there’s activity. If no data points exist, you don’t want the alarm flapping to INSUFFICIENT_DATA and back. Treating missing data as “not breaching” keeps the alarm in OK state when the queue is idle.
For production workloads, I also recommend a second alarm on ApproximateNumberOfMessagesVisible with a higher threshold (e.g., 100) using a different SNS topic that pages on-call engineers. A handful of DLQ messages might be normal; a flood means something is fundamentally broken.
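A sketch of that second alarm using boto3 (the alarm name, threshold, and paging topic ARN here are illustrative):
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Higher-threshold alarm that pages on-call instead of just notifying
cloudwatch.put_metric_alarm(
    AlarmName='orders-dlq-flood-alarm',
    AlarmDescription='Page on-call when the orders DLQ backs up significantly',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'orders-processing-dlq'}],
    Statistic='Maximum',
    Period=300,
    Threshold=100,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    EvaluationPeriods=1,
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:oncall-paging'],  # hypothetical paging topic
    TreatMissingData='notBreaching'
)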
Replaying Messages: The Right Way
This is where most teams get it wrong. Once you’ve fixed the root cause of the failures, you need to move messages from the DLQ back to the source queue. There are two approaches.
Option 1: SQS Dead-Letter Queue Redrive (AWS Console and API)
AWS introduced dead-letter queue redrive as a native SQS feature in December 2021, initially in the Console, with the StartMessageMoveTask API following in 2023. This is now the recommended approach. You can trigger it from the AWS Console (SQS → select DLQ → “Start DLQ redrive”) or programmatically:
import boto3
import time
sqs = boto3.client('sqs', region_name='us-east-1')
dlq_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing-dlq'
# Start the redrive task, which moves messages back to their original source queue
response = sqs.start_message_move_task(
    SourceArn='arn:aws:sqs:us-east-1:123456789012:orders-processing-dlq',
    MaxNumberOfMessagesPerSecond=50  # Throttle to avoid overwhelming consumers
)
task_handle = response['TaskHandle']
print(f"Redrive task started: {task_handle}")

# Poll for completion
while True:
    result = sqs.list_message_move_tasks(
        SourceArn='arn:aws:sqs:us-east-1:123456789012:orders-processing-dlq',
        MaxResults=1
    )
    task = result['Results'][0]
    status = task['Status']
    approx_moved = task.get('ApproximateNumberOfMessagesMoved', 0)
    approx_total = task.get('ApproximateNumberOfMessagesToMove', 0)
    print(f"Status: {status} | Moved: {approx_moved}/{approx_total}")

    if status in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(5)

print("Redrive finished.")
The start_message_move_task API also accepts an optional DestinationArn parameter if you want to redirect messages to a different queue instead of the original source. This is useful for routing messages to a staging queue for testing before replaying to production.
Important: You can only have one active message move task per source ARN at a time. Attempting to start a second one will throw an error. You can cancel an in-progress task using cancel_message_move_task.
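If a replay needs to be stopped partway (say, the consumers start failing again), the task handle captured earlier can be used to cancel it. Continuing from the previous snippet:
# Cancel the in-progress redrive; messages already moved remain in the source queue
sqs.cancel_message_move_task(TaskHandle=task_handle)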
Option 2: Manual Replay Script (For Custom Logic)
Sometimes you need to transform, filter, or log messages before replaying them. In that case, a custom script gives you full control:
import boto3
import json
sqs = boto3.client('sqs', region_name='us-east-1')
dlq_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing-dlq'
source_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing'
messages_replayed = 0
messages_skipped = 0
while True:
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10,  # Max batch size
        WaitTimeSeconds=5,       # Long polling
        MessageAttributeNames=['All'],
        AttributeNames=['All']
    )
    messages = response.get('Messages', [])
    if not messages:
        print("No more messages in DLQ.")
        break

    for message in messages:
        try:
            body = json.loads(message['Body'])
        except json.JSONDecodeError:
            body = {}  # Treat unparseable bodies as malformed so they are skipped below

        # Example: skip messages that are known to be invalid
        if 'order_id' not in body:
            print(f"Skipping malformed message: {message['MessageId']}")
            messages_skipped += 1
            # Delete the malformed message from DLQ so it doesn't block future redrives
            sqs.delete_message(
                QueueUrl=dlq_url,
                ReceiptHandle=message['ReceiptHandle']
            )
            continue

        # Replay the message to the source queue, preserving original message attributes
        msg_attributes = message.get('MessageAttributes', {})
        sqs.send_message(
            QueueUrl=source_url,
            MessageBody=message['Body'],
            MessageAttributes=msg_attributes
        )

        # Delete from DLQ only after successful send
        sqs.delete_message(
            QueueUrl=dlq_url,
            ReceiptHandle=message['ReceiptHandle']
        )
        messages_replayed += 1
print(f"Replay complete. Replayed: {messages_replayed}, Skipped: {messages_skipped}")
Critical warning: In the manual approach, always delete the message from the DLQ after successfully sending it to the source queue. If you reverse the order and the send fails, you lose the message permanently. For extra safety in high-stakes workloads, consider writing the message to an S3 backup before deleting it from the DLQ.
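A small helper along these lines can archive each message to S3 before the delete call; the bucket name and key scheme are illustrative, not part of the script above:
import boto3

s3 = boto3.client('s3', region_name='us-east-1')
BACKUP_BUCKET = 'my-dlq-message-archive'  # Hypothetical bucket; create and secure your own

def archive_message(message):
    """Write the raw DLQ message body to S3, keyed by message ID."""
    s3.put_object(
        Bucket=BACKUP_BUCKET,
        Key=f"orders-processing-dlq/{message['MessageId']}.json",
        Body=message['Body'].encode('utf-8')
    )

# In the replay loop, call archive_message(message) before sqs.delete_message(...)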
Common Mistakes and How to Avoid Them
Mistake 1: DLQ Retention Period Too Short
As mentioned earlier, the message timestamp doesn’t reset when it moves to the DLQ. If your source queue has a 4-day retention and your DLQ also has a 4-day retention, messages that took time to exhaust their retries will expire from the DLQ before you can react. Fix: Always set DLQ retention to 14 days (1209600 seconds).
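If your DLQ already exists with a shorter retention, you can raise it in place with set_queue_attributes, for example:
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')

# Raise the DLQ's retention to the 14-day maximum (in seconds)
sqs.set_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/orders-processing-dlq',
    Attributes={'MessageRetentionPeriod': '1209600'}
)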
Mistake 2: Mixing Queue Types
A standard queue cannot use a FIFO queue as its DLQ, and vice versa. SQS will reject the redrive policy with an InvalidParameterValue error. Fix: Match the queue type. Use .fifo suffix for both or neither.
Mistake 3: No Alarm on the DLQ
I’ve seen teams configure DLQs and then forget about them for months. Messages pile up and eventually expire, defeating the entire purpose. Fix: Always pair a DLQ with a CloudWatch alarm, as shown above.
Mistake 4: Setting maxReceiveCount Too Low
A `maxReceiveCount` of 1 means a single transient error (network blip, Lambda cold start timeout) sends the message to the DLQ. Your DLQ fills up with messages that would have succeeded on retry. Fix: Use 3-5, as recommended earlier, so transient failures get a few chances to resolve before a message is declared dead.