Dead Letter Queue
Understanding and managing failed message processing
The Dead Letter Queue captures messages that fail to process after all retry attempts. This ensures failed messages aren't lost and can be investigated and retried.
What is a Dead Letter Queue?
A DLQ is a holding area for messages that couldn't be processed successfully:
Message → Processing → Success ✓
              ↓
          Failure → Retry → Retry → Retry → DLQ
When a message exhausts all retry attempts, it's moved to the DLQ instead of being discarded.
Why Messages Go to DLQ
Common Causes
| Cause | Description | Resolution |
|---|---|---|
| Integration failures | External service unavailable | Wait and retry |
| Authentication errors | OAuth token expired | Reconnect integration |
| Invalid data | Message payload corrupted | Fix data source |
| Rate limiting | Too many API calls | Adjust workflow timing |
| Timeout | Operation took too long | Optimize or increase timeout |
| Validation errors | Input doesn't match schema | Fix upstream data |
Message Lifecycle
- Message Created - Event or action generates message
- Processing Attempt - System tries to process
- Failure - Processing fails with error
- Retry - System waits and retries (up to max attempts)
- DLQ - After max retries, message moves to DLQ
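This lifecycle can be sketched as a small retry loop. The sketch below is illustrative only, not the platform's implementation; the `QueuedMessage` and `handleWithRetries` names are assumptions.

```typescript
// Illustrative sketch of the retry-then-DLQ lifecycle (not the platform's actual code).
interface QueuedMessage {
  payload: unknown;
  attempts: number;
  lastError?: string;
  deadAt?: Date;
}

async function handleWithRetries(
  msg: QueuedMessage,
  process: (payload: unknown) => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  while (msg.attempts < maxAttempts) {
    try {
      msg.attempts += 1;            // Processing Attempt
      await process(msg.payload);
      return;                       // Success: the message is done
    } catch (err) {
      msg.lastError = String(err);  // Failure: record the error
      // Retry: a real system waits (backoff) before the next attempt
    }
  }
  msg.deadAt = new Date();          // DLQ: max retries exhausted
}
```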
DLQ Components
The platform has multiple DLQs for different message types:
System Event Outbox DLQ
For messages from event triggers:
SystemEventOutbox
├── messageKey: unique identifier
├── payload: event data
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ
Workflow Triggered Outbox DLQ
For workflow execution messages:
WorkflowTriggeredOutbox
├── messageKey: unique identifier
├── payload: workflow/version info
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ
Execution Node Ready Outbox DLQ
For node execution messages:
ExecutionNodeReadyOutbox
├── messageKey: unique identifier
├── sessionKey: execution session
├── payload: node execution info
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ
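The three outboxes share essentially the same fields. A rough TypeScript sketch of that common shape, with the field names taken from the listings above (the `DlqMessage` interface itself is an illustration, not the platform's schema):

```typescript
// Common shape shared by the three outbox DLQs (illustrative, not the platform's schema).
interface DlqMessage {
  messageKey: string;      // unique identifier
  sessionKey?: string;     // present on ExecutionNodeReadyOutbox only
  payload: unknown;        // event, workflow/version, or node execution info
  attempts: number;        // number of tries so far
  lastError?: string;      // error message from the last attempt
  deadAt?: Date | null;    // set when the message moves to the DLQ
  createdAt: Date;         // when the message was originally created
}
```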
Viewing DLQ Contents
Identifying DLQ Messages
Messages in the DLQ have:
- `deadAt` timestamp (not null)
- `attempts` equal to max retry count
- `lastError` with failure details
Key Information
When investigating DLQ messages:
| Field | Purpose |
|---|---|
| `messageKey` | Unique identifier for tracking |
| `payload` | Original message content |
| `attempts` | How many times processing was tried |
| `lastError` | Error from last attempt |
| `deadAt` | When message entered DLQ |
| `createdAt` | When message was originally created |
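Assuming messages can be read from the outbox store, a DLQ entry is simply one whose `deadAt` is set. A minimal sketch reusing the `DlqMessage` shape from above (the `isDead` and `summarizeDlq` helpers are hypothetical):

```typescript
// A message is in the DLQ when deadAt is set; pull out the fields above for triage.
function isDead(msg: DlqMessage): boolean {
  return msg.deadAt != null;
}

function summarizeDlq(messages: DlqMessage[]) {
  return messages.filter(isDead).map((m) => ({
    messageKey: m.messageKey,
    attempts: m.attempts,
    lastError: m.lastError,
    createdAt: m.createdAt,
    deadAt: m.deadAt,
  }));
}
```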
Retry Configuration
Default Retry Policy
{
"max_attempts": 3,
"strategy": "exponential_jitter",
"backoff_ms": 1000,
"max_backoff_ms": 30000
}
Retry Strategies
| Strategy | Description | Use Case |
|---|---|---|
| `exponential_jitter` | Exponential backoff with randomization | Most situations (default) |
| `exponential` | Pure exponential backoff | Predictable retry timing |
| `fixed` | Same delay each time | Simple retry pattern |
| `none` | No retries | One-shot operations |
Backoff Calculation
Exponential with Jitter:
delay = min(base * 2^attempt + random(0, jitter), max_backoff)
Example Sequence (1s base, 30s max):
- Attempt 1: ~1-2 seconds
- Attempt 2: ~2-4 seconds
- Attempt 3: ~4-8 seconds
- Attempt 4+: Up to 30 seconds
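A sketch of that calculation, written to match the example sequence above by treating the first retry as attempt 1 (the platform's exact indexing and jitter range may differ):

```typescript
// Exponential backoff with jitter, capped at maxBackoffMs (illustrative).
function backoffDelayMs(
  attempt: number,        // 1 for the first retry, 2 for the second, ...
  baseMs = 1000,
  maxBackoffMs = 30000,
): number {
  const exponential = baseMs * 2 ** (attempt - 1);
  const jitter = Math.random() * exponential;   // uniform in [0, exponential)
  return Math.min(exponential + jitter, maxBackoffMs);
}

// With baseMs = 1000: attempt 1 ≈ 1-2s, attempt 2 ≈ 2-4s, attempt 3 ≈ 4-8s.
```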
Managing DLQ Messages
Investigation Steps
- Identify the Pattern
  - Are multiple messages failing?
  - Same error message?
  - Same time period?
- Check the Error
  - Read `lastError` carefully
  - Is it transient or permanent?
  - External service issue?
- Review the Payload
  - Is the data valid?
  - Any unexpected values?
  - Schema changes?
- Check External Systems
  - Integration still connected?
  - API available?
  - Rate limits hit?
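For the first step, grouping dead messages by their last error usually makes the common cause obvious. A minimal sketch, again reusing the illustrative `DlqMessage` shape (the `groupByError` helper is hypothetical):

```typescript
// Count dead messages per error string to spot the dominant failure mode.
function groupByError(messages: DlqMessage[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const m of messages) {
    if (m.deadAt == null) continue;             // only DLQ entries
    const key = m.lastError ?? "(no error recorded)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```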
Common Resolution Actions
| Issue | Action |
|---|---|
| Expired OAuth | Reconnect integration |
| Service outage | Wait and retry |
| Bad data | Fix source, discard message |
| Rate limit | Retry with delay |
| Schema change | Update workflow, reprocess |
Retrying DLQ Messages
To retry messages in the DLQ:
- Fix the underlying issue
- Reset the message status:
  - Clear `deadAt` timestamp
  - Reset attempt counter
  - Set new `availableAt` time
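In code, the reset amounts to clearing the DLQ markers and rescheduling the message. A hedged sketch, assuming an `availableAt` scheduling field and the `DlqMessage` shape from earlier; how these fields are actually persisted depends on the platform's store:

```typescript
// Illustrative reset: clear the DLQ markers so the message becomes eligible again.
// availableAt is an assumed scheduling field (when the message may next be picked up).
interface RetryableMessage extends DlqMessage {
  availableAt?: Date;
}

function resetForRetry(msg: RetryableMessage, retryAt: Date = new Date()): RetryableMessage {
  return {
    ...msg,
    deadAt: null,          // clear deadAt timestamp
    attempts: 0,           // reset attempt counter
    lastError: undefined,
    availableAt: retryAt,  // set new availableAt time
  };
}
```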
Discarding DLQ Messages
Some messages should not be retried:
- Invalid data that can't be fixed
- Obsolete events (e.g., old meeting reminders)
- Duplicate messages
To discard, mark the message as processed without retrying.
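A brief sketch of the discard path under the same assumptions; `processedAt` is a hypothetical stand-in for however the platform records that a message is finished:

```typescript
// Illustrative discard: mark the message as handled without retrying it.
function discard(msg: DlqMessage & { processedAt?: Date }) {
  return { ...msg, processedAt: new Date() };
}
```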
Preventing DLQ Issues
1. Design Robust Workflows
Build error handling into workflows:
[Action] → [On Error] → [Handle Error]
                              ↓
                     [Alternative Path]
2. Use Appropriate Retry Settings
Match retry config to the operation:
| Operation Type | Recommended Config |
|---|---|
| External API | 3 attempts, exponential_jitter |
| AI Processing | 2 attempts, exponential |
| Non-critical | 1 attempt, none |
| Critical | 5 attempts, exponential_jitter |
3. Validate Input Data
Check data before processing:
[Load Data] → [Validate] → [If: Valid?]
                               ├─ Yes ─→ [Process]
                               └─ No ──→ [Handle Invalid]
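A minimal sketch of that branch, assuming a hypothetical `isValidPayload` check in place of whatever schema validation the workflow actually uses:

```typescript
// Route data through a validation check so invalid payloads never hit the retry/DLQ machinery.
function isValidPayload(data: unknown): data is { id: string } {
  return typeof data === "object" && data !== null && typeof (data as { id?: unknown }).id === "string";
}

async function handleItem(data: unknown, processValid: (d: { id: string }) => Promise<void>) {
  if (isValidPayload(data)) {
    await processValid(data);                                    // Yes → [Process]
  } else {
    console.warn("Invalid payload, routing to error handling");  // No → [Handle Invalid]
  }
}
```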
4. Handle Rate Limits
Add delays between operations:
[Process Item] → [Wait: 1 second] → [Next Item]
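The wait step can be as simple as sleeping between calls. A sketch (the `processWithDelay` helper is an assumption):

```typescript
// Process items sequentially with a fixed delay between them to stay under rate limits.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processWithDelay<T>(
  items: T[],
  processItem: (item: T) => Promise<void>,
  delayMs = 1000,
): Promise<void> {
  for (const item of items) {
    await processItem(item);  // [Process Item]
    await sleep(delayMs);     // [Wait: 1 second]
  }
}
```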
5. Monitor Integration Health
Proactively check integration status:
- OAuth token expiration
- API availability
- Usage limits
DLQ Monitoring
Key Metrics
Monitor these indicators:
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| DLQ count | 0 | 1-10 | >10 |
| DLQ growth rate | 0/hour | 1-5/hour | >5/hour |
| Time in DLQ | <1 hour | 1-24 hours | >24 hours |
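These thresholds translate directly into simple checks that can back the alerts described below. An illustrative sketch (the function names are assumptions):

```typescript
// Map raw DLQ metrics onto the healthy/warning/critical bands from the table above.
type Severity = "healthy" | "warning" | "critical";

function dlqCountSeverity(count: number): Severity {
  if (count === 0) return "healthy";
  if (count <= 10) return "warning";
  return "critical";
}

function timeInDlqSeverity(ageHours: number): Severity {
  if (ageHours < 1) return "healthy";
  if (ageHours <= 24) return "warning";
  return "critical";
}
```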
Alerting Patterns
Set up alerts for:
- New DLQ entries - Investigate immediately
- DLQ count threshold - Review when count exceeds limit
- Old DLQ entries - Messages stuck too long
Best Practices
1. Review DLQ Regularly
Check DLQ daily or set up automated monitoring.
2. Fix Root Causes
Don't just retry - understand why messages failed.
3. Document Common Issues
Keep a runbook of common DLQ issues and resolutions.
4. Test Error Paths
Test your workflows with failure scenarios.
5. Set Appropriate Retry Limits
Balance reliability with resource usage:
// For transient failures (external APIs)
{ "max_attempts": 5, "strategy": "exponential_jitter" }
// For permanent failures (validation)
{ "max_attempts": 1, "strategy": "none" }
Troubleshooting
High DLQ Volume
Symptoms: Many messages in DLQ
Causes:
- External service outage
- Authentication failure
- Rate limiting
Resolution:
- Identify common error pattern
- Fix root cause
- Bulk retry after fix
Messages Keep Failing on Retry
Symptoms: Retried messages go back to DLQ
Causes:
- Permanent error, not transient
- Root cause not fixed
- Invalid message data
Resolution:
- Review error carefully
- Check if data can be fixed
- Consider discarding if unfixable
DLQ Growing During Normal Operation
Symptoms: Slow DLQ growth without an outage
Causes:
- Edge cases in data
- Intermittent failures
- Configuration issues
Resolution:
- Review failing message types
- Identify patterns
- Add workflow error handling