Dead Letter Queue

Understanding and managing failed message processing

The Dead Letter Queue captures messages that fail to process after all retry attempts. This ensures failed messages aren't lost and can be investigated and retried.

What is a Dead Letter Queue?

A DLQ is a holding area for messages that couldn't be processed successfully:

text
Message → Processing → Success ✓
              ↓
           Failure → Retry → Retry → Retry → DLQ

When a message exhausts all retry attempts, it's moved to the DLQ instead of being discarded.

Why Messages Go to DLQ

Common Causes

Cause | Description | Resolution
--- | --- | ---
Integration failures | External service unavailable | Wait and retry
Authentication errors | OAuth token expired | Reconnect integration
Invalid data | Message payload corrupted | Fix data source
Rate limiting | Too many API calls | Adjust workflow timing
Timeout | Operation took too long | Optimize or increase timeout
Validation errors | Input doesn't match schema | Fix upstream data

Message Lifecycle

  1. Message Created - Event or action generates message
  2. Processing Attempt - System tries to process
  3. Failure - Processing fails with error
  4. Retry - System waits and retries (up to max attempts)
  5. DLQ - After max retries, message moves to DLQ
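
The same flow expressed as a minimal TypeScript sketch; the process callback, the maxAttempts value, and the message shape here are illustrative assumptions, not platform APIs:

typescript
// Try a message up to maxAttempts times; on exhaustion, mark it dead-lettered.
type Msg = { payload: unknown; attempts: number; lastError: string | null; deadAt: Date | null };

async function handleMessage(
  msg: Msg,
  process: (payload: unknown) => Promise<void>,  // hypothetical processing step
  maxAttempts = 3,
): Promise<void> {
  while (msg.attempts < maxAttempts) {
    msg.attempts += 1;                 // 2. Processing attempt
    try {
      await process(msg.payload);
      return;                          // Success: message is done
    } catch (err) {
      msg.lastError = String(err);     // 3. Failure recorded; 4. the loop retries
    }
  }
  msg.deadAt = new Date();             // 5. Retries exhausted: message moves to the DLQ
}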

DLQ Components

The platform has multiple DLQs for different message types:

System Event Outbox DLQ

For messages from event triggers:

text
SystemEventOutbox
├── messageKey: unique identifier
├── payload: event data
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ

Workflow Triggered Outbox DLQ

For workflow execution messages:

text
WorkflowTriggeredOutbox
├── messageKey: unique identifier
├── payload: workflow/version info
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ

Execution Node Ready Outbox DLQ

For node execution messages:

text
ExecutionNodeReadyOutbox
├── messageKey: unique identifier
├── sessionKey: execution session
├── payload: node execution info
├── attempts: number of tries
├── lastError: error message
└── deadAt: when moved to DLQ
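
The three outboxes share the same dead-letter fields. Modeled as a TypeScript shape (a sketch; the field types are inferred from the descriptions above, not a published schema):

typescript
// Assumed field types; sessionKey appears only on ExecutionNodeReadyOutbox records.
interface DeadLetterRecord {
  messageKey: string;         // unique identifier
  sessionKey?: string;        // execution session (ExecutionNodeReadyOutbox only)
  payload: unknown;           // original message content
  attempts: number;           // number of tries
  lastError: string | null;   // error from the last attempt
  deadAt: Date | null;        // when moved to the DLQ; null while the message is still live
  createdAt: Date;            // when the message was originally created
}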

Viewing DLQ Contents

Identifying DLQ Messages

Messages in the DLQ have:

  • deadAt timestamp (not null)
  • attempts equal to max retry count
  • lastError with failure details

Key Information

When investigating DLQ messages:

Field | Purpose
--- | ---
messageKey | Unique identifier for tracking
payload | Original message content
attempts | How many times processing was tried
lastError | Error from last attempt
deadAt | When message entered DLQ
createdAt | When message was originally created
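
A quick report over those fields might look like the following sketch; the records input and the maxAttempts default are assumptions about how you load and configure the outbox data:

typescript
// List dead-lettered messages along with the fields worth investigating.
function summarizeDeadLetters(
  records: Array<{ messageKey: string; attempts: number; lastError: string | null; deadAt: Date | null }>,
  maxAttempts = 3,
): void {
  const dead = records.filter((r) => r.deadAt !== null && r.attempts >= maxAttempts);
  for (const r of dead) {
    console.log(`${r.messageKey}: ${r.attempts} attempts, dead at ${r.deadAt?.toISOString()}, last error: ${r.lastError}`);
  }
}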

Retry Configuration

Default Retry Policy

json
{
  "max_attempts": 3,
  "strategy": "exponential_jitter",
  "backoff_ms": 1000,
  "max_backoff_ms": 30000
}
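
In TypeScript terms, the policy could be modeled roughly like this (a sketch; the strategy names come from the table below, and the type itself is not a platform API):

typescript
// Shape of the retry policy shown above.
interface RetryPolicy {
  max_attempts: number;
  strategy: "exponential_jitter" | "exponential" | "fixed" | "none";
  backoff_ms: number;       // base delay in milliseconds
  max_backoff_ms: number;   // upper bound on any single delay
}

const defaultPolicy: RetryPolicy = {
  max_attempts: 3,
  strategy: "exponential_jitter",
  backoff_ms: 1000,
  max_backoff_ms: 30000,
};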

Retry Strategies

Strategy | Description | Use Case
--- | --- | ---
exponential_jitter | Exponential backoff with randomization | Most situations (default)
exponential | Pure exponential backoff | Predictable retry timing
fixed | Same delay each time | Simple retry pattern
none | No retries | One-shot operations

Backoff Calculation

Exponential with Jitter:

text
delay = min(base * 2^attempt + random(0, jitter), max_backoff)

Example Sequence (1s base, 30s max):

  • Attempt 1: ~1-2 seconds
  • Attempt 2: ~2-4 seconds
  • Attempt 3: ~4-8 seconds
  • Attempt 4+: Up to 30 seconds
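
A sketch of that calculation in TypeScript; backoff_ms and max_backoff_ms match the policy above, while the jitter range (taken here as the exponential term itself, which reproduces the sequence above) is an assumption:

typescript
// Exponential backoff with jitter, in milliseconds.
function backoffDelay(
  attempt: number,          // 1 for the first retry, 2 for the second, ...
  backoffMs = 1000,         // backoff_ms
  maxBackoffMs = 30000,     // max_backoff_ms
): number {
  const exponential = backoffMs * 2 ** (attempt - 1);   // 1s, 2s, 4s, ...
  const jitter = Math.random() * exponential;           // assumed jitter range
  return Math.min(exponential + jitter, maxBackoffMs);
}
// Attempt 1 ≈ 1-2s, attempt 2 ≈ 2-4s, attempt 3 ≈ 4-8s, later attempts capped at 30s.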

Managing DLQ Messages

Investigation Steps

  1. Identify the Pattern

    • Are multiple messages failing?
    • Same error message?
    • Same time period?
  2. Check the Error

    • Read lastError carefully
    • Is it transient or permanent?
    • External service issue?
  3. Review the Payload

    • Is the data valid?
    • Any unexpected values?
    • Schema changes?
  4. Check External Systems

    • Integration still connected?
    • API available?
    • Rate limits hit?

Common Resolution Actions

Issue | Action
--- | ---
Expired OAuth | Reconnect integration
Service outage | Wait and retry
Bad data | Fix source, discard message
Rate limit | Retry with delay
Schema change | Update workflow, reprocess

Retrying DLQ Messages

To retry messages in the DLQ:

  1. Fix the underlying issue
  2. Reset the message status
  3. Clear deadAt timestamp
  4. Reset attempt counter
  5. Set new availableAt time
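
As a sketch, resetting a record for retry might look like this; the field names mirror the record shape described under DLQ Components, while availableAt and the update mechanism depend on how you access the outbox tables:

typescript
// Clear the dead-letter state so the message becomes eligible for processing again.
function resetForRetry(record: {
  attempts: number;
  lastError: string | null;
  deadAt: Date | null;
  availableAt: Date;
}): void {
  record.deadAt = null;            // clear deadAt timestamp
  record.attempts = 0;             // reset attempt counter
  record.lastError = null;
  record.availableAt = new Date(); // eligible for processing immediately
}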

Discarding DLQ Messages

Some messages should not be retried:

  • Invalid data that can't be fixed
  • Obsolete events (e.g., old meeting reminders)
  • Duplicate messages

To discard, mark the message as processed without retrying.

Preventing DLQ Issues

1. Design Robust Workflows

Build error handling into workflows:

text
[Action] → [On Error] → [Handle Error]
               ↓
       [Alternative Path]

2. Use Appropriate Retry Settings

Match retry config to the operation:

Operation Type | Recommended Config
--- | ---
External API | 3 attempts, exponential_jitter
AI Processing | 2 attempts, exponential
Non-critical | 1 attempt, none
Critical | 5 attempts, exponential_jitter

3. Validate Input Data

Check data before processing:

text
[Load Data] → [Validate] → [If: Valid?]
                           ├─ Yes ─→ [Process]
                           └─ No ──→ [Handle Invalid]

4. Handle Rate Limits

Add delays between operations:

text
[Process Item] → [Wait: 1 second] → [Next Item]
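
In code form, a fixed delay between items might look like this sketch; processItem and the items list are placeholders for whatever the workflow step actually does:

typescript
// Process items one at a time with a 1-second pause between calls.
async function processWithDelay<T>(
  items: T[],
  processItem: (item: T) => Promise<void>,
  delayMs = 1000,
): Promise<void> {
  for (const item of items) {
    await processItem(item);
    await new Promise((resolve) => setTimeout(resolve, delayMs)); // wait before the next item
  }
}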

5. Monitor Integration Health

Proactively check integration status:

  • OAuth token expiration
  • API availability
  • Usage limits

DLQ Monitoring

Key Metrics

Monitor these indicators:

Metric | Healthy | Warning | Critical
--- | --- | --- | ---
DLQ count | 0 | 1-10 | >10
DLQ growth rate | 0/hour | 1-5/hour | >5/hour
Time in DLQ | <1 hour | 1-24 hours | >24 hours
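
A monitoring check could classify the queue against those thresholds; the thresholds come from the table, while the inputs (current count, entries added in the last hour, age of the oldest entry) are assumptions about how you measure them:

typescript
// Classify DLQ health using the thresholds above.
type DlqHealth = "healthy" | "warning" | "critical";

function dlqHealth(count: number, newPerHour: number, oldestAgeHours: number): DlqHealth {
  if (count > 10 || newPerHour > 5 || oldestAgeHours > 24) return "critical";
  if (count >= 1 || newPerHour >= 1 || oldestAgeHours >= 1) return "warning";
  return "healthy";
}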

Alerting Patterns

Set up alerts for:

  1. New DLQ entries - Investigate immediately
  2. DLQ count threshold - Review when count exceeds limit
  3. Old DLQ entries - Messages stuck too long

Best Practices

1. Review DLQ Regularly

Check DLQ daily or set up automated monitoring.

2. Fix Root Causes

Don't just retry - understand why messages failed.

3. Document Common Issues

Keep a runbook of common DLQ issues and resolutions.

4. Test Error Paths

Test your workflows with failure scenarios.

5. Set Appropriate Retry Limits

Balance reliability with resource usage:

json
// For transient failures (external APIs)
{ "max_attempts": 5, "strategy": "exponential_jitter" }

// For permanent failures (validation)
{ "max_attempts": 1, "strategy": "none" }

Troubleshooting

High DLQ Volume

Symptoms: Many messages in DLQ

Causes:

  • External service outage
  • Authentication failure
  • Rate limiting

Resolution:

  1. Identify common error pattern
  2. Fix root cause
  3. Bulk retry after fix

Messages Keep Failing on Retry

Symptoms: Retried messages go back to DLQ

Causes:

  • Error is permanent, not transient
  • Root cause not fixed
  • Invalid message data

Resolution:

  1. Review error carefully
  2. Check if data can be fixed
  3. Consider discarding if unfixable

DLQ Growing During Normal Operation

Symptoms: Slow DLQ growth without an outage

Causes:

  • Edge cases in data
  • Intermittent failures
  • Configuration issues

Resolution:

  1. Review failing message types
  2. Identify patterns
  3. Add workflow error handling