
Event-Driven Architecture: SQS, SNS, and EventBridge — Complete Guide with Real Examples

Everything about event-driven architecture on AWS and Azure. SQS queues, SNS fan-out, EventBridge routing — how they work, when to use each, how they combine, what breaks in production, and real flows from Uber, Amazon, Netflix, and healthcare platforms.

SystemForge · April 21, 2026 · 25 min read
Tags: Event-Driven Architecture, SQS, SNS, EventBridge, AWS, Azure Service Bus, Azure Event Grid, Microservices, System Design, Real World, Interview Prep

What Is Event-Driven Architecture?

In a traditional request-response system, Service A calls Service B directly:

Service A ──── HTTP POST ────► Service B
               waits...
Service A ◄─── response ─────  Service B

Service A blocks until Service B responds. If Service B is slow, Service A is slow. If Service B is down, Service A fails. They are tightly coupled.

In an event-driven system, Service A publishes an event and moves on:

Service A ──── publishes event ────► Message Bus
                                          │
                                    ┌─────┴─────┐
                                    ▼           ▼
                               Service B   Service C
                            (processes    (processes
                             when ready)   when ready)

Service A does not know who consumes the event. It does not wait. If Service B is down, the event waits in the queue — nothing is lost. Service C is completely unaware of Service B. Adding a new Service D requires zero changes to A, B, or C.

Real-world analogy: A newspaper publisher. The publisher prints the news (publishes an event) and distributes it. Readers (consumers) read it at their own pace. Adding a new subscriber does not require the publisher to change anything. If one reader is on holiday (service down), their paper waits at the door.


The Three Services — What Each One Is

Before going deep, understand the fundamental role of each:

SQS (Simple Queue Service)
  → Point-to-point: one message, one consumer, processed once
  → "Do this specific task" — a job queue
  → The message is deleted after successful processing
  → Azure equivalent: Service Bus Queues

SNS (Simple Notification Service)
  → Fan-out: one message, many consumers, all receive it simultaneously
  → "Something happened, everyone who cares should know"
  → Message is not stored — subscribers either receive it or miss it
  → Azure equivalent: Service Bus Topics + Subscriptions

EventBridge
  → Smart routing: events flow through rules that filter and route by content
  → "Route this event to different places based on what it contains"
  → Connects AWS services to each other and to custom applications
  → Azure equivalent: Event Grid

They are not alternatives. They solve different problems. In production systems you almost always use all three together.


SQS — Deep Dive

How SQS Works

Producer                Queue                   Consumer
   │                      │                         │
   ├── send message ──────►│                         │
   │                       │◄── receive messages ────┤
   │                       │                         │ (message invisible
   │                       │                         │  to others for
   │                       │                         │  visibility timeout)
   │                       │◄── delete message ──────┤ (after processing)
   │                       │                         │
   │                       ├── (N retries fail) ─────►DLQ

Key concepts:

Visibility Timeout: When a consumer reads a message, it becomes invisible to all other consumers for X seconds. If the consumer processes it and deletes it → done. If the consumer crashes without deleting it → message reappears after the timeout and another consumer picks it up. This is how SQS guarantees at-least-once delivery.

Timeline:
  t=0:    Consumer A reads message → message invisible for 30s
  t=0-30: Consumer A processes the message
  t=28:   Consumer A deletes the message → done
  OR:
  t=15:   Consumer A crashes
  t=30:   Visibility timeout expires → message reappears
  t=31:   Consumer B picks it up and processes it

Dead Letter Queue (DLQ): After N failed processing attempts (maxReceiveCount), the message moves to a separate DLQ. It sits there until an engineer investigates. Without a DLQ, failed messages cycle forever — blocking the queue and retrying endlessly.
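The consumer side of that contract can be sketched as a receive/process/delete loop. This is an illustrative sketch, not the definitive implementation: the sqs argument is assumed to be a boto3-shaped client (e.g. boto3.client('sqs')), and queue_url and process are placeholders.

```python
import json

def poll_once(sqs, queue_url, process):
    """One receive/process/delete cycle. Returns how many messages succeeded."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling: fewer empty receives
    )
    handled = 0
    for msg in resp.get("Messages", []):
        try:
            process(json.loads(msg["Body"]))
            # Delete ONLY after successful processing; deletion is the "ack"
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
            handled += 1
        except Exception:
            # No delete: the message becomes visible again once the visibility
            # timeout expires, and after maxReceiveCount failed receives it
            # lands in the DLQ
            pass
    return handled
```

The important detail is the order of operations: process first, delete second. Deleting before processing turns at-least-once delivery into at-most-once.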

SQS Standard vs FIFO

| | Standard Queue | FIFO Queue |
|---|---|---|
| Ordering | Best-effort (may be out of order) | Strict first-in-first-out |
| Delivery | At-least-once (may deliver twice) | Exactly-once |
| Throughput | Unlimited | 300 msg/s (3,000 with batching) |
| Cost | Cheaper | 10× more expensive |
| Use when | Order does not matter, idempotency handled in code | Order matters, or exactly-once required |

Use Standard for: sending emails, processing call recordings, generating reports — if the same email is sent twice it is annoying but not catastrophic, handled with idempotency checks.

Use FIFO for: financial transactions, state machine transitions, order processing — if the same charge executes twice you have a problem.
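A minimal shape for the idempotency check that Standard queues require, assuming you key on a stable message or order ID. InMemoryStore and put_if_absent are illustrative names; in production the store would typically be a DynamoDB table written with ConditionExpression='attribute_not_exists(PK)' so the check-and-record step is atomic.

```python
def process_once(store, message_id, action):
    """Run `action` at most once per message_id.
    `store.put_if_absent` must be atomic: it records the ID and returns
    False if the ID was already recorded (i.e. a duplicate delivery)."""
    if not store.put_if_absent(message_id):
        return False   # duplicate from an at-least-once queue: skip silently
    action()
    return True

class InMemoryStore:
    """Stand-in for the real dedup table, for local testing only."""
    def __init__(self):
        self._seen = set()
    def put_if_absent(self, key):
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this in place, the duplicate email from a Standard queue becomes a no-op instead of an annoyance.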


Real Example 1: Amazon Order Processing

When you place an order on Amazon, a dozen things need to happen. Some can happen in parallel. Some must happen in a specific order. None of them should block your checkout experience. Five of them are shown below.

You click "Place Order"
       ↓
Order Service:
  1. Saves order to database (synchronous — must complete before "Order confirmed" page)
  2. Returns "Order #112-3456789 confirmed!" to your browser
  3. Sends messages to the SQS queues below (async — happens after the response)

Queue assignments:
  InventoryQueue:
    Message: { orderId: "112-3456789", items: [...] }
    Consumer Lambda: reserve stock for each item
    If stock unavailable → email customer, cancel order item

  PaymentQueue:
    Message: { orderId: "112-3456789", amount: 89.99, paymentMethod: "card_visa" }
    Consumer Lambda: charge the card
    If charge fails → inventory release + customer notification (via another queue)

  FulfillmentQueue:
    Message: { orderId: "112-3456789", warehouseId: "LHR7" }
    Consumer Lambda: create pick list for warehouse worker

  NotificationQueue:
    Message: { orderId: "112-3456789", email: "customer@example.com" }
    Consumer Lambda: send confirmation email

  AnalyticsQueue:
    Message: { orderId: "112-3456789", value: 89.99, category: "electronics" }
    Consumer Lambda: update real-time sales dashboard

All 5 queues process in parallel.
If the analytics Lambda is slow: only the analytics queue backs up.
The payment and fulfillment queues are unaffected.
Customer gets their order on time.

What breaks without SQS (tight coupling):

You click "Place Order"
       ↓
Order Service calls:
  → InventoryService.Reserve() — 50ms
  → PaymentService.Charge() — 800ms (credit card network is slow)
  → FulfillmentService.CreatePickList() — 200ms
  → EmailService.SendConfirmation() — 300ms (email provider is slow)
  → AnalyticsService.Record() — 100ms

Total: 1,450ms before "Order confirmed" page appears

If EmailService is down:
  Your order fails with a 500 error
  Payment was NOT taken (rolled back)
  You have to re-order
  Amazon lost the sale because the email service was down

Real Example 2: MyBCAT Post-Call Workflow

A call ends. Several things must happen, and none should block the others. The Lambda below fans three of them out to SQS queues.

Python
# call_ended Lambda — triggered by an Amazon Connect EventBridge event
def handler(event, context):
    call_data = event['detail']

    # Determine which queues get which payloads
    messages = [
        {
            'QueueUrl': TRANSCRIPTION_QUEUE,
            'MessageBody': json.dumps({
                'callId': call_data['contactId'],
                'recordingKey': call_data['recordingS3Key'],
                'practiceId': call_data['practiceId'],
                'agentId': call_data['agentId']
            }),
            'MessageGroupId': call_data['practiceId'],  # FIFO — process per practice in order
            'MessageDeduplicationId': call_data['contactId']  # exactly once
        },
        {
            'QueueUrl': CRM_UPDATE_QUEUE,
            'MessageBody': json.dumps({
                'callId': call_data['contactId'],
                'patientPhone': call_data['customerEndpoint'],
                'duration': (
                    datetime.fromisoformat(call_data['disconnectTimestamp'].replace('Z', '+00:00'))
                    - datetime.fromisoformat(call_data['initiationTimestamp'].replace('Z', '+00:00'))
                ).total_seconds(),  # Connect timestamps are ISO-8601 strings, not numbers
                'outcome': call_data['disconnectReason']
            })
        },
        {
            'QueueUrl': BILLING_QUEUE,
            'MessageBody': json.dumps({
                'practiceId': call_data['practiceId'],
                'callId': call_data['contactId'],
                'duration': call_data['duration'],
                'callType': call_data['queue']
            })
        }
    ]

    # Send all messages simultaneously
    for msg in messages:
        sqs.send_message(**msg)

    return {'statusCode': 200}
Python
# transcription Lambda — processes TranscriptionQueue
def handler(event, context):
    for record in event['Records']:  # SQS can batch deliver multiple messages
        body = json.loads(record['body'])

        try:
            # Download recording from S3
            audio = s3.get_object(
                Bucket='mybcat-recordings-prod',
                Key=body['recordingKey']
            )['Body'].read()

            # Transcribe with DeepGram
            transcript = deepgram_client.transcribe(audio, language='en-US')

            # Save transcript to S3 and update call record
            s3.put_object(
                Bucket='mybcat-transcripts-prod',
                Key=f"transcripts/{body['callId']}.json",
                Body=json.dumps(transcript)
            )

            dynamodb.update_item(
                Key={'PK': f"CALL#{body['callId']}"},
                UpdateExpression='SET transcriptKey = :key, transcriptStatus = :status',
                ExpressionAttributeValues={
                    ':key': f"transcripts/{body['callId']}.json",
                    ':status': 'COMPLETED'
                }
            )

        except Exception as e:
            logger.exception(f"Transcription failed for call {body['callId']}")
            raise  # Re-raise so SQS knows this message failed and should retry
            # After maxReceiveCount failures → moves to TranscriptionDLQ

Why raising the exception matters: If your Lambda catches an exception and returns normally, SQS thinks the message was processed successfully and deletes it. The failed transcription is silently lost. Raising the exception (or not catching it) signals failure — SQS keeps the message and retries.
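One refinement: re-raising fails the whole batch, so the four messages that succeeded get retried alongside the one that failed. SQS event source mappings support partial batch responses (enable function_response_types = ["ReportBatchItemFailures"] on the mapping). A hedged sketch, with do_work standing in for the real processing:

```python
import json

def do_work(body):
    # Placeholder for the real processing (transcription, CRM update, ...)
    if "callId" not in body:
        raise ValueError("missing callId")

def handler(event, context):
    """Report per-message failures instead of failing the whole batch."""
    failures = []
    for record in event["Records"]:
        try:
            do_work(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # SQS deletes every message NOT listed here; listed ones are retried
    return {"batchItemFailures": failures}
```

Successful messages are deleted, failed ones stay on the queue and still flow to the DLQ after maxReceiveCount attempts.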


SQS Configuration That Matters in Production

HCL
resource "aws_sqs_queue" "transcription" {
  name                       = "mybcat-transcription-${var.env}"
  visibility_timeout_seconds = 300   # How long the consumer has to process
                                     # Must be comfortably longer than your Lambda timeout
                                     # If Lambda timeout = 60s, AWS recommends ~360s (6×)

  message_retention_seconds  = 86400  # Keep messages for 1 day if not processed
  receive_wait_time_seconds  = 20     # Long polling — waits up to 20s for messages
                                      # Reduces empty receives, saves cost

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.transcription_dlq.arn
    maxReceiveCount     = 3  # After 3 failed receives → DLQ
  })
}

resource "aws_sqs_queue" "transcription_dlq" {
  name                      = "mybcat-transcription-dlq-${var.env}"
  message_retention_seconds = 1209600  # Keep failed messages for 14 days
}

# Alarm: DLQ has messages → something is broken
resource "aws_cloudwatch_metric_alarm" "transcription_dlq" {
  alarm_name          = "transcription-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  dimensions          = { QueueName = aws_sqs_queue.transcription_dlq.name }
  alarm_actions       = [aws_sns_topic.engineering_alerts.arn]
  alarm_description   = "Transcription DLQ has failed messages — investigate immediately"
}

# Lambda event source mapping
resource "aws_lambda_event_source_mapping" "transcription" {
  event_source_arn = aws_sqs_queue.transcription.arn
  function_name    = aws_lambda_function.transcribe_call.arn
  batch_size       = 5        # Process 5 messages per Lambda invocation
  maximum_batching_window_in_seconds = 30  # Wait up to 30s to batch 5 messages
}

Visibility timeout trap: If your Lambda timeout is 60 seconds and your visibility timeout is 30 seconds, the following happens:

  1. Lambda picks up the message (t=0)
  2. Lambda runs for 35 seconds processing
  3. Visibility timeout expires at t=30 — message becomes visible again
  4. A second Lambda picks up the same message (t=31)
  5. Both Lambdas are now processing the same message simultaneously

Always set visibility_timeout_seconds to at least 6× your Lambda timeout. This leaves room for retries within the timeout window.


SNS — Deep Dive

How SNS Works

SNS is a pub/sub system. Publishers send to a topic. Subscribers receive from a topic. Every subscriber gets every message.

Publisher
    │
    └──► SNS Topic
              │
         ┌────┴────┬──────┬──────┐
         ▼         ▼      ▼      ▼
      SQS Q    SQS Q   Lambda  Email
    (Consumer  (Consumer (direct  (human
       A)         B)    invoke) notification)

Every subscriber receives every message independently.
Consumer A being slow does not affect Consumer B.

Subscription types: SQS (most common), Lambda, HTTP endpoint, email, SMS, mobile push.

The golden pattern: SNS → SQS (fan-out)

Avoid subscribing a Lambda directly to SNS for high-volume production events. SNS invokes the Lambda asynchronously; if the function is throttled (concurrency limit hit) or keeps failing, delivery is retried only for a limited window, after which the message is dropped unless you have configured a failure destination. Subscribe an SQS queue to SNS instead: the queue buffers messages, and Lambda drains the queue at a controlled pace. Nothing is lost.

                    SNS Topic
                        │
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
       SQS Q1        SQS Q2        SQS Q3
          │             │             │
       Lambda A      Lambda B      Lambda C
    (transcription) (CRM update)  (billing)
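One wiring detail worth knowing about this pattern: unless the SQS subscription sets the RawMessageDelivery attribute to true, the SQS message body is an SNS JSON envelope, and the payload you published sits inside its Message field as a string. A small consumer-side sketch of the unwrapping:

```python
import json

def unwrap(sqs_body):
    """Return the original published payload from an SQS message
    that arrived via an SNS subscription."""
    outer = json.loads(sqs_body)
    if isinstance(outer, dict) and outer.get("Type") == "Notification":
        # Default delivery: SNS envelope, payload nested in 'Message'
        return json.loads(outer["Message"])
    return outer  # RawMessageDelivery=true: the body IS the payload
```

Enabling raw delivery on the subscription avoids the envelope entirely; either way, consumers should handle both shapes during a migration.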

Real Example 3: Uber's Trip Events

When a trip status changes, every Uber system needs to know. The trip lifecycle generates dozens of events: requested, driver accepted, driver arrived, trip started, trip ended, payment processed, receipt sent.

Trip event: TRIP_COMPLETED
{
  "tripId": "trip_789abc",
  "riderId": "rider_123",
  "driverId": "driver_456",
  "distance": 8.3,
  "duration": 1420,
  "fare": 22.50,
  "currency": "GBP",
  "pickupLocation": { "lat": 51.507, "lng": -0.127 },
  "dropoffLocation": { "lat": 51.524, "lng": -0.099 }
}
           ↓
    SNS Topic: trip-events
           ↓
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │          │          │          │          │
▼          ▼          ▼          ▼          ▼          ▼
SQS        SQS        SQS        SQS        SQS        SQS
Payment  Receipts  Analytics  RiderRating DriverPay  Safety
Queue    Queue      Queue      Queue       Queue      Queue

Each queue has its own Lambda consumer:
  Payment Consumer:    → Charge rider's card £22.50
  Receipts Consumer:   → Generate PDF receipt, email to rider
  Analytics Consumer:  → Update driver earnings dashboard, city heatmap
  RiderRating Consumer: → Send "Rate your trip" push notification (after 5 minutes)
  DriverPay Consumer:  → Add £22.50 × 0.75 (after commission) to driver's weekly earnings
  Safety Consumer:     → Log trip route, flag if unusual pattern detected

What makes this powerful:

  • Uber's safety team added a new "log trip route for safety analysis" feature. They added one SQS subscription to the SNS topic. Zero changes to the Trip Service, Payment Service, or any other service.
  • The Driver Pay calculation has a temporary bug that's crashing its Lambda. The DriverPay queue backs up. Payments, safety, receipts, analytics, and ratings all process normally.
  • Six months later, Uber adds a carbon footprint calculation feature. One new SQS subscription. No deployments to existing services.

Real Example 4: NHS — Patient Admission Events

A patient is admitted to hospital. Every department system needs to update simultaneously:

Python
# SNS publish — admission service
def admit_patient(patient_id: str, ward: str, doctor: str):
    # 1. Save to database (synchronous)
    db.execute(
        "INSERT INTO admissions (patient_id, ward, doctor, admitted_at) VALUES (%s, %s, %s, NOW())",
        (patient_id, ward, doctor)
    )

    # 2. Publish event (asynchronous — does not block admission)
    sns.publish(
        TopicArn=PATIENT_EVENTS_TOPIC,
        Message=json.dumps({
            'eventType': 'PATIENT_ADMITTED',
            'patientId': patient_id,
            'ward': ward,
            'attendingDoctor': doctor,
            'timestamp': datetime.utcnow().isoformat()
        }),
        MessageAttributes={
            'eventType': {
                'DataType': 'String',
                'StringValue': 'PATIENT_ADMITTED'
            }
        }
    )

    return {'status': 'admitted', 'patientId': patient_id}
SNS Topic: nhs-patient-events
          ↓
SQS subscription filter: eventType = "PATIENT_ADMITTED"
          ↓
┌─────────────┬──────────────┬────────────┬───────────┬──────────────┐
Ward Mgmt    Pharmacy      Catering     Billing    Lab Systems    Audit
(allocate    (prepare      (add to      (open       (create       (GDPR
 bed)        medications)   meal plan)   account)   test order    compliance
                                                    profile)      log)

SNS message filtering: Each subscription can filter messages by attributes. The billing queue only receives PATIENT_ADMITTED and PATIENT_DISCHARGED events — not MEDICATION_ADMINISTERED events. This reduces queue volume and irrelevant processing.

HCL
resource "aws_sns_topic_subscription" "billing" {
  topic_arn = aws_sns_topic.patient_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.billing.arn

  filter_policy = jsonencode({
    eventType = ["PATIENT_ADMITTED", "PATIENT_DISCHARGED", "PROCEDURE_COMPLETED"]
    # billing only sees events that affect the invoice
  })
}

resource "aws_sns_topic_subscription" "lab_systems" {
  topic_arn = aws_sns_topic.patient_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.lab.arn

  filter_policy = jsonencode({
    eventType = ["PATIENT_ADMITTED", "LAB_TEST_ORDERED", "SPECIMEN_COLLECTED"]
  })
}

Without filter policies, every system receives every event and wastes compute discarding irrelevant ones.


EventBridge — Deep Dive

How EventBridge Works

EventBridge is an event bus with routing rules. Events have a structure; rules match on that structure; matched events are routed to targets.

Event Source                Event Bus               Rule + Target
(what happened)        (where events flow)       (who gets what)

AWS services      ──►  default event bus  ──►  Rule: source=aws.s3
Your application  ──►  custom event bus   ──►  Rule: detail-type=OrderPlaced
Partner services  ──►  partner event bus  ──►  Rule: detail.status=FAILED
                              │
                    Rules are AND/OR filters
                    on the event's JSON structure
                              │
                         Targets:
                           Lambda, SQS, SNS, Step Functions,
                           API Gateway, another EventBridge bus,
                           CloudWatch Logs, Kinesis, and 20+ more

The Event Structure

Every EventBridge event has the same envelope:

JSON
{
  "version": "0",
  "id": "uuid-for-this-event",
  "source": "com.mybcat.scheduling",
  "account": "123456789",
  "time": "2026-04-21T14:23:01Z",
  "region": "us-east-1",
  "detail-type": "AppointmentBooked",
  "detail": {
    "appointmentId": "appt_abc123",
    "practiceId": "practice_p001",
    "patientId": "pat_789",
    "slotDate": "2026-04-24",
    "slotTime": "10:00",
    "type": "comprehensive_exam",
    "agentId": "user_sarah"
  }
}

Rules match on any combination of fields:

JSON
{
  "source": ["com.mybcat.scheduling"],
  "detail-type": ["AppointmentBooked"],
  "detail": {
    "type": ["comprehensive_exam", "contact_lens_fitting"],
    "practiceId": [{ "prefix": "practice_p" }]
  }
}

This rule matches only AppointmentBooked events for comprehensive exams or contact lens fittings at any practice starting with "practice_p". High specificity means each rule triggers only relevant Lambdas.
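To build intuition for how patterns evaluate, here is a deliberately tiny local matcher. It is an approximation for illustration only: the real service supports many more operators, and also exposes a test_event_pattern API for checking a pattern against a sample event.

```python
def matches(pattern, event):
    """Toy approximation of EventBridge pattern matching: exact-value
    lists, nested objects, and the {'prefix': ...} operator."""
    for key, cond in pattern.items():
        if key not in event:
            return False
        value = event[key]
        if isinstance(cond, dict):          # nested pattern, e.g. "detail"
            if not isinstance(value, dict) or not matches(cond, value):
                return False
        else:                               # list of allowed alternatives (OR)
            def alt_ok(alt):
                if isinstance(alt, dict) and "prefix" in alt:
                    return isinstance(value, str) and value.startswith(alt["prefix"])
                return alt == value
            if not any(alt_ok(a) for a in cond):
                return False
    return True
```

Keys within a pattern are ANDed; the values inside each list are ORed, which is exactly how the rule above combines its conditions.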


Real Example 5: GitHub CI/CD Pipeline Routing

When code is pushed to GitHub, different actions should happen based on branch name, changed files, and event type. Without EventBridge, you would write a monolithic Lambda with if/elif/else chains for every scenario. With EventBridge, each rule is independent.

GitHub webhook → API Gateway → normalize-event Lambda
                                        ↓
                              Custom EventBridge bus
                                        ↓
┌───────────────────────────────────────────────────────────────────┐
│                                                                   │
│  Rule 1: branch = "main"                                         │
│    Target: SQS → full-test-suite Lambda                          │
│                                                                   │
│  Rule 2: branch = "main" AND files contain "infrastructure/"     │
│    Target: SQS → terraform-plan Lambda                           │
│                                                                   │
│  Rule 3: branch matches "feature/*"                              │
│    Target: SQS → unit-tests-only Lambda                          │
│                                                                   │
│  Rule 4: event-type = "pull_request" AND action = "opened"       │
│    Target: Lambda → post-reviewer-assignment Lambda              │
│                                                                   │
│  Rule 5: tag matches "v*" (release tag)                          │
│    Target: Step Functions → production-deployment workflow        │
│                                                                   │
│  Rule 6: ALL events                                               │
│    Target: CloudWatch Logs → full audit trail                    │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
Python
# normalize-event Lambda — converts a GitHub webhook into an EventBridge event
def handler(event, context):
    github_event = json.loads(event['body'])
    event_type = event['headers'].get('X-GitHub-Event', 'unknown')

    # Build standardised EventBridge event
    eb_event = {
        'Source': 'com.github.webhook',
        'DetailType': f'github.{event_type}',
        'Detail': json.dumps({
            'repository': github_event['repository']['full_name'],
            'branch': github_event.get('ref', '').replace('refs/heads/', ''),
            'pusher': github_event.get('pusher', {}).get('name'),
            'commits': github_event.get('commits', []),
            'changedFiles': extract_changed_files(github_event),
            'tag': extract_tag(github_event.get('ref', ''))
        }),
        'EventBusName': 'mybcat-cicd-events'
    }

    events_client.put_events(Entries=[eb_event])
    return {'statusCode': 200}

Each rule is independently testable, deployable, and maintainable. Adding a new CI step (e.g., security scanning on every PR) adds one new rule — no changes to existing Lambdas.


Real Example 6: Amazon Connect → MyBCAT Post-Call Orchestration

Amazon Connect natively publishes events to EventBridge. When a call ends, the event lands in the default event bus and flows through your routing rules.

JSON
# The event Connect publishes (you do not control this — it is automatic)
{
  "version": "0",
  "source": "aws.connect",
  "detail-type": "Amazon Connect Contact Event",
  "detail": {
    "eventType": "DISCONNECTED",
    "contactId": "call_abc123",
    "channel": "VOICE",
    "instanceArn": "arn:aws:connect:us-east-1:123456789:instance/...",
    "queue": { "name": "SchedulingQueue", "arn": "..." },
    "agent": { "arn": "arn:aws:connect:...", "connectedToAgentTimestamp": "..." },
    "recording": { "location": "s3://mybcat-recordings/2026/04/21/call_abc123.wav" },
    "attributes": {
      "practiceId": "practice_p001",   # ← set during contact flow
      "patientPhone": "+15551234567",
      "callOutcome": "appointment_booked"
    }
  }
}
HCL
# EventBridge rules — routing this event

# Rule 1: ALL voice calls → transcription
resource "aws_cloudwatch_event_rule" "call_transcription" {
  name           = "connect-call-transcription"
  event_bus_name = "default"

  event_pattern = jsonencode({
    source        = ["aws.connect"]
    "detail-type" = ["Amazon Connect Contact Event"]
    detail = {
      eventType = ["DISCONNECTED"]
      channel   = ["VOICE"]
    }
  })
}

resource "aws_cloudwatch_event_target" "transcription" {
  rule           = aws_cloudwatch_event_rule.call_transcription.name
  arn            = aws_sqs_queue.transcription.arn
}

# Rule 2: MISSED calls → missed-call workflow
resource "aws_cloudwatch_event_rule" "missed_calls" {
  name           = "connect-missed-calls"
  event_bus_name = "default"

  event_pattern = jsonencode({
    source        = ["aws.connect"]
    "detail-type" = ["Amazon Connect Contact Event"]
    detail = {
      eventType = ["DISCONNECTED"]
      channel   = ["VOICE"]
      attributes = {
        callOutcome = ["missed", "abandoned", "voicemail"]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "missed_call_handler" {
  rule = aws_cloudwatch_event_rule.missed_calls.name
  arn  = aws_lambda_function.missed_call_workflow.arn
}

# Rule 3: ALL Connect events → audit log (never miss anything)
resource "aws_cloudwatch_event_rule" "connect_audit" {
  name           = "connect-full-audit"
  event_bus_name = "default"

  event_pattern = jsonencode({
    source = ["aws.connect"]
  })
}

resource "aws_cloudwatch_event_target" "connect_audit_log" {
  rule = aws_cloudwatch_event_rule.connect_audit.name
  arn  = aws_cloudwatch_log_group.connect_events.arn
}

Rule 3 catches everything to CloudWatch Logs with zero processing. HIPAA audit trail for every Connect interaction — automatic, comprehensive, costs pennies.


Real Example 7: EventBridge Scheduler — Replacing Cron Jobs

MyBCAT needs weekly practice reports every Monday at 7am ET, daily insurance verification retries at 6am ET, and hourly slot availability cache refresh.

Old approach: EC2 instance running cron — always-on, must be maintained, costs $30/month, fails silently if the instance is down.

New approach: EventBridge Scheduler → Lambda — serverless, zero maintenance, costs cents.

HCL
# Weekly reports — every Monday at 7am ET
resource "aws_scheduler_schedule" "weekly_reports" {
  name = "mybcat-weekly-practice-reports"

  flexible_time_window { mode = "OFF" }

  schedule_expression          = "cron(0 7 ? * MON *)"  # evaluated in the timezone below
  schedule_expression_timezone = "America/New_York"

  target {
    arn      = aws_lambda_function.generate_weekly_reports.arn
    role_arn = aws_iam_role.scheduler_invoke.arn

    input = jsonencode({
      reportType = "weekly_practice_summary"
      includeMetrics = ["callAnswerRate", "noShowRate", "bookingConversionRate"]
    })
  }
}

# Daily insurance retry — 6am ET every day
resource "aws_scheduler_schedule" "insurance_retry" {
  name = "mybcat-daily-insurance-retry"

  flexible_time_window {
    mode                      = "FLEXIBLE"
    maximum_window_in_minutes = 15
  }
  # FLEXIBLE: runs sometime between 6:00am and 6:15am
  # Prevents thousands of scheduled jobs from all firing at exactly the same second
  # across multiple customers/environments

  schedule_expression          = "cron(0 6 * * ? *)"  # evaluated in the timezone below
  schedule_expression_timezone = "America/New_York"

  target {
    arn      = aws_lambda_function.retry_failed_verifications.arn
    role_arn = aws_iam_role.scheduler_invoke.arn
  }
}

FLEXIBLE time window is important. If 30 customers all have a 6am job, without flexibility all 30 Lambda invocations hit at exactly 6:00:00.000. With a 15-minute flexible window, they are distributed across 6:00–6:15. Cold starts are staggered. DynamoDB write spikes are smoothed. The system behaves more predictably.


The Outbox Pattern — Guaranteed Event Delivery

The most critical pattern in event-driven architecture. Without it, events are silently lost during failures.

The Problem

Python
def book_appointment(patient_id, slot_id):
    # Step 1: Update database
    db.execute("UPDATE slots SET status='booked' WHERE id=?", slot_id)
    db.execute("INSERT INTO appointments ...", ...)
    db.commit()

    # Step 2: Publish event
    sqs.send_message(QueueUrl=..., MessageBody=...)

    # What if the process crashes between Step 1 and Step 2?
    # Database: slot is booked ✓
    # SQS: event never published ✗
    # → CRM never updated
    # → Patient never gets SMS confirmation
    # → Analytics never records the booking
    # → No error. No alarm. Silent failure.

The Solution

Python
def book_appointment(patient_id: str, slot_id: str):
    # Single atomic database transaction:
    # Write the booking AND the outbox event together
    with db.transaction():
        # Book the slot
        db.execute("""
            UPDATE slots SET status='booked', patient_id=?
            WHERE id=? AND status='available'
        """, (patient_id, slot_id))

        if db.rowcount == 0:
            raise SlotUnavailableError("Slot already booked")

        # Create appointment record
        appointment_id = str(uuid.uuid4())
        db.execute("""
            INSERT INTO appointments (id, patient_id, slot_id, status, created_at)
            VALUES (?, ?, ?, 'confirmed', NOW())
        """, (appointment_id, patient_id, slot_id))

        # Write outbox record IN THE SAME TRANSACTION
        # If the transaction rolls back, the outbox record is also gone — atomic
        db.execute("""
            INSERT INTO outbox (id, event_type, payload, status, created_at)
            VALUES (?, 'APPOINTMENT_BOOKED', ?, 'PENDING', NOW())
        """, (str(uuid.uuid4()), json.dumps({
            'appointmentId': appointment_id,
            'patientId': patient_id,
            'slotId': slot_id
        })))

# Separate outbox poller — runs every minute via EventBridge Scheduler (its minimum rate)
def process_outbox(event, context):
    # Find all unprocessed outbox rows
    pending = db.execute("""
        SELECT id, event_type, payload
        FROM outbox
        WHERE status = 'PENDING'
        AND created_at < NOW() - INTERVAL 1 SECOND
        ORDER BY created_at ASC
        LIMIT 100
    """).fetchall()

    for row in pending:
        try:
            # Publish to EventBridge
            response = events_client.put_events(Entries=[{
                'Source': 'com.mybcat.scheduling',
                'DetailType': row['event_type'],
                'Detail': row['payload']
            }])
            # put_events returns HTTP 200 even when entries fail — check explicitly
            if response.get('FailedEntryCount', 0) > 0:
                raise RuntimeError(response['Entries'][0].get('ErrorMessage', 'put_events entry failed'))

            # Mark as processed
            db.execute(
                "UPDATE outbox SET status='PROCESSED', processed_at=NOW() WHERE id=?",
                (row['id'],)
            )
        except Exception as e:
            logger.error(f"Failed to publish outbox row {row['id']}: {e}")
            db.execute(
                "UPDATE outbox SET retry_count = retry_count + 1 WHERE id=?",
                (row['id'],)
            )

Why this works:

  1. Database transaction commits: slot booked + outbox row created atomically
  2. Process crashes: transaction rolled back → slot NOT booked, outbox row NOT created → nothing to retry
  3. Database commits but poller crashes before publishing: outbox row still PENDING → poller picks it up on next run
  4. Event published but poller crashes before marking PROCESSED: poller publishes it again on next run → idempotent consumers handle the duplicate

The only way to lose an event is if the database itself is lost — in which case you have bigger problems.
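The atomicity guarantee is easy to demonstrate. Here is a minimal, runnable sketch using an in-memory SQLite database (the table shapes are simplified and the simulated crash is illustrative, not the article's production schema):

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE appointments (id TEXT, patient_id TEXT, status TEXT);
    CREATE TABLE outbox (id TEXT, event_type TEXT, payload TEXT, status TEXT);
""")

def book(patient_id, fail=False):
    # Business write and outbox write share ONE transaction:
    # either both commit or neither does.
    try:
        with db:  # sqlite3 connection as context manager = one transaction
            db.execute("INSERT INTO appointments VALUES (?, ?, 'confirmed')",
                       (str(uuid.uuid4()), patient_id))
            if fail:
                raise RuntimeError("crash before commit")
            db.execute("INSERT INTO outbox VALUES (?, 'APPOINTMENT_BOOKED', ?, 'PENDING')",
                       (str(uuid.uuid4()), json.dumps({"patientId": patient_id})))
    except RuntimeError:
        pass  # crash simulated; the transaction was rolled back

book("p1")             # happy path: appointment AND outbox row exist
book("p2", fail=True)  # crash: NEITHER row exists, nothing half-written

print(db.execute("SELECT COUNT(*) FROM appointments").fetchone()[0])  # 1
print(db.execute("SELECT COUNT(*) FROM outbox WHERE status='PENDING'").fetchone()[0])  # 1
```

The crashed booking leaves no orphaned appointment and no phantom event: the poller only ever sees outbox rows whose business write actually committed.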


All Three Together — The Complete Pattern

Here is a production system that uses SQS, SNS, and EventBridge together correctly:

                    [APPOINTMENT BOOKED]
                   (booking Lambda writes
                    to outbox + database)
                           ↓
              [Outbox Poller Lambda publishes]
                           ↓
                  EventBridge Custom Bus
                           ↓
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  Rule: all bookings  Rule: comprehensive  Rule: all events
        │              exam only            → CloudWatch Logs
        ▼                  ▼                  (audit trail)
    SNS Topic           Lambda
  appointment-events  (schedule
        │              ophthalmology
        │               follow-up)
   ┌────┼────────┐
   ▼    ▼        ▼
  SQS  SQS      SQS
  CRM  Patient  Analytics
  Q    SMS Q    Q
   │    │        │
   λ    λ        λ
 Update Send   Update
 HubSpot confirm daily
        ation  report

Each component's failure is contained.
Every event is guaranteed to be published (outbox).
Every consumer processes at its own pace (SQS).
Routing is content-driven (EventBridge rules).
Fan-out requires no changes to upstream services (SNS).

Azure Equivalents — Same Patterns, Different Names

If you know this on AWS, you know it on Azure:

AWS SQS Standard     = Azure Service Bus Queue (standard)
AWS SQS FIFO         = Azure Service Bus Queue (sessions enabled)
AWS SQS DLQ          = Azure Service Bus Dead-letter subqueue (built-in, no config needed)
AWS SNS              = Azure Service Bus Topics + Subscriptions
AWS SNS filter       = Azure Service Bus subscription filter rules (SQL expressions)
AWS EventBridge      = Azure Event Grid
AWS EventBridge rule = Azure Event Grid subscription with subject/type filter
AWS EventBridge Scheduler = Azure Logic Apps recurrence trigger or Azure Functions timer trigger
AWS Kinesis          = Azure Event Hubs (high-throughput streaming, not queuing)

One important Azure advantage: Azure Service Bus has sessions — a feature that groups related messages and guarantees one consumer processes all messages in a session in order. Useful for: processing all events for one patient in order, processing all messages for one order in sequence. AWS SQS FIFO MessageGroupId achieves similar ordering but with lower throughput.
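A plain-Python model of the guarantee both features provide: messages are partitioned by a group (or session) key, and order is preserved within each group, never across groups. This is illustrative only, not SDK code:

```python
from collections import defaultdict, deque

# Messages tagged with a MessageGroupId (SQS FIFO) / SessionId (Service Bus).
messages = [
    ("patient-1", "ADMITTED"), ("patient-2", "ADMITTED"),
    ("patient-1", "TREATED"),  ("patient-1", "DISCHARGED"),
    ("patient-2", "DISCHARGED"),
]

# The broker partitions by group key; FIFO order holds per group only.
groups = defaultdict(deque)
for group_id, event in messages:
    groups[group_id].append(event)

# One consumer owns one group (session) at a time and drains it in order.
results = {}
for group_id, q in groups.items():
    results[group_id] = [q.popleft() for _ in range(len(q))]
    print(group_id, results[group_id])
# patient-1 ['ADMITTED', 'TREATED', 'DISCHARGED']
# patient-2 ['ADMITTED', 'DISCHARGED']
```

patient-1's three events arrive in order even though patient-2's events were interleaved with them on the wire, which is exactly the property sessions and MessageGroupId buy you.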


What Breaks in Production — Common Failures


Failure 1: Visibility Timeout Too Short

Symptom: Same message processed multiple times
         Duplicate emails sent, duplicate records created

Cause: Lambda takes 45 seconds to process
       Visibility timeout is 30 seconds
       At t=30: message reappears, second Lambda picks it up
       Both Lambdas complete processing

Fix:
  visibility_timeout = max(6 × lambda_timeout, lambda_timeout + 30)
  For Lambda timeout=60s → visibility_timeout=360s minimum
  
  AND make the consumer idempotent:
  Check if this messageId was already processed before doing work
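The sizing rule above can be written down directly; the 6x multiplier matches AWS's published guidance for SQS-triggered Lambdas:

```python
def visibility_timeout(lambda_timeout_s: int) -> int:
    # At least 6x the function timeout, so a message in a batch that is
    # retried by the event source mapping never reappears mid-processing.
    return max(6 * lambda_timeout_s, lambda_timeout_s + 30)

print(visibility_timeout(60))  # 360 — the 6x term dominates
print(visibility_timeout(3))   # 33  — the +30s floor dominates for tiny timeouts
```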

Failure 2: No DLQ — Failed Messages Loop Forever

Symptom: Lambda errors repeatedly
         Queue size never decreases
         CloudWatch shows constant Lambda failures
         Your Lambda bill increases unexpectedly

Cause: Message has malformed data that always causes a parse error
       Without DLQ: SQS retries indefinitely
       Your Lambda is invoked thousands of times for an unprocessable message

Fix:
  Always configure a DLQ with maxReceiveCount=3
  Set a CloudWatch alarm on DLQ depth > 0
  Investigate DLQ messages within 24 hours — they represent real failures
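Configuring this takes a single queue attribute. A sketch of the `RedrivePolicy` payload (the queue name and ARN are hypothetical; the actual `set_queue_attributes` call needs AWS credentials, so it is shown commented out):

```python
import json

# After maxReceiveCount failed receives, SQS moves the message to the DLQ
# instead of redelivering it forever.
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:crm-queue-dlq",  # hypothetical
    "maxReceiveCount": "3",
}
attributes = {"RedrivePolicy": json.dumps(redrive_policy)}

# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(QueueUrl=crm_queue_url, Attributes=attributes)
print(attributes["RedrivePolicy"])
```

Pair this with a CloudWatch alarm on the DLQ's `ApproximateNumberOfMessagesVisible` metric so a non-empty DLQ pages someone.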

Failure 3: SNS → Lambda Without SQS Buffer

Symptom: Events lost during Lambda throttling
         Intermittent missing notifications

Cause: SNS publishes 500 events/second
       Lambda concurrency limit: 100
       400 Lambda invocations fail with throttling error
       SNS retries a few times then drops the message

Fix:
  SNS → SQS Queue → Lambda (NOT SNS → Lambda directly)
  SQS buffers the 500 events
  Lambda drains at its own pace
  Nothing is lost
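A toy model of the difference, exaggerated for clarity (real SNS retries before dropping, but the end state under sustained throttling is the same):

```python
from collections import deque

CONCURRENCY = 2  # illustrative Lambda concurrency limit

def direct_delivery(events):
    # SNS -> Lambda directly: invocations beyond the limit are throttled;
    # once SNS exhausts its retries, those events are simply gone.
    delivered, lost = events[:CONCURRENCY], events[CONCURRENCY:]
    return delivered, lost

def buffered_delivery(events):
    # SNS -> SQS -> Lambda: the queue absorbs the burst and the function
    # drains it over successive polls at its own pace.
    queue, delivered = deque(events), []
    while queue:
        batch = [queue.popleft() for _ in range(min(CONCURRENCY, len(queue)))]
        delivered.extend(batch)
    return delivered

burst = [f"evt-{i}" for i in range(5)]
ok, dropped = direct_delivery(burst)
print(len(dropped))                   # 3 events lost
print(len(buffered_delivery(burst)))  # 5 events delivered, just later
```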

Failure 4: EventBridge Rule Too Broad

Symptom: Lambda runs on every AWS event
         Lambda bill is huge
         Logs are full of irrelevant events

Cause: Rule pattern: { "source": ["aws.connect"] }
       This matches ALL Connect events — agent login, agent logout,
       routing profile change, queue metrics updates, not just call ended

Fix: Be specific:
  {
    "source": ["aws.connect"],
    "detail-type": ["Amazon Connect Contact Event"],
    "detail": {
      "eventType": ["DISCONNECTED"],
      "channel": ["VOICE"]
    }
  }
  Only call disconnect events on the voice channel
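A simplified model of how EventBridge evaluates such a pattern: each pattern key maps to a list of accepted values, matched exactly against the event, and nested objects recurse. This ignores advanced operators like `prefix` and `anything-but`:

```python
def matches(pattern: dict, event: dict) -> bool:
    # Every pattern key must exist in the event; a list means "the event
    # value must equal one of these"; a nested dict is matched recursively.
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

pattern = {
    "source": ["aws.connect"],
    "detail-type": ["Amazon Connect Contact Event"],
    "detail": {"eventType": ["DISCONNECTED"], "channel": ["VOICE"]},
}

call_ended = {"source": "aws.connect",
              "detail-type": "Amazon Connect Contact Event",
              "detail": {"eventType": "DISCONNECTED", "channel": "VOICE"}}
agent_login = {"source": "aws.connect",
               "detail-type": "Amazon Connect Agent Event",
               "detail": {"eventType": "LOGIN"}}

print(matches(pattern, call_ended))   # True  — the rule fires
print(matches(pattern, agent_login))  # False — filtered out, Lambda never runs
```

The broad rule `{"source": ["aws.connect"]}` would match both events; the specific pattern matches only the one you care about.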

Failure 5: Missing Idempotency on Consumers

Symptom: Duplicate emails, double charges, duplicate records

Cause: SQS at-least-once delivery — the same message can be delivered twice
       Your consumer processes it twice

Fix: Check before processing
  def process_message(message_id: str, payload: dict):
      # Check if already processed (using DynamoDB with TTL)
      if dynamodb.get_item(Key={'messageId': message_id}).get('Item'):
          logger.info(f"Duplicate message {message_id} — skipping")
          return

      # Process
      do_the_work(payload)

      # Mark processed with 24h TTL
      dynamodb.put_item(Item={
          'messageId': message_id,
          'processedAt': datetime.utcnow().isoformat(),
          'ttl': int(time.time()) + 86400
      })

The Decision Cheat Sheet

WHICH SERVICE?

One worker must do one job exactly once:
  → SQS Standard (idempotent consumer) or SQS FIFO (built-in deduplication)
  Examples: send one email, charge one card, create one record

Many services react to one event:
  → SNS → multiple SQS queues (fan-out pattern)
  Examples: order placed, patient admitted, call ended

Route events based on their content:
  → EventBridge with specific rules
  Examples: branch=main → deploy, outcome=missed → callback workflow

Connect AWS services to each other automatically:
  → EventBridge (it watches all AWS service events by default)
  Examples: S3 file uploaded → Lambda, RDS snapshot complete → notify

Run code on a schedule:
  → EventBridge Scheduler
  Examples: weekly reports, daily retries, hourly cache refresh

High-throughput ordered streaming (millions/second):
  → Kinesis (AWS) / Event Hubs (Azure)
  Examples: IoT telemetry, clickstream analytics, log aggregation

WHAT TO COMBINE?

Guaranteed delivery → Outbox Pattern (business write + outbox row in one DB transaction)
Fan-out without loss → SNS → SQS (never SNS → Lambda directly)
All three → EventBridge routes → SNS fans out → SQS buffers → Lambda processes

Interview Answer Templates

"What is the difference between SQS, SNS, and EventBridge?"

"SQS is a job queue — one message, one consumer, guaranteed delivery with retry. SNS is a broadcast — one message, all subscribers receive it, for fan-out scenarios. EventBridge is a smart routing bus — events flow through rules that filter and route based on content, connecting AWS services to each other and to custom applications. In production I almost always use all three together: EventBridge routes events from AWS services, SNS fans them out to interested parties, and SQS queues buffer each consumer's workload independently."

"How do you guarantee an event is never lost?"

"The Outbox pattern. Instead of writing to the database and publishing to SQS in two separate operations, I write both the business record and an outbox event row in a single database transaction. A separate poller reads unprocessed outbox rows and publishes them to EventBridge or SQS. If the service crashes between the commit and the publish, the outbox row is still there — the poller picks it up on the next run. The only risk is duplicate delivery, which consumers handle with idempotency keys."

"What happens when your Lambda consumer fails?"

"SQS retries the message based on the visibility timeout — it reappears after the timeout expires and another Lambda invocation picks it up. After maxReceiveCount failures, the message moves to the Dead Letter Queue. I always configure a DLQ and a CloudWatch alarm on DLQ depth greater than zero. A message in the DLQ means there is a real, consistent failure I need to investigate — malformed data, downstream service down, or a bug in the Lambda. Without a DLQ, failed messages retry forever and your Lambda invocation count and cost grow unboundedly."
