Event-Driven Architecture: SQS, SNS, and EventBridge — Complete Guide with Real Examples
Everything about event-driven architecture on AWS and Azure. SQS queues, SNS fan-out, EventBridge routing — how they work, when to use each, how they combine, what breaks in production, and real flows from Amazon, Uber, GitHub, and healthcare platforms.
What Is Event-Driven Architecture?
In a traditional request-response system, Service A calls Service B directly:
Service A ──── HTTP POST ────► Service B
waits...
Service A ◄─── response ───── Service B

Service A blocks until Service B responds. If Service B is slow, Service A is slow. If Service B is down, Service A fails. They are tightly coupled.
In an event-driven system, Service A publishes an event and moves on:
Service A ──── publishes event ────► Message Bus
│
┌─────┴─────┐
▼ ▼
Service B Service C
(processes (processes
when ready)    when ready)

Service A does not know who consumes the event. It does not wait. If Service B is down, the event waits in the queue — nothing is lost. Service C is completely unaware of Service B. Adding a new Service D requires zero changes to A, B, or C.
Real-world analogy: A newspaper publisher. The publisher prints the news (publishes an event) and distributes it. Readers (consumers) read it at their own pace. Adding a new subscriber does not require the publisher to change anything. If one reader is on holiday (service down), their paper waits at the door.
The Three Services — What Each One Is
Before going deep, understand the fundamental role of each:
SQS (Simple Queue Service)
→ Point-to-point: one message, one consumer, processed once
→ "Do this specific task" — a job queue
→ The message is deleted after successful processing
→ Azure equivalent: Service Bus Queues
SNS (Simple Notification Service)
→ Fan-out: one message, many consumers, all receive it simultaneously
→ "Something happened, everyone who cares should know"
→ Message is not stored — subscribers either receive it or miss it
→ Azure equivalent: Service Bus Topics + Subscriptions
EventBridge
→ Smart routing: events flow through rules that filter and route by content
→ "Route this event to different places based on what it contains"
→ Connects AWS services to each other and to custom applications
→ Azure equivalent: Event Grid

They are not alternatives. They solve different problems. In production systems you almost always use all three together.
SQS — Deep Dive
How SQS Works
Producer Queue Consumer
│ │ │
├── send message ──────►│ │
│ │◄── receive messages ────┤
│ │ │ (message invisible
│ │ │ to others for
│ │ │ visibility timeout)
│ │◄── delete message ──────┤ (after processing)
│ │ │
│ ├── (N retries fail) ─────► DLQ

Key concepts:
Visibility Timeout: When a consumer reads a message, it becomes invisible to all other consumers for X seconds. If the consumer processes it and deletes it → done. If the consumer crashes without deleting it → message reappears after the timeout and another consumer picks it up. This is how SQS guarantees at-least-once delivery.
Timeline:
t=0: Consumer A reads message → message invisible for 30s
t=0-30: Consumer A processes the message
t=28: Consumer A deletes the message → done
OR:
t=15: Consumer A crashes
t=30: Visibility timeout expires → message reappears
t=31: Consumer B picks it up and processes it

Dead Letter Queue (DLQ): After N failed processing attempts (maxReceiveCount), the message moves to a separate DLQ. It sits there until an engineer investigates. Without a DLQ, failed messages cycle forever — blocking the queue and retrying endlessly.
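The lifecycle above can be sketched as a toy in-memory queue. This is pure Python with a logical clock, no AWS calls; `ToyQueue` and its method names are invented for illustration, not an AWS API.

```python
class ToyQueue:
    """In-memory sketch of SQS visibility timeout + DLQ redrive."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = []   # visible and in-flight messages
        self.dlq = []        # messages that failed too many times

    def send(self, body):
        self.messages.append({'body': body, 'receive_count': 0, 'invisible_until': 0})

    def receive(self, now):
        for msg in list(self.messages):
            if now < msg['invisible_until']:
                continue     # another consumer is still processing it
            msg['receive_count'] += 1
            if msg['receive_count'] > self.max_receive_count:
                self.messages.remove(msg)   # redrive: too many failed attempts
                self.dlq.append(msg)
                continue
            msg['invisible_until'] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)   # successful processing ends the lifecycle

q = ToyQueue(visibility_timeout=30, max_receive_count=3)
q.send('transcribe call_abc123')
first = q.receive(now=0)               # consumer A picks it up, then crashes
assert q.receive(now=10) is None       # invisible to everyone else meanwhile
assert q.receive(now=31) is not None   # timeout expired, so it is redelivered
```

Run it against the timeline above: the crashed consumer's message reappears at t=30, and a message that keeps failing eventually lands in `dlq` instead of looping forever.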
SQS Standard vs FIFO
| | Standard Queue | FIFO Queue |
|---|---|---|
| Ordering | Best-effort (may be out of order) | Strict first-in-first-out |
| Delivery | At-least-once (may deliver twice) | Exactly-once |
| Throughput | Unlimited | 300 msg/s (3,000 with batching) |
| Cost | Cheaper | 10× more expensive |
| Use when | Order does not matter, idempotency handled in code | Order matters, or exactly-once required |
Use Standard for: sending emails, processing call recordings, generating reports — if the same email is sent twice it is annoying but not catastrophic, handled with idempotency checks.
Use FIFO for: financial transactions, state machine transitions, order processing — if the same charge executes twice you have a problem.
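The two FIFO guarantees can be sketched with a toy model. This is illustration only: real FIFO queues scope MessageDeduplicationId to a 5-minute window rather than forever, and deliver groups to consumers rather than exposing them as lists.

```python
class ToyFifoQueue:
    """Sketch of SQS FIFO semantics: dedup by MessageDeduplicationId,
    strict arrival order within a MessageGroupId."""

    def __init__(self):
        self.groups = {}            # group_id -> bodies, in arrival order
        self.seen_dedup_ids = set()

    def send(self, body, group_id, dedup_id):
        if dedup_id in self.seen_dedup_ids:
            return False            # duplicate send is silently dropped
        self.seen_dedup_ids.add(dedup_id)
        self.groups.setdefault(group_id, []).append(body)
        return True

q = ToyFifoQueue()
q.send('charge #1', group_id='account_42', dedup_id='txn_1')
q.send('charge #1 retry', group_id='account_42', dedup_id='txn_1')  # dropped
q.send('charge #2', group_id='account_42', dedup_id='txn_2')
assert q.groups['account_42'] == ['charge #1', 'charge #2']
```

This is why FIFO fits the financial cases: a retried publish of the same transaction is deduplicated, and charges for one account are processed in order.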
Real Example 1: Amazon Order Processing
When you place an order on Amazon, 12 things need to happen. Some can happen in parallel. Some must happen in a specific order. None of them should block your checkout experience.
You click "Place Order"
↓
Order Service:
1. Saves order to database (synchronous — must complete before "Order confirmed" page)
2. Returns "Order #112-3456789 confirmed!" to your browser
3. Sends 12 messages to SQS queues (async — happens after response)
Queue assignments:
InventoryQueue:
Message: { orderId: "112-3456789", items: [...] }
Consumer Lambda: reserve stock for each item
If stock unavailable → email customer, cancel order item
PaymentQueue:
Message: { orderId: "112-3456789", amount: 89.99, paymentMethod: "card_visa" }
Consumer Lambda: charge the card
If charge fails → inventory release + customer notification (via another queue)
FulfillmentQueue:
Message: { orderId: "112-3456789", warehouseId: "LHR7" }
Consumer Lambda: create pick list for warehouse worker
NotificationQueue:
Message: { orderId: "112-3456789", email: "customer@example.com" }
Consumer Lambda: send confirmation email
AnalyticsQueue:
Message: { orderId: "112-3456789", value: 89.99, category: "electronics" }
Consumer Lambda: update real-time sales dashboard
All 5 queues process in parallel.
If the analytics Lambda is slow: only the analytics queue backs up.
The payment and fulfilment queues are unaffected.
Customer gets their order on time.

What breaks without SQS (tight coupling):
You click "Place Order"
↓
Order Service calls:
→ InventoryService.Reserve() — 50ms
→ PaymentService.Charge() — 800ms (credit card network is slow)
→ FulfillmentService.CreatePickList() — 200ms
→ EmailService.SendConfirmation() — 300ms (email provider is slow)
→ AnalyticsService.Record() — 100ms
Total: 1,450ms before "Order confirmed" page appears
If EmailService is down:
Your order fails with a 500 error
Payment was NOT taken (rolled back)
You have to re-order
Amazon lost the sale because the email service was down

Real Example 2: MyBCAT Post-Call Workflow
A call ends. Six things must happen. None should block the others.
# call_ended Lambda — triggered by Amazon Connect EventBridge event
def handler(event, context):
call_data = event['detail']
# Determine which queues get which payloads
messages = [
{
'QueueUrl': TRANSCRIPTION_QUEUE,
'MessageBody': json.dumps({
'callId': call_data['contactId'],
'recordingKey': call_data['recordingS3Key'],
'practiceId': call_data['practiceId'],
'agentId': call_data['agentId']
}),
'MessageGroupId': call_data['practiceId'], # FIFO — process per practice in order
'MessageDeduplicationId': call_data['contactId'] # exactly once
},
{
'QueueUrl': CRM_UPDATE_QUEUE,
'MessageBody': json.dumps({
'callId': call_data['contactId'],
'patientPhone': call_data['customerEndpoint'],
'duration': call_data['disconnectTimestamp'] - call_data['initiationTimestamp'],
'outcome': call_data['disconnectReason']
})
},
{
'QueueUrl': BILLING_QUEUE,
'MessageBody': json.dumps({
'practiceId': call_data['practiceId'],
'callId': call_data['contactId'],
'duration': call_data['duration'],
'callType': call_data['queue']
})
}
]
# Send all messages simultaneously
for msg in messages:
sqs.send_message(**msg)
return {'statusCode': 200}

# transcription Lambda — processes TranscriptionQueue
def handler(event, context):
for record in event['Records']: # SQS can batch deliver multiple messages
body = json.loads(record['body'])
try:
# Download recording from S3
audio = s3.get_object(
Bucket='mybcat-recordings-prod',
Key=body['recordingKey']
)['Body'].read()
# Transcribe with DeepGram
transcript = deepgram_client.transcribe(audio, language='en-US')
# Save transcript to S3 and update call record
s3.put_object(
Bucket='mybcat-transcripts-prod',
Key=f"transcripts/{body['callId']}.json",
Body=json.dumps(transcript)
)
dynamodb.update_item(
Key={'PK': f"CALL#{body['callId']}"},
UpdateExpression='SET transcriptKey = :key, transcriptStatus = :status',
ExpressionAttributeValues={
':key': f"transcripts/{body['callId']}.json",
':status': 'COMPLETED'
}
)
except Exception as e:
logger.exception(f"Transcription failed for call {body['callId']}")
raise # Re-raise so SQS knows this message failed and should retry
# After maxReceiveCount failures → moves to TranscriptionDLQ

Why raising the exception matters: If your Lambda catches an exception and returns normally, SQS thinks the message was processed successfully and deletes it. The failed transcription is silently lost. Raising the exception (or not catching it) signals failure — SQS keeps the message and retries.
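One refinement worth knowing: raising from inside the batch loop fails the whole batch, so messages that already succeeded get retried too. If the event source mapping enables ReportBatchItemFailures, the handler can instead report only the failed message IDs. A sketch, where `process_one` stands in for the real transcription work:

```python
import json

def process_one(body):
    # placeholder for the real work (download, transcribe, save)
    if body.get('recordingKey') is None:
        raise ValueError('missing recording key')

def handler(event, context):
    failures = []
    for record in event['Records']:
        try:
            process_one(json.loads(record['body']))
        except Exception:
            # only this message is retried; successful ones are deleted
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}

event = {'Records': [
    {'messageId': 'm1', 'body': json.dumps({'recordingKey': 'a.wav'})},
    {'messageId': 'm2', 'body': json.dumps({'recordingKey': None})},
]}
assert handler(event, None) == {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```

This requires `function_response_types = ["ReportBatchItemFailures"]` on the `aws_lambda_event_source_mapping`; without it, the return value is ignored and you are back to all-or-nothing batches.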
SQS Configuration That Matters in Production
resource "aws_sqs_queue" "transcription" {
name = "mybcat-transcription-${var.env}"
visibility_timeout_seconds = 300 # How long consumer has to process
# AWS guidance: at least 6× your Lambda timeout
# (300s here suits a Lambda timeout of up to 50s)
message_retention_seconds = 86400 # Keep messages for 1 day if not processed
receive_wait_time_seconds = 20 # Long polling — waits up to 20s for messages
# Reduces empty receives, saves cost
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.transcription_dlq.arn
maxReceiveCount = 3 # After 3 failures → DLQ
})
}
resource "aws_sqs_queue" "transcription_dlq" {
name = "mybcat-transcription-dlq-${var.env}"
message_retention_seconds = 1209600 # Keep failed messages for 14 days
}
# Alarm: DLQ has messages — something is broken
resource "aws_cloudwatch_metric_alarm" "transcription_dlq" {
alarm_name = "transcription-dlq-not-empty"
comparison_operator = "GreaterThanThreshold"
threshold = 0
evaluation_periods = 1
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 60
statistic = "Sum"
dimensions = { QueueName = aws_sqs_queue.transcription_dlq.name }
alarm_actions = [aws_sns_topic.engineering_alerts.arn]
alarm_description = "Transcription DLQ has failed messages — investigate immediately"
}
# Lambda event source mapping
resource "aws_lambda_event_source_mapping" "transcription" {
event_source_arn = aws_sqs_queue.transcription.arn
function_name = aws_lambda_function.transcribe_call.arn
batch_size = 5 # Process 5 messages per Lambda invocation
maximum_batching_window_in_seconds = 30 # Wait up to 30s to batch 5 messages
}

Visibility timeout trap: If your Lambda timeout is 60 seconds and your visibility timeout is 30 seconds, the following happens:
- Lambda picks up the message (t=0)
- Lambda runs for 35 seconds processing
- Visibility timeout expires at t=30 — message becomes visible again
- A second Lambda picks up the same message (t=31)
- Both Lambdas are now processing the same message simultaneously
Always set visibility_timeout_seconds to at least 6× your Lambda timeout. This leaves headroom for the batching window and for in-flight retries to finish before the message becomes visible to another consumer.
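The rule of thumb can be captured in one helper. A sketch: the 6× factor is AWS's published guidance for Lambda event source mappings, while the 30-second floor is a safety buffer of our own, not an AWS constant.

```python
def recommended_visibility_timeout(lambda_timeout_s: int) -> int:
    """Visibility timeout that comfortably covers one Lambda attempt:
    at least 6x the function timeout, and never less than timeout + 30s."""
    return max(6 * lambda_timeout_s, lambda_timeout_s + 30)

assert recommended_visibility_timeout(60) == 360   # the trap above, fixed
assert recommended_visibility_timeout(3) == 33     # floor kicks in for tiny timeouts
```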
SNS — Deep Dive
How SNS Works
SNS is a pub/sub system. Publishers send to a topic. Subscribers receive from a topic. Every subscriber gets every message.
Publisher
│
└──► SNS Topic
│
┌────┴────┬──────┬──────┐
▼ ▼ ▼ ▼
SQS Q SQS Q Lambda Email
(Consumer (Consumer (direct (human
A) B) invoke) notification)
Every subscriber receives every message independently.
Consumer A being slow does not affect Consumer B.

Subscription types: SQS (most common), Lambda, HTTP endpoint, email, SMS, mobile push.
The golden pattern: SNS → SQS (fan-out)
Never subscribe a Lambda directly to SNS in production for high-volume events. If Lambda is throttled (concurrency limit hit), SNS retries for a while and then drops the message. Subscribe an SQS queue to SNS — the queue buffers messages. Lambda drains the queue at a controlled pace. Nothing is lost.
SNS Topic
│
┌─────────────┼─────────────┐
▼ ▼ ▼
SQS Q1 SQS Q2 SQS Q3
│ │ │
Lambda A Lambda B Lambda C
(transcription) (CRM update) (billing)

Real Example 3: Uber's Trip Events
When a trip status changes, every Uber system needs to know. The trip lifecycle generates dozens of events: requested, driver accepted, driver arrived, trip started, trip ended, payment processed, receipt sent.
Trip event: TRIP_COMPLETED
{
"tripId": "trip_789abc",
"riderId": "rider_123",
"driverId": "driver_456",
"distance": 8.3,
"duration": 1420,
"fare": 22.50,
"currency": "GBP",
"pickupLocation": { "lat": 51.507, "lng": -0.127 },
"dropoffLocation": { "lat": 51.524, "lng": -0.099 }
}
↓
SNS Topic: trip-events
↓
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
SQS SQS SQS SQS SQS SQS
Payment Receipts Analytics RiderRating DriverPay Safety
Queue Queue Queue Queue Queue Queue
Each queue has its own Lambda consumer:
Payment Consumer: → Charge rider's card £22.50
Receipts Consumer: → Generate PDF receipt, email to rider
Analytics Consumer: → Update driver earnings dashboard, city heatmap
RiderRating Consumer: → Send "Rate your trip" push notification (after 5 minutes)
DriverPay Consumer: → Add £22.50 × 0.75 (after commission) to driver's weekly earnings
Safety Consumer: → Log trip route, flag if unusual pattern detected

What makes this powerful:
- Uber's safety team added a new "log trip route for safety analysis" feature. They added one SQS subscription to the SNS topic. Zero changes to the Trip Service, Payment Service, or any other service.
- The DriverPay calculation has a temporary bug that's crashing its Lambda. The DriverPay queue backs up. Payment, safety, receipts, analytics, and ratings all process normally.
- Six months later, Uber adds a carbon footprint calculation feature. One new SQS subscription. No deployments to existing services.
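The fan-out mechanics are worth internalising with a toy model. Pure Python, no AWS calls; `ToyTopic` is invented for illustration.

```python
class ToyTopic:
    """Sketch of SNS fan-out: every subscribed queue gets its own copy."""

    def __init__(self):
        self.subscriptions = []   # (name, list standing in for an SQS queue)

    def subscribe(self, name):
        queue = []
        self.subscriptions.append((name, queue))
        return queue

    def publish(self, message):
        for _, queue in self.subscriptions:
            queue.append(dict(message))   # independent copy per subscriber

topic = ToyTopic()
payment = topic.subscribe('payment')
receipts = topic.subscribe('receipts')
topic.publish({'event': 'TRIP_COMPLETED', 'fare': 22.50})

# Adding a new consumer later needs no change to the publisher:
safety = topic.subscribe('safety')
topic.publish({'event': 'TRIP_COMPLETED', 'fare': 9.10})

assert len(payment) == 2 and len(receipts) == 2
assert len(safety) == 1   # only sees events published after it subscribed
```

The two properties the Uber example relies on fall out directly: subscribers are independent copies (one backed-up queue affects nobody else), and new subscribers never touch the publisher.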
Real Example 4: NHS — Patient Admission Events
A patient is admitted to hospital. Every department system needs to update simultaneously:
# SNS publish — admission service
def admit_patient(patient_id: str, ward: str, doctor: str):
# 1. Save to database (synchronous)
db.execute(
"INSERT INTO admissions (patient_id, ward, doctor, admitted_at) VALUES (%s, %s, %s, NOW())",
(patient_id, ward, doctor)
)
# 2. Publish event (asynchronous — does not block admission)
sns.publish(
TopicArn=PATIENT_EVENTS_TOPIC,
Message=json.dumps({
'eventType': 'PATIENT_ADMITTED',
'patientId': patient_id,
'ward': ward,
'attendingDoctor': doctor,
'timestamp': datetime.utcnow().isoformat()
}),
MessageAttributes={
'eventType': {
'DataType': 'String',
'StringValue': 'PATIENT_ADMITTED'
}
}
)
return {'status': 'admitted', 'patientId': patient_id}

SNS Topic: nhs-patient-events
↓
SQS subscription filter: eventType = "PATIENT_ADMITTED"
↓
┌─────────────┬──────────────┬────────────┬───────────┬──────────────┐
Ward Mgmt Pharmacy Catering Billing Lab Systems Audit
(allocate (prepare (add to (open (create (GDPR
bed) medications) meal plan) account) test order compliance
profile)      log)

SNS message filtering: Each subscription can filter messages by attributes. The billing queue only receives PATIENT_ADMITTED and PATIENT_DISCHARGED events — not MEDICATION_ADMINISTERED events. This reduces queue volume and irrelevant processing.
resource "aws_sns_topic_subscription" "billing" {
topic_arn = aws_sns_topic.patient_events.arn
protocol = "sqs"
endpoint = aws_sqs_queue.billing.arn
filter_policy = jsonencode({
eventType = ["PATIENT_ADMITTED", "PATIENT_DISCHARGED", "PROCEDURE_COMPLETED"]
# billing only sees events that affect the invoice
})
}
resource "aws_sns_topic_subscription" "lab_systems" {
topic_arn = aws_sns_topic.patient_events.arn
protocol = "sqs"
endpoint = aws_sqs_queue.lab.arn
filter_policy = jsonencode({
eventType = ["PATIENT_ADMITTED", "LAB_TEST_ORDERED", "SPECIMEN_COLLECTED"]
})
}

Without filter policies, every system receives every event and wastes compute discarding irrelevant ones.
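A simplified sketch of how a filter policy is evaluated against message attributes. Real SNS policies also support prefix, numeric, and anything-but operators; this covers only the exact-match lists used above.

```python
def matches_filter_policy(policy: dict, attributes: dict) -> bool:
    """Simplified SNS filter-policy check: every key in the policy must be
    present in the message attributes with one of the allowed values."""
    return all(
        key in attributes and attributes[key] in allowed
        for key, allowed in policy.items()
    )

billing_policy = {
    'eventType': ['PATIENT_ADMITTED', 'PATIENT_DISCHARGED', 'PROCEDURE_COMPLETED']
}
assert matches_filter_policy(billing_policy, {'eventType': 'PATIENT_ADMITTED'})
assert not matches_filter_policy(billing_policy, {'eventType': 'MEDICATION_ADMINISTERED'})
```

Note the AND/OR shape: multiple keys in a policy are ANDed, while the values in each list are ORed, which is exactly how the Terraform filter_policy blocks above behave.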
EventBridge — Deep Dive
How EventBridge Works
EventBridge is an event bus with routing rules. Events have a structure; rules match on that structure; matched events are routed to targets.
Event Source Event Bus Rule + Target
(what happened) (where events flow) (who gets what)
AWS services ──► default event bus ──► Rule: source=aws.s3
Your application ──► custom event bus ──► Rule: detail-type=OrderPlaced
Partner services ──► partner event bus ──► Rule: detail.status=FAILED
│
Rules are AND/OR filters
on the event's JSON structure
│
Targets:
Lambda, SQS, SNS, Step Functions,
API Gateway, another EventBridge bus,
CloudWatch Logs, Kinesis, and 20+ more

The Event Structure
Every EventBridge event has the same envelope:
{
"version": "0",
"id": "uuid-for-this-event",
"source": "com.mybcat.scheduling",
"account": "123456789",
"time": "2026-04-21T14:23:01Z",
"region": "us-east-1",
"detail-type": "AppointmentBooked",
"detail": {
"appointmentId": "appt_abc123",
"practiceId": "practice_p001",
"patientId": "pat_789",
"slotDate": "2026-04-24",
"slotTime": "10:00",
"type": "comprehensive_exam",
"agentId": "user_sarah"
}
}

Rules match on any combination of fields:
{
"source": ["com.mybcat.scheduling"],
"detail-type": ["AppointmentBooked"],
"detail": {
"type": ["comprehensive_exam", "contact_lens_fitting"],
"practiceId": [{ "prefix": "practice_p" }]
}
}

This rule matches only AppointmentBooked events for comprehensive exams or contact lens fittings at any practice starting with "practice_p". High specificity means each rule triggers only relevant Lambdas.
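A simplified matcher shows how these patterns are evaluated. This sketch covers only literal lists and the prefix operator; the real service supports more (numeric ranges, anything-but, exists).

```python
def matches_rule(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge pattern matching: nested dicts descend into
    the event, lists are ORed, {"prefix": ...} matches string prefixes.
    Multiple keys at the same level are ANDed."""
    for key, condition in pattern.items():
        if key not in event:
            return False
        value = event[key]
        if isinstance(condition, dict):
            if not (isinstance(value, dict) and matches_rule(condition, value)):
                return False
        else:  # list of literals and/or operator objects
            if not any(
                value.startswith(alt['prefix'])
                if isinstance(alt, dict) and 'prefix' in alt
                else value == alt
                for alt in condition
            ):
                return False
    return True

rule = {
    'source': ['com.mybcat.scheduling'],
    'detail-type': ['AppointmentBooked'],
    'detail': {
        'type': ['comprehensive_exam', 'contact_lens_fitting'],
        'practiceId': [{'prefix': 'practice_p'}],
    },
}
event = {
    'source': 'com.mybcat.scheduling',
    'detail-type': 'AppointmentBooked',
    'detail': {'type': 'comprehensive_exam', 'practiceId': 'practice_p001'},
}
assert matches_rule(rule, event)
event['detail']['type'] = 'routine_exam'
assert not matches_rule(rule, event)   # OR list exhausted -> no match
```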
Real Example 5: GitHub CI/CD Pipeline Routing
When code is pushed to GitHub, different actions should happen based on branch name, changed files, and event type. Without EventBridge, you would write a monolithic Lambda with if/elif/else chains for every scenario. With EventBridge, each rule is independent.
GitHub webhook → API Gateway → normalize-event Lambda
↓
Custom EventBridge bus
↓
┌───────────────────────────────────────────────────────────────────┐
│ │
│ Rule 1: branch = "main" │
│ Target: SQS → full-test-suite Lambda │
│ │
│ Rule 2: branch = "main" AND files contain "infrastructure/" │
│ Target: SQS → terraform-plan Lambda │
│ │
│ Rule 3: branch matches "feature/*" │
│ Target: SQS → unit-tests-only Lambda │
│ │
│ Rule 4: event-type = "pull_request" AND action = "opened" │
│ Target: Lambda → post-reviewer-assignment Lambda │
│ │
│ Rule 5: tag matches "v*" (release tag) │
│ Target: Step Functions → production-deployment workflow │
│ │
│ Rule 6: ALL events │
│ Target: CloudWatch Logs → full audit trail │
│ │
└───────────────────────────────────────────────────────────────────┘

# normalize-event Lambda — converts GitHub webhook to EventBridge event
def handler(event, context):
github_event = json.loads(event['body'])
event_type = event['headers'].get('X-GitHub-Event', 'unknown')
# Build standardised EventBridge event
eb_event = {
'Source': 'com.github.webhook',
'DetailType': f'github.{event_type}',
'Detail': json.dumps({
'repository': github_event['repository']['full_name'],
'branch': github_event.get('ref', '').replace('refs/heads/', ''),
'pusher': github_event.get('pusher', {}).get('name'),
'commits': github_event.get('commits', []),
'changedFiles': extract_changed_files(github_event),
'tag': extract_tag(github_event.get('ref', ''))
}),
'EventBusName': 'mybcat-cicd-events'
}
events_client.put_events(Entries=[eb_event])
return {'statusCode': 200}

Each rule is independently testable, deployable, and maintainable. Adding a new CI step (e.g., security scanning on every PR) adds one new rule — no changes to existing Lambdas.
Real Example 6: Amazon Connect → MyBCAT Post-Call Orchestration
Amazon Connect natively publishes events to EventBridge. When a call ends, the event lands in the default event bus and flows through your routing rules.
# The event Connect publishes (you do not control this — it is automatic)
{
"version": "0",
"source": "aws.connect",
"detail-type": "Amazon Connect Contact Event",
"detail": {
"eventType": "DISCONNECTED",
"contactId": "call_abc123",
"channel": "VOICE",
"instanceArn": "arn:aws:connect:us-east-1:123456789:instance/...",
"queue": { "name": "SchedulingQueue", "arn": "..." },
"agent": { "arn": "arn:aws:connect:...", "connectedToAgentTimestamp": "..." },
"recording": { "location": "s3://mybcat-recordings/2026/04/21/call_abc123.wav" },
"attributes": {
"practiceId": "practice_p001", # ← set during contact flow
"patientPhone": "+15551234567",
"callOutcome": "appointment_booked"
}
}
}

# EventBridge rules — routing this event
# Rule 1: ALL voice calls → transcription
resource "aws_cloudwatch_event_rule" "call_transcription" {
name = "connect-call-transcription"
event_bus_name = "default"
event_pattern = jsonencode({
source = ["aws.connect"]
"detail-type" = ["Amazon Connect Contact Event"]
detail = {
eventType = ["DISCONNECTED"]
channel = ["VOICE"]
}
})
}
resource "aws_cloudwatch_event_target" "transcription" {
rule = aws_cloudwatch_event_rule.call_transcription.name
arn = aws_sqs_queue.transcription.arn
}
# Rule 2: MISSED calls → missed call workflow
resource "aws_cloudwatch_event_rule" "missed_calls" {
name = "connect-missed-calls"
event_bus_name = "default"
event_pattern = jsonencode({
source = ["aws.connect"]
"detail-type" = ["Amazon Connect Contact Event"]
detail = {
eventType = ["DISCONNECTED"]
channel = ["VOICE"]
attributes = {
callOutcome = ["missed", "abandoned", "voicemail"]
}
}
})
}
resource "aws_cloudwatch_event_target" "missed_call_handler" {
rule = aws_cloudwatch_event_rule.missed_calls.name
arn = aws_lambda_function.missed_call_workflow.arn
}
# Rule 3: ALL Connect events → audit log (never miss anything)
resource "aws_cloudwatch_event_rule" "connect_audit" {
name = "connect-full-audit"
event_bus_name = "default"
event_pattern = jsonencode({
source = ["aws.connect"]
})
}
resource "aws_cloudwatch_event_target" "connect_audit_log" {
rule = aws_cloudwatch_event_rule.connect_audit.name
arn = aws_cloudwatch_log_group.connect_events.arn
}

Rule 3 catches everything to CloudWatch Logs with zero processing. HIPAA audit trail for every Connect interaction — automatic, comprehensive, costs pennies.
Real Example 7: EventBridge Scheduler — Replacing Cron Jobs
MyBCAT needs weekly practice reports every Monday at 7am ET, daily insurance verification retries at 6am ET, and hourly slot availability cache refresh.
Old approach: EC2 instance running cron — always-on, must be maintained, costs $30/month, fails silently if the instance is down.
New approach: EventBridge Scheduler → Lambda — serverless, zero maintenance, costs cents.
# Weekly reports — every Monday at 7am ET
resource "aws_scheduler_schedule" "weekly_reports" {
name = "mybcat-weekly-practice-reports"
flexible_time_window { mode = "OFF" }
schedule_expression = "cron(0 7 ? * MON *)" # evaluated in the timezone below
schedule_expression_timezone = "America/New_York" # 7:00 AM ET year-round, DST handled
target {
arn = aws_lambda_function.generate_weekly_reports.arn
role_arn = aws_iam_role.scheduler_invoke.arn
input = jsonencode({
reportType = "weekly_practice_summary"
includeMetrics = ["callAnswerRate", "noShowRate", "bookingConversionRate"]
})
}
}
# Daily insurance retry — 6am ET every day
resource "aws_scheduler_schedule" "insurance_retry" {
name = "mybcat-daily-insurance-retry"
flexible_time_window {
mode = "FLEXIBLE"
maximum_window_in_minutes = 15
}
# FLEXIBLE: runs sometime between 6:00am and 6:15am
# Prevents thousands of scheduled jobs from all firing at exactly the same second
# across multiple customers/environments
schedule_expression = "cron(0 11 * * ? *)" # 11:00 UTC = 6:00 AM EST (7:00 AM during DST)
target {
arn = aws_lambda_function.retry_failed_verifications.arn
role_arn = aws_iam_role.scheduler_invoke.arn
}
}

FLEXIBLE time window is important. If 30 customers all have a 6am job, without flexibility all 30 Lambda invocations hit at exactly 6:00:00.000. With a 15-minute flexible window, they are distributed across 6:00–6:15. Cold starts are staggered. DynamoDB write spikes are smoothed. The system behaves more predictably.
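The staggering effect can be illustrated with a deterministic hash-based offset. This is a sketch of the idea only: the real scheduler picks the invocation time itself, and this is not its algorithm.

```python
import hashlib

def stagger_offset_minutes(customer_id: str, window_minutes: int = 15) -> int:
    """Deterministic per-customer offset inside a flexible window.
    Hashing spreads customers across the window instead of firing
    every job at exactly the same second."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % window_minutes

offsets = {stagger_offset_minutes(f'practice_p{i:03d}') for i in range(30)}
assert all(0 <= o < 15 for o in offsets)
assert len(offsets) > 1   # spread across 6:00-6:15, not all at 6:00:00
```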
The Outbox Pattern — Guaranteed Event Delivery
The most critical pattern in event-driven architecture. Without it, events are silently lost during failures.
The Problem
def book_appointment(patient_id, slot_id):
# Step 1: Update database
db.execute("UPDATE slots SET status='booked' WHERE id=?", slot_id)
db.execute("INSERT INTO appointments ...", ...)
db.commit()
# Step 2: Publish event
sqs.send_message(QueueUrl=..., MessageBody=...)
# What if the process crashes between Step 1 and Step 2?
# Database: slot is booked ✓
# SQS: event never published ✗
# → CRM never updated
# → Patient never gets SMS confirmation
# → Analytics never records the booking
# → No error. No alarm. Silent failure.

The Solution
def book_appointment(patient_id: str, slot_id: str):
# Single atomic database transaction:
# Write the booking AND the outbox event together
with db.transaction():
# Book the slot
db.execute("""
UPDATE slots SET status='booked', patient_id=?
WHERE id=? AND status='available'
""", (patient_id, slot_id))
if db.rowcount == 0:
raise SlotUnavailableError("Slot already booked")
# Create appointment record
appointment_id = str(uuid.uuid4())
db.execute("""
INSERT INTO appointments (id, patient_id, slot_id, status, created_at)
VALUES (?, ?, ?, 'confirmed', NOW())
""", (appointment_id, patient_id, slot_id))
# Write outbox record IN THE SAME TRANSACTION
# If the transaction rolls back, the outbox record is also gone — atomic
db.execute("""
INSERT INTO outbox (id, event_type, payload, status, created_at)
VALUES (?, 'APPOINTMENT_BOOKED', ?, 'PENDING', NOW())
""", (str(uuid.uuid4()), json.dumps({
'appointmentId': appointment_id,
'patientId': patient_id,
'slotId': slot_id
})))
# Separate outbox poller — runs every 5 seconds via EventBridge Scheduler
def process_outbox(event, context):
# Find all unprocessed outbox rows
pending = db.execute("""
SELECT id, event_type, payload
FROM outbox
WHERE status = 'PENDING'
AND created_at < NOW() - INTERVAL 1 SECOND
ORDER BY created_at ASC
LIMIT 100
""").fetchall()
for row in pending:
try:
# Publish to SQS/EventBridge
events_client.put_events(Entries=[{
'Source': 'com.mybcat.scheduling',
'DetailType': row['event_type'],
'Detail': row['payload']
}])
# Mark as processed
db.execute(
"UPDATE outbox SET status='PROCESSED', processed_at=NOW() WHERE id=?",
(row['id'],)
)
except Exception as e:
logger.error(f"Failed to publish outbox row {row['id']}: {e}")
db.execute(
"UPDATE outbox SET retry_count = retry_count + 1 WHERE id=?",
(row['id'],)
)

Why this works:
- Database transaction commits: slot booked + outbox row created atomically
- Process crashes: transaction rolled back → slot NOT booked, outbox row NOT created → nothing to retry
- Database commits but poller crashes before publishing: outbox row still PENDING → poller picks it up on next run
- Event published but poller crashes before marking PROCESSED: poller publishes it again on next run → idempotent consumers handle the duplicate
The only way to lose an event is if the database itself is lost — in which case you have bigger problems.
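The guarantees above can be exercised with a toy in-memory version. Pure Python; `ToyOutbox` and its crash flags are invented for illustration of the failure modes, not a production implementation.

```python
class ToyOutbox:
    """Sketch of the outbox pattern with an in-memory 'database'.
    The business write and the outbox row commit (or roll back) together."""

    def __init__(self):
        self.appointments = []
        self.outbox = []      # rows: {'payload', 'status'}
        self.published = []   # stands in for EventBridge

    def book(self, appointment, crash_before_commit=False):
        # single atomic transaction: both writes survive, or neither does
        if crash_before_commit:
            return   # rollback: no appointment, no outbox row
        self.appointments.append(appointment)
        self.outbox.append({'payload': appointment, 'status': 'PENDING'})

    def poll(self, publisher_up=True):
        for row in self.outbox:
            if row['status'] != 'PENDING':
                continue
            if not publisher_up:
                continue          # publish failed; row stays PENDING for next run
            self.published.append(row['payload'])
            row['status'] = 'PROCESSED'

box = ToyOutbox()
box.book({'id': 'appt_1'}, crash_before_commit=True)
assert box.appointments == [] and box.outbox == []   # nothing half-done

box.book({'id': 'appt_2'})
box.poll(publisher_up=False)                 # poller run fails mid-flight
assert box.outbox[0]['status'] == 'PENDING'  # event not lost, just deferred
box.poll()                                   # next run succeeds
assert box.published == [{'id': 'appt_2'}]
```

Each failure mode from the list maps to one line here: a crash before commit leaves nothing, a failed poll leaves the row PENDING, and only a successful publish flips it to PROCESSED.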
All Three Together — The Complete Pattern
Here is a production system that uses SQS, SNS, and EventBridge together correctly:
[APPOINTMENT BOOKED]
(booking Lambda writes
to outbox + database)
↓
[Outbox Poller Lambda publishes]
↓
EventBridge Custom Bus
↓
┌──────────────────┼──────────────────┐
│ │ │
Rule: all bookings Rule: comprehensive Rule: all events
│ exam only → CloudWatch Logs
▼ ▼ (audit trail)
SNS Topic Lambda
appointment-events (schedule
│ ophthalmology
│ follow-up)
┌────┼────────┐
▼ ▼ ▼
SQS SQS SQS
CRM Patient Analytics
Q SMS Q Q
│ │ │
λ λ λ
Update Send Update
HubSpot confirm daily
ation report
Each component's failure is contained.
Every event is guaranteed to be published (outbox).
Every consumer processes at its own pace (SQS).
Routing is content-driven (EventBridge rules).
Fan-out requires no changes to upstream services (SNS).

Azure Equivalents — Same Patterns, Different Names
If you know this on AWS, you know it on Azure:
AWS SQS Standard = Azure Service Bus Queue (standard)
AWS SQS FIFO = Azure Service Bus Queue (sessions enabled)
AWS SQS DLQ = Azure Service Bus Dead-letter subqueue (built-in, no config needed)
AWS SNS = Azure Service Bus Topics + Subscriptions
AWS SNS filter = Azure Service Bus subscription filter rules (SQL expressions)
AWS EventBridge = Azure Event Grid
AWS EventBridge rule = Azure Event Grid subscription with subject/type filter
AWS EventBridge Scheduler = Azure Functions timer trigger or Logic Apps recurrence (the standalone Azure Scheduler service is retired)
AWS Kinesis = Azure Event Hubs (high-throughput streaming, not queuing)

One important Azure advantage: Azure Service Bus has sessions — a feature that groups related messages and guarantees one consumer processes all messages in a session in order. Useful for: processing all events for one patient in order, processing all messages for one order in sequence. AWS SQS FIFO MessageGroupId achieves similar ordering but with lower throughput.
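The session idea, group related messages by a key and process each group strictly in arrival order, can be sketched as follows. `sessionId` here maps to Service Bus SessionId or SQS FIFO MessageGroupId; the function is an illustration, not either SDK's API.

```python
from collections import defaultdict

def group_into_sessions(messages):
    """Sketch of session-style ordering: all messages for one key are
    handled together, in arrival order, independent of other keys."""
    sessions = defaultdict(list)
    for msg in messages:
        sessions[msg['sessionId']].append(msg['body'])
    return dict(sessions)

messages = [
    {'sessionId': 'patient_A', 'body': 'admitted'},
    {'sessionId': 'patient_B', 'body': 'admitted'},
    {'sessionId': 'patient_A', 'body': 'medicated'},
    {'sessionId': 'patient_A', 'body': 'discharged'},
]
sessions = group_into_sessions(messages)
assert sessions['patient_A'] == ['admitted', 'medicated', 'discharged']
assert sessions['patient_B'] == ['admitted']
```

Interleaved patients arrive on the same queue, but each patient's history is replayed to its consumer in order, which is the property both sessions and MessageGroupId exist to provide.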
What Breaks in Production — Common Failures
Failure 1: Visibility Timeout Too Short
Symptom: Same message processed multiple times
Duplicate emails sent, duplicate records created
Cause: Lambda takes 45 seconds to process
Visibility timeout is 30 seconds
At t=30: message reappears, second Lambda picks it up
Both Lambdas complete processing
Fix:
visibility_timeout = max(6 × lambda_timeout, lambda_timeout + 30)
For Lambda timeout=60s → visibility_timeout=360s minimum
AND make the consumer idempotent:
Check if this messageId was already processed before doing work

Failure 2: No DLQ — Failed Messages Loop Forever
Symptom: Lambda errors repeatedly
Queue size never decreases
CloudWatch shows constant Lambda failures
Your Lambda bill increases unexpectedly
Cause: Message has malformed data that always causes a parse error
Without DLQ: SQS retries indefinitely
Your Lambda is invoked thousands of times for an unprocessable message
Fix:
Always configure a DLQ with maxReceiveCount=3
Set a CloudWatch alarm on DLQ depth > 0
Investigate DLQ messages within 24 hours — they represent real failures

Failure 3: SNS → Lambda Without SQS Buffer
Symptom: Events lost during Lambda throttling
Intermittent missing notifications
Cause: SNS publishes 500 events/second
Lambda concurrency limit: 100
400 Lambda invocations fail with throttling error
SNS retries a few times then drops the message
Fix:
SNS → SQS Queue → Lambda (NOT SNS → Lambda directly)
SQS buffers the 500 events
Lambda drains at its own pace
Nothing is lost

Failure 4: EventBridge Rule Too Broad
Symptom: Lambda runs on every AWS event
Lambda bill is huge
Logs are full of irrelevant events
Cause: Rule pattern: { "source": ["aws.connect"] }
This matches ALL Connect events — agent login, agent logout,
routing profile change, queue metrics updates, not just call ended
Fix: Be specific:
{
"source": ["aws.connect"],
"detail-type": ["Amazon Connect Contact Event"],
"detail": {
"eventType": ["DISCONNECTED"],
"channel": ["VOICE"]
}
}
Only call disconnect events on the voice channel

Failure 5: Missing Idempotency on Consumers
Symptom: Duplicate emails, double charges, duplicate records
Cause: SQS at-least-once delivery — the same message can be delivered twice
Your consumer processes it twice
Fix: Check before processing
def process_message(message_id: str, payload: dict):
    # Atomically claim the message: the conditional write succeeds only once.
    # A separate get-then-put check leaves a race window when the same
    # message is delivered to two consumers concurrently.
    try:
        dynamodb.put_item(
            Item={
                'messageId': message_id,
                'processedAt': datetime.utcnow().isoformat(),
                'ttl': int(time.time()) + 86400  # 24h TTL
            },
            ConditionExpression='attribute_not_exists(messageId)'
        )
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        logger.info(f"Duplicate message {message_id} — skipping")
        return
    # Claimed; now process. (Trade-off: if processing crashes after the
    # claim, the retry is skipped. Store a status field if that matters.)
    do_the_work(payload)

The Decision Cheat Sheet
WHICH SERVICE?
One worker must do one job exactly once:
→ SQS Standard (idempotent consumer) or SQS FIFO (built-in deduplication)
Examples: send one email, charge one card, create one record
Many services react to one event:
→ SNS → multiple SQS queues (fan-out pattern)
Examples: order placed, patient admitted, call ended
Route events based on their content:
→ EventBridge with specific rules
Examples: branch=main → deploy, outcome=missed → callback workflow
Connect AWS services to each other automatically:
→ EventBridge (it watches all AWS service events by default)
Examples: S3 file uploaded → Lambda, RDS snapshot complete → notify
Run code on a schedule:
→ EventBridge Scheduler
Examples: weekly reports, daily retries, hourly cache refresh
High-throughput ordered streaming (millions/second):
→ Kinesis (AWS) / Event Hubs (Azure)
Examples: IoT telemetry, clickstream analytics, log aggregation
WHAT TO COMBINE?
Guaranteed delivery → Outbox Pattern (DB + SQS in one transaction)
Fan-out without loss → SNS → SQS (never SNS → Lambda directly)
All three → EventBridge routes → SNS fans out → SQS buffers → Lambda processes

Interview Answer Templates
"What is the difference between SQS, SNS, and EventBridge?"
"SQS is a job queue — one message, one consumer, guaranteed delivery with retry. SNS is a broadcast — one message, all subscribers receive it, for fan-out scenarios. EventBridge is a smart routing bus — events flow through rules that filter and route based on content, connecting AWS services to each other and to custom applications. In production I almost always use all three together: EventBridge routes events from AWS services, SNS fans them out to interested parties, and SQS queues buffer each consumer's workload independently."
"How do you guarantee an event is never lost?"
"The Outbox pattern. Instead of writing to the database and publishing to SQS in two separate operations, I write both the business record and an outbox event row in a single database transaction. A separate poller reads unprocessed outbox rows and publishes them to EventBridge or SQS. If the service crashes between the commit and the publish, the outbox row is still there — the poller picks it up on the next run. The only risk is duplicate delivery, which consumers handle with idempotency keys."
"What happens when your Lambda consumer fails?"
"SQS retries the message based on the visibility timeout — it reappears after the timeout expires and another Lambda invocation picks it up. After maxReceiveCount failures, the message moves to the Dead Letter Queue. I always configure a DLQ and a CloudWatch alarm on DLQ depth greater than zero. A message in the DLQ means there is a real, consistent failure I need to investigate — malformed data, downstream service down, or a bug in the Lambda. Without a DLQ, failed messages retry forever and your Lambda invocation count and cost grow unboundedly."