Integration Engineering · Intermediate

Event-Driven Architecture

Understand event-driven architecture from first principles: what events are, how EDA decouples systems, event sourcing, CQRS, choreography vs orchestration, schema evolution, and how to design event-driven systems that are observable and operationally sound.

SystemForge · April 18, 2026 · 12 min read
Tags: Event-Driven Architecture, EDA, Event Sourcing, CQRS, Choreography, Kafka

Event-Driven Architecture (EDA) is a paradigm shift: instead of systems calling each other, systems emit facts about what happened and other systems react. This inversion of control is what makes EDA so powerful for building loosely coupled, independently scalable distributed systems. It is also what makes it harder to reason about, trace, and debug. This lesson covers both sides — the power and the pitfalls.


What Is an Event?

An event is an immutable record of something that happened in the past. Three characteristics define it:

  1. Fact, not instruction — an event announces what occurred, not what should happen next. OrderPlaced vs ProcessOrder.
  2. Immutable — once emitted, an event cannot be changed or retracted. The record is permanent.
  3. Past tense — events describe completed actions: OrderShipped, PaymentFailed, UserRegistered.

Event vs. Command vs. Message

These three terms are often confused. They are distinct:

| Concept | Meaning | Direction | Example |
|---------|---------|-----------|---------|
| Command | An instruction to do something | Directed at a specific receiver | PlaceOrder, SendEmail |
| Event | A fact about something that happened | Broadcast to all interested parties | OrderPlaced, EmailSent |
| Message | The envelope that carries a command or event | Varies | Any payload sent over a channel |

Commands have a single intended recipient and can be rejected. Events have no intended recipient and cannot be rejected — the producer does not care who listens.


How EDA Decouples Systems

In a synchronous request/response world, the Order Service must know about and call the Inventory Service, the Billing Service, and the Notification Service:

Order Service ──► Inventory Service
             ──► Billing Service
             ──► Notification Service

The Order Service is coupled to three services. If any one of them is unavailable, the order flow breaks. If a new service needs order data, the Order Service must be modified.

In EDA, the Order Service emits a single event and knows nothing about consumers:

Order Service ──► [order.placed event] ──► Inventory Service
                                       ──► Billing Service
                                       ──► Notification Service
                                       ──► (any future service)

Temporal decoupling: consumers do not need to be running when the event is emitted. The broker holds the event until each consumer is ready to process it.

Spatial decoupling: the producer does not know consumer addresses, deployment locations, or even how many consumers exist.

Behavioural decoupling: adding a new consumer (Analytics Service, Fraud Detection) requires zero changes to the producer.
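To make behavioural decoupling concrete, here is a minimal in-process sketch: a toy bus where consumers subscribe by topic and the producer publishes without knowing who is listening. `EventBus` and the topic name are illustrative, not a real broker API:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Toy broker: consumers subscribe by topic, producers publish by topic."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer never sees this list -- it just publishes the fact.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
received = []

# Adding these consumers requires zero changes to the producer below.
bus.subscribe("order.placed", lambda e: received.append(("inventory", e["orderId"])))
bus.subscribe("order.placed", lambda e: received.append(("billing", e["orderId"])))

bus.publish("order.placed", {"orderId": "ORD-1234"})
print(received)  # → [('inventory', 'ORD-1234'), ('billing', 'ORD-1234')]
```

A third subscriber (say, fraud detection) could be added later with one `subscribe` call and no producer change — exactly the property the diagram above shows.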


Anatomy of a Well-Designed Event

The Event Envelope

Every event should carry a consistent metadata envelope regardless of its type:

JSON
{
  "id": "01HX2K9P3QRSTVW",
  "type": "com.systemforge.order.placed",
  "source": "order-service",
  "specversion": "1.0",
  "time": "2026-04-18T10:30:00.000Z",
  "datacontenttype": "application/json",
  "dataschema": "https://schemas.systemforge.io/order/placed/v2.json",
  "correlationid": "user-session-abc123",
  "data": {
    "orderId": "ORD-1234",
    "customerId": "CUST-99",
    "items": [
      { "sku": "WIDGET-A", "quantity": 2, "unitPrice": 49.99 }
    ],
    "totalAmount": 99.98,
    "currency": "GBP",
    "placedAt": "2026-04-18T10:30:00.000Z"
  }
}

This follows the CloudEvents specification — a CNCF standard that makes events portable across brokers and platforms.
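A producer can wrap domain data in this envelope with a small helper. The field names follow CloudEvents; the helper itself (`make_event`) is a sketch, not a CloudEvents SDK call:

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, source: str, data: dict, correlation_id: str) -> dict:
    """Wrap domain data in a CloudEvents-style envelope (illustrative helper)."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "source": source,
        "specversion": "1.0",
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "correlationid": correlation_id,
        "data": data,
    }

evt = make_event(
    "com.systemforge.order.placed",
    "order-service",
    {"orderId": "ORD-1234", "totalAmount": 99.98, "currency": "GBP"},
    correlation_id="user-session-abc123",
)
print(json.dumps(evt, indent=2))
```

Centralising envelope construction in one helper keeps the metadata consistent across every event type the service emits.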

Event Design Principles

Include enough context to act. A consumer should be able to process the event without making additional API calls. If your OrderPlaced event only includes orderId, every consumer must call the Order Service to get the order details — you have not eliminated coupling, you have just moved it.

Do not include derived or computed data that may go stale. Events represent the state at the moment of occurrence. Do not include data that changes after the event is emitted.

Use specific event types. OrderPlaced and OrderCancelled are better than a generic OrderUpdated with a status field — consumers subscribe to what they care about.

Keep events small. Large events slow down the broker and increase deserialisation cost for consumers that only need a few fields. If an event is large, consider a pointer event pattern: the event contains an ID, and consumers fetch the full record if needed.


Event Schema Evolution

Events are immutable records, but their schemas must evolve over time. Handle this without breaking existing consumers:

Backward-Compatible Changes (safe, no version bump)

  • Add a new optional field to the event data
  • Add a new event type

Breaking Changes (require major version bump)

  • Remove a field
  • Rename a field
  • Change a field's type
  • Change what an event means semantically

Schema Registry

Register all event schemas in a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Azure Schema Registry):

  • Producers validate the event against the schema before publishing
  • Consumers validate the event against the schema before processing
  • Schema evolution is controlled — breaking changes are rejected unless explicitly approved

Avro and Protobuf are preferred over JSON Schema for event schemas in high-throughput systems because they are binary (smaller, faster), self-describing when used with a schema registry, and have robust evolution rules.
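The compatibility rules above can be checked mechanically — this is essentially what a registry does on registration. The sketch below illustrates just the backward-compatibility rule, over a simplified flat schema modelled as a field-to-type map (real Avro/Protobuf schemas are much richer):

```python
def is_backward_compatible(old: dict, new: dict, new_optional: set) -> bool:
    """A new schema is backward compatible if every old field survives
    with the same type, and any added field is optional."""
    for field, ftype in old.items():
        if field not in new or new[field] != ftype:
            return False  # removed or retyped field: breaking change
    for field in new:
        if field not in old and field not in new_optional:
            return False  # new required field: breaking change
    return True

v1 = {"orderId": "string", "totalAmount": "number"}
v2 = {"orderId": "string", "totalAmount": "number", "currency": "string"}

print(is_backward_compatible(v1, v2, new_optional={"currency"}))          # → True
print(is_backward_compatible(v1, {"orderId": "string"}, new_optional=set()))  # → False
```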


Choreography vs. Orchestration

When a business process involves multiple steps across multiple services, you have two coordination styles:

Choreography

Each service listens for events and reacts independently. No central coordinator exists.

OrderPlaced ──► Inventory Service → InventoryReserved
                                 → InventoryUnavailable

InventoryReserved ──► Payment Service → PaymentCharged
                                      → PaymentFailed

PaymentCharged ──► Shipping Service → ShipmentDispatched

Advantages:

  • No single point of failure or bottleneck
  • Services are truly independent — each can be deployed and scaled separately
  • Adding a new step requires no change to existing services

Disadvantages:

  • The overall business flow is implicit — it exists only in the sequence of events, not in any single place
  • Hard to monitor: "where is my order in the process?" requires correlating events across multiple services
  • Error handling is distributed — each service must decide what to do when its step fails

When to use: Long-running, independent workflows where loose coupling is more important than visibility.
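The choreographed flow above can be sketched as handlers that each react to one event and emit the next — notice that no coordinator appears anywhere, and the overall sequence exists only in the chain of subscriptions (the bus and service stubs are illustrative):

```python
from collections import defaultdict

subscribers = defaultdict(list)
log = []  # records the order in which events were published

def publish(event_type, payload):
    log.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)

def on(event_type):
    def register(handler):
        subscribers[event_type].append(handler)
        return handler
    return register

@on("OrderPlaced")
def inventory_service(order):
    # Reserve stock, then announce the fact -- nothing tells us what happens next.
    publish("InventoryReserved", order)

@on("InventoryReserved")
def payment_service(order):
    publish("PaymentCharged", order)

@on("PaymentCharged")
def shipping_service(order):
    publish("ShipmentDispatched", order)

publish("OrderPlaced", {"orderId": "ORD-1234"})
print(log)  # → ['OrderPlaced', 'InventoryReserved', 'PaymentCharged', 'ShipmentDispatched']
```

This is also where the monitoring pain shows up: to answer "where is order ORD-1234?", you must correlate `log` entries rather than inspect one coordinator's state.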


Orchestration

A central orchestrator (process manager, workflow engine) drives the process. It sends commands to services and waits for results.

Order Orchestrator ──► command: ReserveInventory ──► Inventory Service
                   ◄── event: InventoryReserved ◄──

Order Orchestrator ──► command: ChargePayment ──► Payment Service
                   ◄── event: PaymentCharged ◄──

Order Orchestrator ──► command: DispatchShipment ──► Shipping Service
                   ◄── event: ShipmentDispatched ◄──

Advantages:

  • The business flow is explicit and visible in one place
  • Easier to monitor: the orchestrator knows the state of every in-progress order
  • Easier to handle failures: the orchestrator decides the compensation strategy

Disadvantages:

  • The orchestrator is a coupling point — all participating services must integrate with it
  • The orchestrator must be highly available (it becomes a critical path dependency)
  • Changes to the workflow require changing the orchestrator

When to use: Complex workflows where visibility, centralized error handling, and compliance traceability matter more than loose coupling. Common in financial transactions and healthcare workflows.
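By contrast, an orchestrator makes the same flow explicit as a single piece of code. The sketch below drives hypothetical service stubs in order and tracks workflow state centrally; the step names and compensation string are illustrative:

```python
def reserve_inventory(order):  # stand-ins for real service calls
    return True

def charge_payment(order):
    return True

def dispatch_shipment(order):
    return True

def run_order_workflow(order: dict) -> str:
    """Central orchestrator: the whole business flow is readable in one place."""
    steps = [
        ("inventory_reserved", reserve_inventory),
        ("payment_charged", charge_payment),
        ("shipment_dispatched", dispatch_shipment),
    ]
    state = "started"
    for next_state, step in steps:
        if not step(order):
            # Centralised error handling: the orchestrator owns compensation.
            return f"compensating_after_{state}"
        state = next_state
    return state

print(run_order_workflow({"orderId": "ORD-1234"}))  # → shipment_dispatched
```

The trade-off is visible in the code: the flow is easy to read and monitor, but every service is now coupled to this one function.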


Choosing Between Them

| Factor | Choreography | Orchestration |
|--------|-------------|---------------|
| Coupling | Low | Moderate |
| Visibility | Low — distributed across events | High — centralised |
| Error handling | Complex — distributed | Clear — centralised |
| Scalability | High | Limited by orchestrator |
| Auditability | Hard | Easy |
| Best for | High-throughput event pipelines | Business-critical workflows |

In practice, most systems use both: choreography for high-volume, loosely coupled event flows and orchestration for critical business processes that require full traceability.


Event Sourcing

Event sourcing stores state as a sequence of events rather than as a snapshot of the current state.

Traditional persistence: save the current state of the order record.

orders table: { orderId, status: "shipped", ... }

Event sourcing: save every event that occurred to the order.

order_events:
  OrderPlaced    { orderId, items, ... }           ts: 10:30
  PaymentCharged { orderId, amount, ... }           ts: 10:31
  ShipmentBooked { orderId, trackingNumber, ... }   ts: 10:35
  OrderShipped   { orderId, carrier, ... }          ts: 11:02

The current state is derived by replaying all events for an entity.
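Replay is just a left fold over the event list. A minimal sketch, with apply rules invented to mirror the order_events log above:

```python
def apply(state: dict, event: dict) -> dict:
    """Fold one event into the current state (illustrative rules)."""
    etype, data = event["type"], event["data"]
    if etype == "OrderPlaced":
        return {"orderId": data["orderId"], "status": "placed", "items": data["items"]}
    if etype == "PaymentCharged":
        return {**state, "status": "paid"}
    if etype == "OrderShipped":
        return {**state, "status": "shipped", "carrier": data["carrier"]}
    return state  # unknown event types are ignored, never rejected

def replay(events: list) -> dict:
    state = {}
    for event in events:
        state = apply(state, event)
    return state

events = [
    {"type": "OrderPlaced",    "data": {"orderId": "ORD-1234", "items": ["WIDGET-A"]}},
    {"type": "PaymentCharged", "data": {"orderId": "ORD-1234", "amount": 99.98}},
    {"type": "OrderShipped",   "data": {"orderId": "ORD-1234", "carrier": "DPD"}},
]
print(replay(events)["status"])  # → shipped
```

Temporal queries fall out of the same fold: replay only the events with a timestamp before the moment you care about.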

Benefits of Event Sourcing

Complete audit trail — you know not just what the current state is, but every change that led to it and when.

Temporal queries — reconstruct the state of any entity at any point in time by replaying events up to that timestamp.

Event replay — rebuild read models, fix bugs by replaying with corrected logic, or create new projections from historical data.

Natural integration — the event log is the source of truth and the integration feed simultaneously.

Challenges of Event Sourcing

Schema evolution — events are immutable, so historical events must be readable as schemas change. Use upcasters (functions that transform old event formats to current ones on read).

Eventual read model consistency — read models are projections built from the event stream; they may lag behind writes.

Snapshot performance — replaying thousands of events to compute current state is slow. Use periodic snapshots so you only replay events since the last snapshot.

Complexity — event sourcing adds significant architectural complexity. Use it where the audit trail and replay benefits genuinely justify the cost. Do not use it everywhere by default.


CQRS (Command Query Responsibility Segregation)

CQRS is a pattern that pairs naturally with event sourcing. It separates the write model (commands that change state) from the read model (queries that return data).

Write side:
  Command → Aggregate (apply business rules) → Events (persisted to event store)

Read side:
  Events → Projections (rebuild read models optimised for queries)
  Query → Read Model (returns data in the shape the UI or API needs)

Why separate them?

  • Write models are optimised for consistency and business rules
  • Read models are optimised for query performance and can be denormalised for fast retrieval
  • Read models can be destroyed and rebuilt from the event stream at any time
  • You can have multiple read models for different query needs (list view, detail view, analytics)

CQRS without event sourcing is also valid — you can use separate read and write databases with synchronisation via events, without storing full event history.
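The read side is just another consumer of the event stream: a projection folds events into a query-shaped structure and can be thrown away and rebuilt at any time. A sketch, with illustrative event shapes:

```python
def project_order_list(events: list) -> dict:
    """Build a denormalised read model: orderId -> summary row for a list view."""
    rows = {}
    for event in events:
        data = event["data"]
        if event["type"] == "OrderPlaced":
            rows[data["orderId"]] = {"status": "placed", "total": data["total"]}
        elif event["type"] == "OrderShipped":
            rows[data["orderId"]]["status"] = "shipped"
    return rows

events = [
    {"type": "OrderPlaced",  "data": {"orderId": "ORD-1", "total": 99.98}},
    {"type": "OrderPlaced",  "data": {"orderId": "ORD-2", "total": 12.50}},
    {"type": "OrderShipped", "data": {"orderId": "ORD-1"}},
]
read_model = project_order_list(events)
print(read_model["ORD-1"]["status"])  # → shipped
```

A second projection (say, per-customer analytics) would be another function over the same `events` list — multiple read models, one source of truth.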


Designing for Failure in EDA

Event-driven systems have unique failure modes. Design for them explicitly.

Idempotent Consumers

Events can be delivered more than once (at-least-once delivery is the default in most brokers). Every consumer must be idempotent — processing the same event twice must produce the same result as processing it once.

Techniques:

  • Store a set of processed event IDs in a database; reject duplicates
  • Use database upsert (insert-or-update) operations rather than pure inserts
  • Design state transitions to be safe to re-apply (setting status to "shipped" twice has no negative effect)
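The first technique — deduplicating on event ID — looks like this in sketch form (an in-memory set stands in for the durable store a real consumer would use):

```python
processed_ids = set()  # in production: a database table, ideally updated
                       # in the same transaction as the side effect
shipped_orders = []

def handle_order_shipped(event: dict) -> bool:
    """Process the event once; silently skip broker redeliveries."""
    if event["id"] in processed_ids:
        return False  # duplicate delivery, already handled
    shipped_orders.append(event["data"]["orderId"])  # the actual side effect
    processed_ids.add(event["id"])
    return True

event = {"id": "evt-1", "data": {"orderId": "ORD-1234"}}
handle_order_shipped(event)
handle_order_shipped(event)  # redelivered by the broker
print(shipped_orders)  # → ['ORD-1234']  (once, not twice)
```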

Ordering

Most brokers do not guarantee global event ordering across partitions. Within a partition, ordering is guaranteed (Kafka) or configurable (Service Bus sessions).

Design strategy:

  • Include a sequence number or version in the event so consumers can detect out-of-order delivery
  • Route all events for the same entity to the same partition (by entity ID as partition key)
  • Design consumers to handle and buffer out-of-order events rather than assuming ordered delivery
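The buffering strategy can be sketched as a per-entity resequencer: events arriving early are held until the next expected sequence number fills the gap (the `seq` field and class are illustrative):

```python
class Resequencer:
    """Buffer out-of-order events and release them in sequence order."""
    def __init__(self):
        self.expected = 1
        self.buffer = {}     # seq -> event, held until the gap fills
        self.delivered = []  # events released in order

    def receive(self, event: dict) -> None:
        self.buffer[event["seq"]] = event
        while self.expected in self.buffer:  # drain any now-contiguous run
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

r = Resequencer()
r.receive({"seq": 2, "type": "PaymentCharged"})  # arrives early: buffered
r.receive({"seq": 1, "type": "OrderPlaced"})     # fills the gap: both released
print([e["seq"] for e in r.delivered])  # → [1, 2]
```

In practice you would run one such sequence per entity (per `orderId`), which is exactly why routing all of an entity's events to one partition matters.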

Poison Events

A poison event is one that the consumer consistently fails to process — perhaps due to malformed data or a bug in the consumer. Without handling, it blocks all subsequent events in the queue.

Solution:

  • Implement a retry limit — after N attempts, route to the Dead Letter Queue (DLQ)
  • Alert on DLQ depth — any event in the DLQ requires investigation
  • Build a resubmission mechanism — fix the root cause, then replay from DLQ
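A retry-then-DLQ loop in sketch form, with the retry limit and dead-letter routing made explicit (`MAX_ATTEMPTS` and the in-memory queue are illustrative, not a broker feature):

```python
MAX_ATTEMPTS = 3
dead_letter_queue = []

def consume_with_dlq(event: dict, handler) -> bool:
    """Retry a failing handler; after MAX_ATTEMPTS, dead-letter the event."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = str(exc)
    # Poison event: park it with diagnostic context instead of blocking the queue.
    dead_letter_queue.append(
        {"event": event, "error": last_error, "attempts": MAX_ATTEMPTS}
    )
    return False

def broken_handler(event):
    raise ValueError("malformed payload")

consume_with_dlq({"id": "evt-poison"}, broken_handler)
print(len(dead_letter_queue))  # → 1
```

Storing the last error and attempt count alongside the event is what makes the later DLQ investigation and resubmission tractable.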

Backpressure

When consumers cannot keep up with producers, the queue depth grows. Without backpressure, you eventually exhaust broker disk space or memory.

Strategies:

  • Scale consumers horizontally — add more consumer instances
  • Throttle producers — implement feedback to producers to slow emission
  • Prioritise processing — process high-priority events first
  • Alert on queue depth before it becomes critical

Observability in EDA

EDA is harder to observe than synchronous systems because a business transaction spans multiple services and multiple events with no single call stack.

Correlation IDs

Propagate a correlation ID through every event in a business flow. Every service that emits a downstream event copies the correlation ID from the incoming event into the outgoing event.

JSON
{
  "id": "evt-new-unique-id",
  "correlationid": "original-user-request-id",  // propagated unchanged
  "causationid": "evt-id-that-caused-this",       // direct cause
  ...
}

With correlation IDs, you can reconstruct the full sequence of events for any business transaction from your logs.
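Propagation is mechanical: every outgoing event copies `correlationid` from the incoming event and sets `causationid` to the incoming event's own `id`. A sketch (`derive_event` is an illustrative helper):

```python
import uuid

def derive_event(incoming: dict, event_type: str, data: dict) -> dict:
    """Create a downstream event that preserves the correlation chain."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "correlationid": incoming["correlationid"],  # unchanged across the flow
        "causationid": incoming["id"],               # the direct cause
        "data": data,
    }

placed = {"id": "evt-1", "type": "OrderPlaced", "correlationid": "req-abc", "data": {}}
reserved = derive_event(placed, "InventoryReserved", {"orderId": "ORD-1234"})
print(reserved["correlationid"], reserved["causationid"])  # → req-abc evt-1
```

Filtering logs on `correlationid` then yields the whole flow; following `causationid` links reconstructs the exact causal chain within it.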

Distributed Tracing

Use OpenTelemetry to create spans for every event publish and consume operation. Configure your broker clients to propagate trace context in message headers. This gives you flame graphs and waterfall views of event-driven flows in tools like Jaeger, Zipkin, or Azure Monitor.


Lesson Summary

  • Events are immutable facts about the past, not instructions. Producers do not know or care who consumes them.
  • EDA achieves temporal, spatial, and behavioural decoupling — consumers can be offline, anywhere, and added without touching the producer.
  • Well-designed events carry enough context to act, use a consistent envelope (CloudEvents), and are registered in a schema registry.
  • Choreography is loosely coupled but hard to trace; orchestration is visible but centrally coupled — use both where they fit.
  • Event sourcing stores state as events, enabling complete audit trails and replay. It pairs with CQRS to separate optimised write and read models.
  • Design every consumer to be idempotent, handle out-of-order delivery, route poison events to a DLQ, and propagate correlation IDs for observability.

Next: Publish/Subscribe

Enjoyed this article?

Explore the Integration Engineering learning path for more.
