System Design · Advanced

Event-Driven Architecture: Deep Dive

Architect-level guide to event-driven systems — event taxonomy, Kafka internals, consumer group mechanics, exactly-once semantics, backpressure, schema evolution, and event sourcing with CQRS. With .NET examples throughout.

SystemForge · April 18, 2026 · 15 min read
Event-Driven Architecture · Kafka · Event Sourcing · CQRS · Async Systems · Messaging · .NET · Architecture

Event-Driven Architecture (EDA) is a design paradigm where system components communicate by producing and consuming events rather than calling each other directly. At small scale it looks like a nice decoupling technique. At system scale it becomes the only viable approach for throughput, resilience, and independent deployability.

This guide treats EDA as an architect would: topology decisions, delivery semantics trade-offs, Kafka internals that matter in production, and the optional (but powerful) extension into event sourcing.


What Is an Event?

An event records something that happened — a fact, immutable and past-tense. This is the conceptual foundation everything else rests on.

Command (intent):   "ProcessPayment"       → may be rejected
Event   (fact):     "PaymentProcessed"     → already happened, cannot be undone
Query   (read):     "GetOrderStatus"       → has no side effects

Getting this taxonomy right shapes the entire system. A queue of commands is a work distribution system. A log of events is a record of history — and that difference unlocks replay, audit, and event sourcing.

Event Taxonomy

| Type | Example | Owned by | Consumed by |
|------|---------|----------|-------------|
| Domain event | OrderPlaced | Domain aggregate | Same bounded context |
| Integration event | OrderConfirmed | Publishing service | Other services |
| Command message | ReserveInventory | Orchestrator | Single target service |
| Notification event | UserEmailVerified | Auth service | Anyone interested |

Domain events stay inside one service. Integration events cross service boundaries and require stable contracts. Keep them separate — they evolve at different rates.
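As a sketch of what that separation can look like in C# (the type and field names here are illustrative, not from any particular codebase):

```csharp
using System;
using System.Collections.Generic;

// Usage: the integration event is a flat snapshot derived from the domain event.
var confirmed = new OrderConfirmedV1(
    EventId: Guid.NewGuid().ToString(),
    OrderId: "ORD-1234",
    Total: 299.99m,
    Currency: "EUR",
    OccurredAt: DateTimeOffset.UtcNow);

// Domain event: stays inside the Ordering bounded context, so it can use
// the context's rich internal types freely.
public sealed record OrderLine(string Sku, int Quantity, decimal UnitPrice);

public sealed record OrderPlaced(
    Guid OrderId,
    Guid CustomerId,
    IReadOnlyList<OrderLine> Lines,
    DateTimeOffset OccurredAt);

// Integration event: crosses service boundaries, so it is a deliberately
// flat, versioned contract -- primitives only, no internal domain types.
public sealed record OrderConfirmedV1(
    string EventId,
    string OrderId,
    decimal Total,
    string Currency,
    DateTimeOffset OccurredAt);
```

The version suffix on the integration event is the point: the internal OrderPlaced can change shape on every deploy, while OrderConfirmedV1 only changes through explicit versioning.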


EDA Topology Patterns

Broker-Mediated (Standard)

Service A ──publish──► [Message Broker] ──deliver──► Service B
                                        ──deliver──► Service C
                                        ──deliver──► Service D

The broker decouples producer and consumers in time: Service A is done immediately, while B/C/D process at their own pace. The broker also provides durability, fan-out, and routing.

Event Streaming (Kafka Model)

Producers ──append──► [Partitioned Log] ◄──pull── Consumer Group A
                                        ◄──pull── Consumer Group B
                                        ◄──pull── Consumer Group C

The log is the source of truth. Consumers maintain their own position (offset) and can replay from any point. Multiple independent consumer groups consume the same log without interfering with each other. This is fundamentally different from traditional messaging — the broker does not know or care whether a message was "consumed."

Choreography vs Orchestration

These are coordination styles, not topology patterns. Both can run on the same broker.

Choreography — services react to events, no coordinator:

OrderService publishes  → OrderPlaced
InventoryService reacts → StockReserved
PaymentService reacts   → PaymentProcessed
ShippingService reacts  → ShipmentScheduled

Pros: loose coupling, services can be added without touching existing ones.
Cons: the business flow is invisible — it only exists as emergent behaviour across services. Debugging requires correlating traces across all services.

Orchestration — a saga coordinator drives the flow:

OrderSaga:
  1. Send ReserveStock  → InventoryService  → await StockReserved / StockFailed
  2. Send ChargePayment → PaymentService    → await PaymentProcessed / PaymentFailed
  3. Send ScheduleShip  → ShippingService   → await ShipmentScheduled

Pros: the business flow is explicit and traceable. Compensations are centrally managed.
Cons: the orchestrator becomes a coupling point — it must know about every participant.

Decision rule: choreography for loosely related cross-domain events; orchestration for multi-step business transactions that require compensation.


Apache Kafka: Internals That Matter

Kafka is not a message queue with extra features. It is a distributed, replicated, partitioned append-only log. Understanding that changes how you design with it.

Partitions: The Unit of Parallelism

A Kafka topic is split into N partitions. Each partition is an independent ordered log stored on disk. Ordering is only guaranteed within a partition — not across partitions.

Topic: order-events  (3 partitions)

Partition 0: [offset 0] [offset 1] [offset 2] [offset 3] ...
Partition 1: [offset 0] [offset 1] [offset 2] ...
Partition 2: [offset 0] [offset 1] ...

Partition key determines which partition a message goes to:

hash(key) % num_partitions = partition index

All messages with the same key go to the same partition — in order. This is how you guarantee ordering for a specific entity (e.g., all events for orderId = ORD-1234).

No key? The producer spreads messages across partitions (round-robin, or sticky batching in newer clients). You get throughput but lose per-entity ordering.
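The key-to-partition mapping can be sketched in a few lines. This is a toy hash for illustration (Kafka's default partitioner actually uses murmur2), but it shows the property everything else depends on:

```csharp
using System;

static int PartitionFor(string key, int numPartitions)
{
    // Any deterministic hash demonstrates the property that matters:
    // the same key always maps to the same partition.
    int hash = 0;
    foreach (char c in key)
        hash = unchecked(hash * 31 + c);
    return (hash & 0x7fffffff) % numPartitions;
}

int p1 = PartitionFor("ORD-1234", 3);
int p2 = PartitionFor("ORD-1234", 3);
// p1 == p2: every event for ORD-1234 lands on the same partition, in order.
// Note: changing numPartitions changes the mapping -- one reason adding
// partitions to a live topic breaks per-entity ordering assumptions.
```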

Consumer Groups: Parallelism with Coordination

A consumer group is a set of consumers that collectively consume a topic. Kafka assigns partitions to consumers — each partition goes to exactly one consumer in the group.

Topic: order-events  (4 partitions)

Consumer Group: inventory-service (3 instances)
  Instance 1 β†’ Partition 0, Partition 1
  Instance 2 β†’ Partition 2
  Instance 3 β†’ Partition 3

Consumer Group: analytics-pipeline (1 instance)
  Instance 1 β†’ Partition 0, 1, 2, 3   (all partitions)

Rules:

  • More consumers than partitions: some consumers sit idle. Scale by adding partitions first.
  • More partitions than consumers: consumers handle multiple partitions.
  • Within a group, each message goes to exactly one consumer — competitive consumption.
  • Across groups, every group receives all messages — independent fan-out.

Adding a consumer to a group triggers a rebalance — all consumers briefly pause while Kafka reassigns partitions. This is a production concern for high-throughput systems (mitigated with static membership and incremental cooperative rebalancing in Kafka 2.4+).
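With Confluent.Kafka, both mitigations are configuration. A sketch (the instance ID and timeout values are illustrative):

```csharp
var config = new ConsumerConfig
{
    BootstrapServers = "kafka:9092",
    GroupId = "inventory-service",
    // Static membership: a restarted instance rejoins under the same identity
    // within the session timeout, so no rebalance is triggered at all.
    GroupInstanceId = "inventory-service-pod-0",
    SessionTimeoutMs = 45000,
    // Cooperative rebalancing: only the partitions actually being moved are
    // revoked, instead of pausing every consumer in the group.
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky
};
```

Static membership pairs with stable pod names (e.g., a Kubernetes StatefulSet) so the instance ID survives restarts.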

Offsets and Consumer Position

Consumers commit their offset after processing. Kafka stores these commits in the internal __consumer_offsets topic.

Partition 0: [0][1][2][3][4][5]
                           ↑
                     committed offset = 4
                     (messages 0-3 processed, 4 is next)

What happens when a consumer fails mid-batch?

  • At-least-once: commit after processing. Duplicate on crash, but no message lost.
  • At-most-once: commit before processing. No duplicate, but message lost on crash.
  • Exactly-once: transactional Kafka API (covered below).

In .NET with Confluent.Kafka:

C#
var config = new ConsumerConfig
{
    BootstrapServers = "kafka:9092",
    GroupId = "inventory-service",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false   // manual commit for at-least-once
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("order-events");

while (!stoppingToken.IsCancellationRequested)
{
    var result = consumer.Consume(stoppingToken);
    try
    {
        await ProcessEventAsync(result.Message.Value);
        consumer.Commit(result);   // commit only after successful processing
    }
    catch (RetryableException)
    {
        // Do not commit. Seek back to the failed offset so the next Consume
        // redelivers it; without the seek, a later commit would silently
        // skip past this message once the loop moves on.
        consumer.Seek(new TopicPartitionOffset(result.TopicPartition, result.Offset));
    }
    catch (PoisonPillException)
    {
        await SendToDeadLetterTopicAsync(result.Message);
        consumer.Commit(result);   // commit to unblock the partition
    }
}

Exactly-Once Semantics (EOS)

Kafka supports exactly-once end-to-end when both producer and consumer participate:

Idempotent producer: Kafka deduplicates retried messages from the same producer instance.

C#
var config = new ProducerConfig
{
    BootstrapServers = "kafka:9092",
    EnableIdempotence = true,       // PID + sequence number based dedup
    Acks = Acks.All,
    MaxInFlight = 5
};

Transactional producer: atomic multi-partition writes. Either all messages in the transaction commit, or none do.

C#
var config = new ProducerConfig
{
    EnableIdempotence = true,
    TransactionalId = "order-service-producer-1"  // stable per logical producer; fences zombie instances across restarts
};

producer.InitTransactions(TimeSpan.FromSeconds(10));
producer.BeginTransaction();
try
{
    producer.Produce("order-events", new Message<string, string>
        { Key = order.Id, Value = JsonSerializer.Serialize(orderPlacedEvent) });
    producer.Produce("audit-log", new Message<string, string>
        { Key = order.Id, Value = auditEntry });
    producer.CommitTransaction();
}
catch
{
    producer.AbortTransaction();
    throw;
}

Read-process-write EOS: when a service consumes from topic A and produces to topic B, use transactions to commit the offset and the output message atomically — either both happen or neither does.
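With Confluent.Kafka this is SendOffsetsToTransaction. A sketch, assuming the consumer has EnableAutoCommit = false, the producer is transactional, and BuildInvoice plus the topic names are hypothetical:

```csharp
var result = consumer.Consume(stoppingToken);

producer.BeginTransaction();
try
{
    producer.Produce("invoice-events", new Message<string, string>
    {
        Key   = result.Message.Key,
        Value = BuildInvoice(result.Message.Value)   // hypothetical helper
    });

    // Commit the consumer offset inside the producer transaction: if the
    // transaction aborts, the offset is not advanced and the input message
    // is consumed and processed again.
    producer.SendOffsetsToTransaction(
        new[] { new TopicPartitionOffset(result.TopicPartition, result.Offset + 1) },
        consumer.ConsumerGroupMetadata,
        TimeSpan.FromSeconds(10));

    producer.CommitTransaction();
}
catch
{
    producer.AbortTransaction();
    throw;
}
```

Downstream consumers should also set IsolationLevel to ReadCommitted, or they will see messages from aborted transactions.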

EOS has a throughput cost (~20-40%). Use it for financial workflows. For analytics pipelines, at-least-once with idempotent consumers is usually the right trade-off.


Async Systems: Design Concerns

Backpressure

When consumers are slower than producers, the gap grows. Without backpressure handling, this leads to memory exhaustion, consumer lag spikes, and eventually data loss or OOM crashes.

Kafka's answer: consumers pull at their own pace. The log retains messages regardless. Consumer lag is visible (and alertable) via kafka-consumer-groups.sh --describe or consumer lag metrics.

Design responses to consumer lag:

  1. Scale consumers (add instances, add partitions)
  2. Batch processing (process N messages per poll, reduce per-message overhead)
  3. Shed load gracefully (dead-letter non-critical events, prioritise critical ones)

In .NET, control batch size:

C#
// Confluent.Kafka consumer config
MaxPollIntervalMs = 300000,   // max time between polls before considered dead
FetchMaxBytes = 52428800,     // 50MB max fetch per request
MaxPartitionFetchBytes = 1048576  // 1MB per partition per fetch

Ordering Guarantees

| Guarantee | How to achieve | Cost |
|-----------|---------------|------|
| Global order | Single partition, single consumer | No parallelism |
| Per-entity order | Partition by entity ID (orderId, userId) | Key skew risk |
| Causal order | Vector clocks or event sequence numbers in payload | Complexity |
| No order guarantee | Random partitioning, any consumer | Maximum throughput |

Most business systems need per-entity ordering. Hash-partition on the natural key and document the ordering contract explicitly.

Key skew: if one entity generates far more events than others (e.g., a mega-tenant), all their events go to one partition — one consumer handles disproportionate load. Mitigate with composite keys (tenantId + entityId) or partition reassignment.
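A composite key can be as simple as concatenation — a sketch with illustrative IDs:

```csharp
// Composite partition key: spreads a hot tenant's events across partitions
// while keeping all events for one entity on one partition.
static string PartitionKey(string tenantId, string entityId)
    => $"{tenantId}:{entityId}";

var k1 = PartitionKey("tenant-42", "ORD-1001");
var k2 = PartitionKey("tenant-42", "ORD-1001");
var k3 = PartitionKey("tenant-42", "ORD-1002");
// k1 == k2: per-order ordering is preserved.
// k1 != k3: different orders from the same tenant can hash to different
// partitions, so the mega-tenant's load spreads out.
```

The trade-off: you lose per-tenant ordering across entities, so only do this when the ordering contract is genuinely per-entity.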

Idempotency: The Non-Negotiable

In any at-least-once system, consumers will receive duplicate messages. Every consumer must be idempotent — processing the same message twice must produce the same result as processing it once.

Patterns:

Natural idempotency: the operation is inherently repeatable (setting a field to a value is idempotent; incrementing a counter is not).
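The distinction in miniature — applying each operation twice, as a duplicate delivery would:

```csharp
string status = "Pending";
int balance = 0;

void ApplyShipped() => status = "Shipped";   // set-to-value: repeatable
void ApplyCredit()  => balance += 100;       // increment: not repeatable

ApplyShipped();
ApplyShipped();   // duplicate delivery: state is still "Shipped" -- harmless

ApplyCredit();
ApplyCredit();    // duplicate delivery: balance is 200 -- credited twice
```

Non-idempotent operations like the increment are exactly the ones that need a deduplication store or version check.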

Deduplication store: persist processed event IDs with a TTL. Check before processing.

C#
public async Task HandleAsync(OrderPlacedEvent evt)
{
    if (await _deduplicationStore.HasProcessedAsync(evt.EventId))
        return;

    await _inventoryService.ReserveAsync(evt.OrderId, evt.Items);
    await _deduplicationStore.MarkProcessedAsync(evt.EventId, TimeSpan.FromDays(7));
}

Database unique constraint: insert with the event ID as a unique key. Duplicate produces a constraint violation you catch and swallow.

Optimistic concurrency: use the event sequence number as expected version. Out-of-order or duplicate updates fail the version check.
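An in-memory sketch of the version check (in a database this is the same conditional update: set state and version = @seq only where version = @seq - 1):

```csharp
using System;
using System.Collections.Generic;

var versions = new Dictionary<string, int>();   // entityId -> last applied sequence

bool TryApply(string entityId, int eventSequence)
{
    int current = versions.GetValueOrDefault(entityId, 0);
    if (eventSequence != current + 1)
        return false;                 // duplicate (<= current) or out of order (gap)
    versions[entityId] = eventSequence;
    return true;
}

bool first     = TryApply("ORD-1", 1);   // true  -- applied
bool duplicate = TryApply("ORD-1", 1);   // false -- already seen, skipped
bool gap       = TryApply("ORD-1", 3);   // false -- seq 2 has not arrived yet
```

This catches both duplicates and reordering with one comparison, which is why sequence numbers in the event payload are worth the small overhead.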

Schema Evolution

Events are long-lived. Services consuming an event schema may be on different versions. Schema changes must be backward compatible (new consumers can read old messages) or forward compatible (old consumers can read new messages).

Safe changes:

  • Add optional fields with defaults
  • Add new event types

Breaking changes (require a migration strategy):

  • Remove fields
  • Rename fields
  • Change field types

Use a schema registry (Confluent Schema Registry, Azure Schema Registry) to enforce compatibility rules:

BACKWARD:  new schema can read data written by old schema  (add optional fields)
FORWARD:   old schema can read data written by new schema  (old consumers ignore new fields)
FULL:      both directions (safest, most restrictive)

In .NET with Avro + Confluent Schema Registry:

C#
var schemaRegistry = new CachedSchemaRegistryClient(new SchemaRegistryConfig
{
    Url = "http://schema-registry:8081"
});

var producer = new ProducerBuilder<string, OrderPlaced>(config)
    .SetValueSerializer(new AvroSerializer<OrderPlaced>(schemaRegistry,
        new AvroSerializerConfig { AutoRegisterSchemas = false }))
    .Build();

Setting AutoRegisterSchemas = false means an incompatible schema change will fail at the producer — not silently corrupt downstream consumers.


Event Sourcing

Event sourcing is an optional but powerful extension of EDA: instead of storing the current state of an entity, you store the sequence of events that led to that state. The current state is derived by replaying the events.

Traditional:
  orders table:  { id: ORD-1, status: "Shipped", total: 299.99, ... }

Event Sourced:
  order_events:
    { seq: 1, type: "OrderPlaced",    data: { items: [...], total: 299.99 } }
    { seq: 2, type: "PaymentTaken",   data: { amount: 299.99 } }
    { seq: 3, type: "ItemsPicked",    data: { warehouseId: "WH-3" } }
    { seq: 4, type: "OrderShipped",   data: { trackingId: "TRK-99" } }

When to Use Event Sourcing

Use it when:

  • Audit log is a hard requirement (financial, healthcare, compliance) — the event store is the audit log
  • Temporal queries are needed — "what was the state of this order at 14:32?"
  • Event-driven integration is already in place — the events you source are the same events you publish
  • Business rules evolve — you can replay history against new business logic

Do not use it when:

  • Simple CRUD with no business logic
  • Team has no prior exposure (learning curve is significant)
  • Read patterns dominate and projections would be complex

Aggregate and Event Store

An aggregate is the unit of consistency. All events for one aggregate are stored together, in sequence. The aggregate version (last sequence number) is used for optimistic concurrency.

C#
public class Order : AggregateRoot
{
    private List<OrderItem> _items = new();
    public OrderStatus Status { get; private set; }
    public Money Total { get; private set; }

    public static Order Place(OrderId id, List<OrderItem> items, CustomerId customerId)
    {
        var order = new Order();
        order.Apply(new OrderPlaced(id, customerId, items, DateTimeOffset.UtcNow));
        return order;
    }

    public void Ship(TrackingId trackingId)
    {
        if (Status != OrderStatus.Paid)
            throw new InvalidOperationException("Order must be paid before shipping.");
        Apply(new OrderShipped(Id, trackingId, DateTimeOffset.UtcNow));
    }

    private void Apply(OrderPlaced evt)
    {
        Id = evt.OrderId;
        _items = evt.Items.ToList();
        Total = evt.Total;
        Status = OrderStatus.Pending;
        AddDomainEvent(evt);   // queue for publishing after save
    }

    private void Apply(OrderShipped evt)
    {
        Status = OrderStatus.Shipped;
        AddDomainEvent(evt);
    }
}

Event store append with optimistic concurrency:

C#
public async Task SaveAsync(AggregateRoot aggregate, int expectedVersion)
{
    var events = aggregate.GetPendingEvents();

    // The conditional INSERT only writes a row when the stored MAX(version)
    // still equals expectedVersion. Zero affected rows means another writer
    // got there first -- surface that as a concurrency conflict.
    foreach (var evt in events)
    {
        var rows = await _connection.ExecuteAsync(@"
            INSERT INTO event_store (aggregate_id, version, event_type, payload, occurred_at)
            SELECT @AggregateId, @ExpectedVersion + 1, @EventType, @Payload, @OccurredAt
            WHERE (SELECT COALESCE(MAX(version), 0) FROM event_store
                   WHERE aggregate_id = @AggregateId) = @ExpectedVersion",
            new { AggregateId = aggregate.Id.ToString(), ExpectedVersion = expectedVersion,
                  EventType = evt.GetType().Name, Payload = Serialize(evt),
                  OccurredAt = DateTimeOffset.UtcNow });

        if (rows == 0)
            throw new ConcurrencyException(
                $"Aggregate {aggregate.Id} was modified concurrently (expected version {expectedVersion}).");

        expectedVersion++;
    }
}

Projections (Read Models)

The event store is write-optimised and terrible for queries. Projections subscribe to the event stream and build denormalised read models optimised for specific query patterns.

C#
public class OrderSummaryProjection : IEventHandler<OrderPlaced>,
                                       IEventHandler<OrderShipped>
{
    private readonly IOrderSummaryRepository _repo;

    public async Task HandleAsync(OrderPlaced evt)
    {
        await _repo.UpsertAsync(new OrderSummary
        {
            OrderId    = evt.OrderId,
            CustomerId = evt.CustomerId,
            Total      = evt.Total,
            Status     = "Pending",
            PlacedAt   = evt.OccurredAt
        });
    }

    public async Task HandleAsync(OrderShipped evt)
    {
        await _repo.UpdateStatusAsync(evt.OrderId, "Shipped", evt.TrackingId);
    }
}

Projections can be rebuilt at any time by replaying the event stream. This is the "time machine" capability β€” rebuild a projection with different logic, or create a new read model without touching historical data.
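Stripped of infrastructure, a rebuild is just: reset the read model, then replay every stored event through the same handler the live projection uses. A self-contained sketch with an illustrative in-memory model:

```csharp
using System;
using System.Collections.Generic;

var summaries = new Dictionary<string, string>();   // orderId -> status

void Project(string orderId, string eventType)
{
    summaries[orderId] = eventType switch
    {
        "OrderPlaced"  => "Pending",
        "OrderShipped" => "Shipped",
        _              => summaries.GetValueOrDefault(orderId, "Unknown")
    };
}

// The stored stream (illustrative):
var stream = new[] { ("ORD-1", "OrderPlaced"), ("ORD-1", "OrderShipped") };

summaries.Clear();                     // drop the old read model
foreach (var (id, type) in stream)     // replay from offset 0
    Project(id, type);
// summaries["ORD-1"] ends up as "Shipped", same as the live projection.
```

In production the replay reads from the event store or a compacted Kafka topic, and the new projection is built alongside the old one before traffic is switched over.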

Event Sourcing + CQRS

Event sourcing and CQRS (Command Query Responsibility Segregation) are distinct patterns that compose naturally:

Command side:                    Query side:
  Command handler                  Read from projections
    β†’ Load aggregate (replay)         (SQL, Redis, Elasticsearch)
    β†’ Apply business logic               ↑
    β†’ Append events to store        Projection handlers
         β”‚                              subscribe to event stream
         └────► publish to Kafka ──────►

CQRS without event sourcing is viable (and simpler). Event sourcing without CQRS is painful — the event store is a poor query target. Together they form a consistent architecture for complex domains.


Production Observability for EDA

An async system that is opaque is dangerous. You need three things:

1. Distributed Tracing with Correlation IDs

Propagate a correlationId through every event. Every service that handles the event logs it, and traces can be reconstructed across service boundaries.

C#
public class OrderPlaced
{
    public string EventId      { get; init; } = Guid.NewGuid().ToString();
    public string CorrelationId { get; init; }  // propagated from the originating request
    public string CausationId  { get; init; }   // the event that caused this one
    // ... business fields
}
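Propagation then works like this: each reaction copies the flow's CorrelationId and sets CausationId to the EventId of the event it handled. A sketch using an illustrative envelope type:

```csharp
using System;

// The first event in a flow: correlation starts at the incoming request.
var placed = new EventEnvelope { CorrelationId = "req-7f3a", CausationId = "req-7f3a" };

// A downstream reaction: same correlation ID across the whole flow,
// causation ID pointing at the exact event that triggered it.
var reserved = new EventEnvelope
{
    CorrelationId = placed.CorrelationId,
    CausationId   = placed.EventId
};

public record EventEnvelope
{
    public string EventId       { get; init; } = Guid.NewGuid().ToString();
    public string CorrelationId { get; init; } = "";
    public string CausationId   { get; init; } = "";
}
```

CorrelationId answers "which request does this belong to?"; the CausationId chain answers "what exact sequence of events led here?" — both are needed to reconstruct a choreographed flow after the fact.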

2. Consumer Lag Monitoring

Consumer lag (how far behind a consumer group is from the latest offset) is the primary health indicator for event-driven systems. Alert before it becomes a problem.

Key metrics:

  • kafka_consumer_lag per group per partition
  • kafka_consumer_lag_max (worst partition in the group)
  • Processing time per message (latency SLA)

3. Dead Letter Topic Monitoring

Any message reaching the DLT (dead letter topic) represents a processing failure that requires investigation. Alert on DLT message count > 0. Never let DLTs silently fill up.


Architecture Decision Map

| Decision | When to choose A | When to choose B |
|----------|-----------------|-----------------|
| Kafka vs AMQP broker (A=Kafka, B=RabbitMQ/Azure SB) | Replay, multiple independent consumers, high throughput, audit | Complex routing, per-message TTL, simple request/reply |
| Choreography vs Orchestration | Cross-domain notifications, independent reactions | Multi-step business transactions, compensation logic |
| At-least-once vs EOS | Idempotent consumers, analytics, pipeline | Financial transactions, billing, inventory mutation |
| Event Sourcing vs State | Audit requirement, temporal queries, complex domain | CRUD, reporting-heavy, small team |
| Schema Registry vs JSON | Multiple teams, long-lived topics, schema evolution risk | Internal single-service topics, prototyping |


Key Principles

Events are facts, not commands. If the consumer cannot reject it, model it as an event. If it might not happen, model it as a command.

Partition on the key that drives ordering. Get this wrong early and it is expensive to change — partition count changes require manual reassignment.

Design for at-least-once from day one. Idempotency is not an afterthought. Every consumer operation must be safe to repeat.

Consumer lag is your system's pulse. An unmonitored lag spike is a silent incident.

Event sourcing earns its complexity. Use it for the domains where audit, replay, and temporal query are genuine requirements — not as a default architecture.


Related: Distributed Systems Patterns — Saga, Outbox, Circuit Breaker
Related: Messaging Systems Deep Dive — Queues, Topics, JMS, Kafka basics
