
Messaging Systems

Deep dive into messaging systems: queues, brokers, delivery guarantees, message routing, dead letter queues, backpressure, Kafka vs traditional brokers, and how to choose and operate messaging infrastructure for production integration workloads.

SystemForge · April 18, 2026 · 14 min read
Messaging Systems · Message Brokers · Kafka · RabbitMQ · Azure Service Bus · Queues

Messaging systems are the infrastructure that makes asynchronous integration possible. They store messages while producers and consumers run at different rates, survive transient failures, and scale to millions of messages per second. Understanding how they work under the hood — not just how to use them — is what separates architects who design robust integration systems from those who fight fires in production. This lesson covers the internals, the operational realities, and the decision framework for choosing the right messaging system.


Why Messaging Systems Exist

The alternative to a messaging system is a direct synchronous call:

Producer ──► Consumer (synchronous)

This fails the moment the consumer is unavailable, overloaded, or slow. The producer must either wait (latency impact) or give up (data loss).

A messaging system decouples the producer and consumer in time:

Producer ──► [Broker stores message] ──► Consumer (when ready)

The producer writes and moves on. The consumer reads when it is ready. The broker absorbs the difference in pace and survives transient failures on both sides.

Core value propositions:

  • Temporal decoupling — producer and consumer do not need to be running simultaneously
  • Load levelling — absorb bursts of messages; deliver at a pace the consumer can handle
  • Reliability — messages are not lost if the consumer crashes mid-processing
  • Scalability — add consumers to share the load from a single queue

The Message Broker Architecture

A message broker is the server that receives, stores, routes, and delivers messages. Its core components:

Producer ──► [Broker]
             ├── Exchange / Router (decides where to send)
             ├── Queue / Topic (stores messages)
             └── Delivery engine ──► Consumer

Queues

A queue is a channel that stores messages in order (FIFO) and delivers each message to exactly one consumer.

Key properties to configure per queue:

  • Durability — persist messages to disk (survive broker restart) or keep in memory only
  • Max message size — enforce limits to prevent enormous payloads from degrading performance
  • Message TTL — automatically expire messages that have not been consumed within a time window
  • Max queue depth — the maximum number of unprocessed messages before overflow handling kicks in
  • Dead letter settings — where failed or expired messages are routed

Broker Cluster Architecture

Production brokers run as clusters, not single nodes:

Replication: each queue or partition is replicated to multiple broker nodes. If one node fails, another takes over without message loss.

Leader/follower: one broker node is the leader for each queue/partition (handles writes and reads); others are followers (replicate the leader's state).

Quorum: a write is only acknowledged to the producer once a majority of replicas have stored the message. This prevents data loss even if the leader crashes immediately after acknowledgement.
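The majority rule can be sketched in a few lines. This is a Python illustration of the quorum condition, not any broker's actual implementation:

```python
def write_acknowledged(replicas_stored: int, replica_count: int) -> bool:
    # Quorum rule: ack the producer only once a majority of replicas
    # have durably stored the message.
    majority = replica_count // 2 + 1
    return replicas_stored >= majority

assert write_acknowledged(2, 3)       # 2 of 3 is a majority
assert not write_acknowledged(1, 3)   # leader alone is not enough
```

With 3 replicas, the message survives the loss of any single node the moment the producer sees the ack.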


Delivery Guarantees

The most important characteristic of any messaging system is its delivery guarantee. Understand what you are getting before you depend on it.

At-Most-Once

The broker sends the message once and does not track whether it was received. If the consumer crashes after receiving but before processing, the message is lost.

When to use: high-frequency telemetry, metrics, or logging where occasional loss is acceptable and throughput is more important than reliability.

At-Least-Once

The broker delivers the message and waits for an acknowledgement. If the acknowledgement does not arrive within a timeout (consumer crashed, network failure), the broker redelivers.

Consequence: duplicates are possible. Consumer must be idempotent.

This is the default and correct choice for most integration scenarios.
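A minimal Python sketch of an idempotent consumer (the message shape and the in-memory `processed` set are illustrative; production code would use a durable store such as a database table or Redis set):

```python
processed = set()  # in production: a durable store, not process memory

def handle(message):
    # At-least-once delivery means duplicates can arrive; skip any
    # message whose ID we have already processed (idempotency check).
    if message["id"] in processed:
        return "skipped-duplicate"
    # ... apply the business effect here ...
    processed.add(message["id"])
    return "processed"

assert handle({"id": "msg-1"}) == "processed"
assert handle({"id": "msg-1"}) == "skipped-duplicate"  # redelivery is a no-op
```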

Exactly-Once

Every message is delivered exactly once. Requires coordination between the broker and the consumer's storage system.

Kafka implements this via:

  • Idempotent producer: each message is assigned a sequence number; the broker deduplicates on receive
  • Transactional API: produce and consume operations are wrapped in a distributed transaction that commits or rolls back atomically

Reality check: exactly-once is expensive (higher latency, lower throughput) and is only achievable end-to-end when both the broker and the consumer's downstream writes participate in the same transaction. For most use cases, at-least-once + idempotent consumer is the better trade-off.


Message Acknowledgement

Acknowledgement (ack) is the signal from consumer to broker that a message was successfully processed. Until the broker receives an ack, it considers the message "in-flight" and may redeliver it.

Ack Modes

Auto-ack: the broker marks the message as delivered the moment it is sent to the consumer. No redelivery if the consumer crashes during processing. Use only for at-most-once scenarios.

Manual ack: the consumer explicitly acknowledges after successful processing. The broker redelivers if no ack arrives within the visibility timeout.

Consumer receives message
Consumer processes message
  → success: consumer sends ACK → broker deletes message
  → failure: consumer sends NACK (or acks not received) → broker redelivers
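The flow above can be sketched with a stand-in message object (the `FakeMessage` class and `consume_one` helper are illustrative, not a real broker SDK):

```python
class FakeMessage:
    """Stand-in for a broker message with manual ack/nack."""
    def __init__(self, body):
        self.body = body
        self.status = "in-flight"
    def ack(self):
        self.status = "acked"    # broker would delete the message
    def nack(self):
        self.status = "nacked"   # broker would redeliver or dead-letter

def consume_one(msg, process):
    try:
        process(msg.body)
        msg.ack()
    except Exception:
        msg.nack()

ok = FakeMessage({"orderId": 1})
consume_one(ok, lambda body: None)    # processing succeeds
assert ok.status == "acked"

bad = FakeMessage({"orderId": 2})
consume_one(bad, lambda body: 1 / 0)  # processing raises
assert bad.status == "nacked"
```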

Visibility Timeout / Lock Duration

When a consumer receives a message, the broker locks it for a configurable duration (the visibility timeout or lock duration). The message is hidden from other consumers while locked.

  • If the consumer acks within the timeout: message deleted from queue
  • If the timeout expires before an ack: message becomes visible again and another consumer can pick it up

Set the visibility timeout to at least 2× the expected processing time. If it is too short, messages are redelivered while still being processed — causing duplicates.


Message Routing

Different brokers handle routing differently. Understanding the model determines how you design topics, queues, and consumer topology.

RabbitMQ Exchange Model

RabbitMQ separates routing (exchange) from storage (queue). Producers publish to an exchange; the exchange routes to queues based on binding rules.

Exchange types:

Direct exchange — routes to queues that have a matching binding key:

Producer → exchange (routing key: "order.placed") → queue: order-processor

Topic exchange — routes using wildcard patterns:

"order.*"  → matches order.placed, order.cancelled
"*.placed" → matches order.placed, payment.placed
"#"        → matches everything

Fanout exchange — broadcasts to all bound queues (no routing key):

Producer → fanout exchange → queue A
                          → queue B
                          → queue C

Headers exchange — routes based on message header attributes instead of routing key.

This flexibility makes RabbitMQ powerful for complex routing topologies but adds configuration overhead.
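The wildcard semantics above can be approximated with a regex translation. This is a simplified Python sketch of AMQP topic matching, not RabbitMQ's implementation (edge cases such as `#` adjacent to a dot are not fully handled):

```python
import re

def binding_matches(pattern: str, routing_key: str) -> bool:
    # '*' matches exactly one dot-separated word; '#' matches any words.
    parts = []
    for word in pattern.split("."):
        if word == "*":
            parts.append(r"[^.]+")
        elif word == "#":
            parts.append(r".*")
        else:
            parts.append(re.escape(word))
    return re.fullmatch(r"\.".join(parts), routing_key) is not None

assert binding_matches("order.*", "order.placed")
assert not binding_matches("order.*", "order.placed.eu")  # '*' is one word only
assert binding_matches("*.placed", "payment.placed")
assert binding_matches("#", "order.placed.eu")
```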

Kafka Partition Model

Kafka does not route — producers write to a named topic and messages are distributed across partitions.

topic: orders (4 partitions)
  Partition 0: messages with hash(key) % 4 == 0
  Partition 1: messages with hash(key) % 4 == 1
  Partition 2: messages with hash(key) % 4 == 2
  Partition 3: messages with hash(key) % 4 == 3

Partition key determines which partition a message goes to:

  • Same key → always same partition → ordered processing for that key
  • No key → round-robin across partitions → maximum throughput, no ordering
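The key-to-partition mapping can be sketched as follows. Note that Kafka's default partitioner actually uses a murmur2 hash; CRC32 here is a stand-in to illustrate the `hash(key) % partitions` idea:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: Kafka uses murmur2, but the modulo principle
    # is the same — a stable hash keeps a key on one partition.
    return zlib.crc32(key) % num_partitions

p = partition_for(b"customer-42", 4)
assert p == partition_for(b"customer-42", 4)  # same key → same partition
assert 0 <= p < 4
```

Because the mapping is deterministic, all events for `customer-42` land on one partition and are consumed in order.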

Consumer group — a set of consumers that share the work of consuming a topic. Each partition is assigned to exactly one consumer in the group:

topic: orders (4 partitions)
consumer group: order-processor (3 instances)
  Instance 1 → Partition 0, Partition 1
  Instance 2 → Partition 2
  Instance 3 → Partition 3

Adding consumers to a group rebalances partition assignment — horizontal scaling is trivial.

Azure Service Bus Model

Azure Service Bus uses queues (point-to-point) and topics with subscriptions (pub/sub):

  • Queue: each message delivered to one consumer (competing consumers supported)
  • Topic: each message delivered to all active subscriptions
  • Subscription filter: SQL-like expressions on message properties control which messages a subscription receives

Sessions: messages with the same session ID are delivered in order to a single consumer — enables ordered processing within a partitioned topic.


Backpressure

Backpressure is the mechanism by which a slow consumer signals to the system that it cannot keep up, preventing it from being overwhelmed.

Symptoms of Missing Backpressure

  • Queue depth grows indefinitely until the broker runs out of disk space
  • Consumer memory grows until it crashes trying to process a backlog
  • Producer keeps writing at full speed while the consumer falls further behind

Backpressure Strategies

Queue depth monitoring with autoscaling:
Monitor queue depth. When it exceeds a threshold, scale out consumer instances automatically (Kubernetes KEDA, Azure Container Apps scale rules).

Queue depth > 1000 messages → scale consumers from 2 to 8 instances
Queue depth < 100 messages  → scale consumers back to 2 instances
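The thresholds above form a hysteresis band; a sketch of the decision an autoscaler applies (the numbers mirror the example and are illustrative):

```python
def desired_instances(queue_depth: int, current: int) -> int:
    # Scale out above 1000, scale in below 100; between the two
    # thresholds, leave the count alone to avoid flapping.
    if queue_depth > 1000:
        return 8
    if queue_depth < 100:
        return 2
    return current

assert desired_instances(5000, 2) == 8   # backlog → scale out
assert desired_instances(50, 8) == 2     # drained → scale in
assert desired_instances(500, 8) == 8    # in the band → no change
```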

Rate-limited consumers:
Consumer explicitly limits how many messages it fetches per poll cycle. It processes at a fixed rate regardless of queue depth.

C#
// Pull at most 10 messages per batch
var messages = await receiver.ReceiveMessagesAsync(maxMessages: 10);

Throttling producers:
Producers check queue depth before writing. If depth exceeds a threshold, the producer pauses or slows down. This requires a feedback channel from broker to producer.

Circuit breaker on producers:
If the broker rejects writes (queue full, disk full), the producer opens its circuit breaker and stops writing until the broker recovers.


Dead Letter Queues (DLQ) in Depth

A Dead Letter Queue is where messages go when they cannot be processed. Every production queue needs one.

Reasons Messages Are Dead-Lettered

| Reason | Example |
|--------|---------|
| Max delivery count exceeded | Consumer keeps NACKing after 5 attempts |
| Message TTL expired | Message waited in queue longer than its time-to-live |
| Consumer explicitly dead-letters | Consumer determines message is unprocessable |
| Queue full overflow | Queue depth limit reached, new messages overflow to DLQ |
| Filter evaluation failure | Message cannot be evaluated by subscription filter |

DLQ Message Structure

When a message is dead-lettered, the broker annotates it with diagnostic metadata:

JSON
{
  "originalMessageId": "msg-9876",
  "deadLetterReason": "MaxDeliveryCountExceeded",
  "deadLetterErrorDescription": "NullReferenceException at OrderTransformer.Transform()",
  "deadLetterSource": "order-processor-subscription",
  "enqueuedTime": "2026-04-18T10:30:00Z",
  "deadLetteredAt": "2026-04-18T10:35:22Z",
  "deliveryAttempts": 5,
  "originalPayload": { ... }
}

DLQ Operations Runbook

  1. Alert — PagerDuty / OpsGenie alert when DLQ depth > 0
  2. Triage — read DLQ messages to categorise error types
  3. Root cause — identify whether this is a code bug, data quality issue, or configuration problem
  4. Fix — deploy the fix before touching the DLQ
  5. Resubmit — replay DLQ messages to the original queue in small batches
  6. Monitor — watch for the same messages returning to the DLQ after resubmission
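Step 5 can be sketched with in-memory lists standing in for the queues (an illustrative Python sketch, not a broker SDK call):

```python
def replay_dlq(dlq, main_queue, batch_size=10):
    """Move dead-lettered messages back in small batches so a
    still-broken consumer poisons only a handful at a time."""
    replayed = 0
    while dlq:
        batch = dlq[:batch_size]
        del dlq[:batch_size]
        main_queue.extend(batch)
        replayed += len(batch)
        # In production: pause here and watch the DLQ for returns
        # before releasing the next batch.
    return replayed

dlq = [{"id": f"msg-{i}"} for i in range(25)]
main = []
assert replay_dlq(dlq, main, batch_size=10) == 25
assert dlq == [] and len(main) == 25
```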

Message Serialisation

The format used to serialise message payloads significantly impacts throughput, schema evolution, and interoperability.

JSON

Pros: human-readable, universally supported, easy to debug
Cons: verbose (larger payload size), no schema enforcement at serialisation time, no binary efficiency

Use JSON for: external-facing APIs, low-to-medium throughput, teams with mixed language stacks.

Apache Avro

Pros: compact binary format, schema embedded in registry (not in each message), excellent schema evolution support
Cons: not human-readable, requires schema registry infrastructure

Use Avro for: high-throughput Kafka pipelines where payload size and throughput matter.

Protocol Buffers (Protobuf)

Pros: very compact binary, extremely fast serialisation, strongly typed, good multi-language support
Cons: not human-readable, schema (.proto files) must be distributed to all clients

Use Protobuf for: gRPC-based internal service messaging, latency-sensitive pipelines.

MessagePack

Pros: binary JSON equivalent — same data model as JSON but binary-encoded, typically significantly smaller
Cons: less ecosystem support than JSON or Protobuf

Choosing a Format

External / public API   → JSON
High-throughput Kafka   → Avro (with schema registry)
Internal gRPC services  → Protobuf
Moderate throughput     → JSON or MessagePack
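To make the payload-size trade-off concrete, here is a Python sketch comparing a JSON payload with a fixed binary layout. The `struct` packing stands in for a schema-based format like Avro or Protobuf, where field names live in the schema rather than in every message (the record and field names are illustrative):

```python
import json
import struct

order = {"orderId": 1001, "customerId": 42, "totalCents": 9995}

json_bytes = json.dumps(order).encode("utf-8")
# Three unsigned 32-bit ints: field names are implied by position,
# as they would be by an Avro/Protobuf schema.
binary_bytes = struct.pack("!III", order["orderId"],
                           order["customerId"], order["totalCents"])

print(len(json_bytes), "vs", len(binary_bytes), "bytes")
assert len(binary_bytes) == 12
assert len(binary_bytes) < len(json_bytes)  # binary is a fraction of the size
```

Multiplied across millions of messages per day, that size difference translates directly into broker disk, network, and retention costs.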

Kafka vs. Traditional Message Brokers

Kafka occupies a unique position in the messaging landscape. Understanding how it differs from traditional brokers (RabbitMQ, Azure Service Bus, IBM MQ) determines when to use each.

| Property | Traditional Broker (RabbitMQ, SB) | Apache Kafka |
|----------|-----------------------------------|--------------|
| Model | Queue / Topic with subscriptions | Distributed commit log |
| Message deletion | After acknowledgement | After retention period (time or size) |
| Replay | Not possible after ack | Possible — consumers set their own offset |
| Ordering | Per-queue (FIFO) | Per-partition (by key) |
| Throughput | High (hundreds of thousands/sec) | Very high (millions/sec per cluster) |
| Latency | Low (ms) | Low (ms, slightly higher than SB) |
| Consumer model | Push (broker pushes to consumers) | Pull (consumers poll broker) |
| Routing | Exchange rules, subscription filters | Partition key, consumer groups |
| Best for | Task queues, workflow, ordered processing | Event streaming, audit log, data pipelines |
| Operational complexity | Moderate | Higher (ZooKeeper / KRaft, partition management) |

Use Kafka when:

  • You need to replay historical events
  • You need indefinite retention
  • Throughput requirements exceed what traditional brokers handle comfortably
  • Multiple independent teams will consume the same event stream at different rates

Use a traditional broker when:

  • You need complex routing (RabbitMQ exchange topologies)
  • You need transactional messaging (Azure Service Bus with sessions)
  • Operational simplicity is more important than throughput
  • You are already in an ecosystem that provides it managed (Azure Service Bus, AWS SQS/SNS)

Operational Considerations

Monitoring Metrics

Every messaging system must export these metrics to your monitoring platform:

| Metric | Alert Condition |
|--------|-----------------|
| Queue depth (per queue/subscription) | Sustained growth over 10 minutes |
| Consumer lag (Kafka) | Lag growing over time |
| Message processing rate | Drop > 20% from baseline |
| Error rate (NACKs, failed acks) | > 1% of messages |
| DLQ depth | Any message present |
| Broker disk usage | > 70% |
| Broker CPU / memory | > 80% sustained |
| Replication lag | Any out-of-sync replica |

Capacity Planning

Plan for 3× peak expected throughput. Messaging systems are not graceful under sustained overload — they degrade suddenly once a resource (disk, memory, network) is exhausted.

Retention sizing: for Kafka, calculate storage as:

daily_event_volume × avg_message_size_bytes × replication_factor × retention_days
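A worked example of the formula, with assumed numbers (10M events/day, 1 KB average message, replication factor 3, 7-day retention):

```python
daily_event_volume = 10_000_000
avg_message_size_bytes = 1_000
replication_factor = 3
retention_days = 7

total_bytes = (daily_event_volume * avg_message_size_bytes
               * replication_factor * retention_days)
print(f"{total_bytes / 1e12:.2f} TB")  # 0.21 TB, before compression or broker overhead
```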

Security

  • Encryption in transit: TLS for all producer and consumer connections
  • Authentication: SASL/SCRAM (Kafka), SAS tokens or managed identity (Azure Service Bus), IAM roles (AWS)
  • Authorisation: per-topic ACLs — producers have write-only access; consumers have read-only access
  • Encryption at rest: enable broker-level encryption for messages stored on disk
  • Network isolation: brokers should be inside a private VNet/VPC; never expose broker ports to the public internet

Choosing a Messaging System: Decision Guide

Do you need to replay historical events?
  Yes → Kafka (or Azure Event Hubs, which exposes a Kafka-compatible endpoint)
  No  → continue

Is your primary cloud Azure?
  Yes → Azure Service Bus (for queues/workflows) or Azure Event Hubs (for streaming)
  No  → continue

Is your primary cloud AWS?
  Yes → Amazon SQS (queues) + SNS (fan-out) or Amazon MSK (managed Kafka)
  No  → continue

Do you need complex routing (wildcard, header-based)?
  Yes → RabbitMQ
  No  → continue

Do you need guaranteed ordering per entity?
  Yes → Kafka (partition key) or Azure Service Bus (sessions)
  No  → SQS, RabbitMQ, or your cloud provider's managed queue

Is this on-premises or air-gapped?
  Yes → RabbitMQ or Apache Kafka self-hosted
  No  → prefer managed cloud service

Lesson Summary

  • Messaging systems decouple producers and consumers in time, absorb load bursts, and guarantee message delivery despite transient failures — none of which synchronous HTTP can do.
  • At-least-once delivery is the standard guarantee. Build idempotent consumers rather than relying on broker-level exactly-once semantics.
  • Manual acknowledgement with a visibility timeout is the correct default. Set the timeout to at least 2× expected processing time.
  • Backpressure must be designed in: monitor queue depth, autoscale consumers, and rate-limit consumers so they never accept more work than they can process.
  • Dead letter queues are diagnostic tools, not bins. Alert on DLQ depth, investigate before resubmitting, and fix root causes first.
  • Kafka's commit log model makes it uniquely suited to replay, high-throughput streaming, and multi-consumer pipelines. Traditional brokers are simpler to operate and better for complex routing and transactional workflows.

Course complete. You now have a complete command of the four foundational integration patterns. Apply them in the capstone project: design the communication layer for a distributed order-processing system using each pattern where it fits best.
