Lecture 5: System Integration Monitoring
Learn how to monitor live integration systems: observability pillars, key metrics, structured logging, alerting strategies, SLA management, DLQ operations, and building dashboards that make integration health visible.
An integration that nobody is watching is an integration that will eventually fail silently. Messages stop flowing, data falls out of sync, and business processes stall — all before anyone notices. Monitoring is not optional; it is a core responsibility of anyone who runs integration systems in production. This lecture covers how to build the monitoring capability that lets you know about problems before users do.
Why Integration Monitoring Is Different
Monitoring application servers is relatively straightforward — if the server is up and responding to health checks, it is working. Integration monitoring is harder:
- Failures can be silent. An integration might be running but processing zero messages because the upstream system stopped sending. The integration infrastructure is healthy; the business process is broken.
- Failures can be partial. 95% of messages process correctly; 5% fail silently and go nowhere. From the infrastructure perspective, the integration is running fine.
- Failures span systems. A message published by System A might fail in the integration layer, or succeed there but fail in System B. Tracking the full journey requires correlation across multiple logs.
- Time matters. A 10-minute delay in a file transfer might be acceptable. A 10-minute delay in a payment confirmation is a business incident.
The Three Pillars of Observability
Pillar 1: Logs
Logs record events that happened. For integrations, every meaningful step of every transaction should produce a log entry.
What makes a good integration log entry:
{
"timestamp": "2026-04-18T10:30:05.123Z",
"level": "INFO",
"integration": "order-to-warehouse",
"correlationId": "user-req-abc-123",
"messageId": "msg-9876-xyz",
"sourceSystem": "ERP",
"targetSystem": "WMS",
"stage": "send",
"status": "success",
"durationMs": 142
}
Key fields:
- correlationId — the same ID used across all systems for one business transaction. This is how you trace a transaction end-to-end across multiple log sources.
- messageId — unique ID of this specific message
- integration — which flow produced this log
- stage — where in the flow (receive, validate, transform, route, send, acknowledge)
- status — success, failure, retry, dead-lettered
- durationMs — how long this step took
Structured logging (JSON format) is critical. It allows log aggregation tools (Elasticsearch, Azure Monitor, Splunk) to parse and query across fields — impossible with free-text log strings.
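A minimal sketch of emitting such an entry from Python, using only the standard logging module with a JSON formatter (the formatter class and the "fields" plumbing are illustrative, not a specific library's API):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line that aggregation tools can parse."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            **getattr(record, "fields", {}),  # integration, correlationId, stage, ...
        }
        return json.dumps(entry)

logger = logging.getLogger("integration")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured entry per processing stage
logger.info("send", extra={"fields": {
    "integration": "order-to-warehouse",
    "correlationId": "user-req-abc-123",
    "messageId": "msg-9876-xyz",
    "stage": "send",
    "status": "success",
    "durationMs": 142,
}})
```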
Pillar 2: Metrics
Metrics are numerical measurements over time. Unlike logs (one entry per event), metrics are aggregated at regular intervals.
Essential integration metrics:
| Metric | Description | Unit |
|--------|-------------|------|
| Message throughput | Messages processed per time period | messages/second or messages/hour |
| Processing latency | Time from message receipt to completion | milliseconds (p50, p95, p99) |
| Error rate | Percentage of messages that failed | % |
| Queue depth | Number of messages waiting to be processed | count |
| DLQ depth | Number of messages in the dead letter queue | count |
| Consumer lag (Kafka) | How far behind consumers are | message count or seconds |
| Integration availability | Is the integration processing messages? | boolean / % uptime |
Percentiles matter for latency: the average latency is misleading. The p95 (95th percentile) tells you what 95% of your users experience. The p99 tells you about the worst 1%.
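A quick standard-library example of why the average hides the slow tail (sample values are made up):

```python
import statistics

# Latency samples in milliseconds: most requests are fast, a few are very slow
latencies = [120] * 95 + [2400] * 5

quantiles = statistics.quantiles(latencies, n=100)  # 99 cut points
p50, p95, p99 = quantiles[49], quantiles[94], quantiles[98]

print(f"avg={statistics.mean(latencies):.0f}ms p50={p50:.0f}ms "
      f"p95={p95:.0f}ms p99={p99:.0f}ms")
# The average (~234ms) looks acceptable; p95 and p99 expose the slow tail.
```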
Pillar 3: Traces
A distributed trace tracks the full journey of a single business transaction across all systems and services. Each step is a span with a start time, duration, and success/failure status.
trace: order-1234 (total: 342ms)
├── span: ERP publish 12ms ✓
├── span: Integration receive 3ms ✓
├── span: Transform 28ms ✓
├── span: Validate 5ms ✓
├── span: WMS send 290ms ✓
└── span: Acknowledge 4ms ✓
Traces are the most powerful diagnostic tool for finding bottlenecks and pinpointing exactly where a transaction failed.
Implementation: propagate a trace ID in message headers across all systems. Use OpenTelemetry — a vendor-neutral tracing standard supported by all major cloud platforms.
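A sketch of how that propagation typically looks with the OpenTelemetry Python SDK: the producer injects the current trace context into the message headers, and the consumer extracts it so its spans join the same trace. Here broker_send and process are placeholders for your broker client and processing logic:

```python
from opentelemetry import trace, propagate

tracer = trace.get_tracer("order-to-warehouse")

def broker_send(message, headers):   # placeholder for your broker client
    ...

def process(message):                # placeholder for the actual processing
    ...

def publish(message: dict) -> None:
    """Producer side: open a span and inject its context into the message headers."""
    with tracer.start_as_current_span("ERP publish"):
        headers: dict[str, str] = {}
        propagate.inject(headers)    # writes e.g. the W3C traceparent header
        broker_send(message, headers)

def consume(message: dict, headers: dict) -> None:
    """Consumer side: extract the context so this span joins the same trace."""
    ctx = propagate.extract(headers)
    with tracer.start_as_current_span("Integration receive", context=ctx):
        process(message)
```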
What to Monitor: Integration-Specific Checks
Beyond the standard infrastructure metrics (CPU, memory, disk), integrations need domain-specific monitoring:
Heartbeat / Activity Monitoring
The most important check: is the integration actually processing messages?
Configure a monitor that checks:
- Timestamp of the last successful message processed
- Number of messages processed in the last hour
If a scheduled daily integration has not run by its expected time, alert immediately — even if all infrastructure metrics look healthy.
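In code, the check itself is simple: compare the last successful processing timestamp against an allowed idle window (where that timestamp comes from, a metrics store or a status table, depends on your platform). The alert rule below expresses the same idle condition declaratively.

```python
from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(hours=2)

def integration_is_idle(last_success: datetime) -> bool:
    """True if the integration has not processed a message within the allowed window."""
    return datetime.now(timezone.utc) - last_success > MAX_IDLE

# Example: last message processed 3 hours ago, so raise a P2 alert
last_success = datetime.now(timezone.utc) - timedelta(hours=3)
if integration_is_idle(last_success):
    print("ALERT P2: order-sync idle for more than 2h")
```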
# Alert: integration has been idle for too long
condition: time_since_last_message(integration="order-sync") > 2h
severity: P2
notify: operations-team
Queue Depth Alerting
A growing queue depth means consumers are falling behind producers. This is an early warning of:
- Consumer performance degradation
- Consumer crash
- Downstream system slowdown or failure
- Traffic spike
Alert at two thresholds:
- Warning: queue depth growing steadily for 10 minutes
- Critical: queue depth exceeds a configured maximum
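A sketch of how those two thresholds might be evaluated against recent queue-depth samples (the threshold value and the one-minute sampling interval are illustrative):

```python
QUEUE_DEPTH_MAX = 10_000   # critical threshold (illustrative)
GROWTH_WINDOW = 10         # number of one-minute samples to inspect

def queue_alert_level(samples: list[int]) -> str | None:
    """Return 'critical', 'warning', or None for the latest queue-depth samples."""
    if samples[-1] > QUEUE_DEPTH_MAX:
        return "critical"
    recent = samples[-GROWTH_WINDOW:]
    # Warning: depth has increased at every sample for the whole window
    if len(recent) == GROWTH_WINDOW and all(b > a for a, b in zip(recent, recent[1:])):
        return "warning"
    return None

print(queue_alert_level([40, 55, 80, 120, 170, 240, 330, 450, 600, 800, 1100]))  # warning
```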
Error Rate Alerting
A normal integration has a near-zero error rate for business-rule errors. A spike indicates:
- Data quality degradation in the source system
- Schema change in the source that was not communicated
- Bug introduced in a recent deployment
# Alert: error rate spike
condition: error_rate(integration="order-sync", window=5m) > 1%
severity: P2
notify: dev-team
DLQ Depth Alerting
Any message in a Dead Letter Queue requires investigation. Alert the moment the DLQ depth exceeds zero:
condition: dlq_depth(integration="order-sync") > 0
severity: P2
notify: on-call-engineer
SLA Management
An SLA (Service Level Agreement) defines the performance commitments for an integration:
- What percentage of messages must be processed successfully (e.g., 99.5%)
- What is the maximum end-to-end latency (e.g., 95% of orders confirmed to warehouse within 30 seconds)
- What is the availability target (e.g., 99.9% uptime over a rolling 30 days)
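Those percentages translate directly into budgets; for example, a 99.9% uptime target over a rolling 30 days allows roughly 43 minutes of downtime:

```python
# Downtime budget for a 99.9% availability target over a rolling 30-day window
window_minutes = 30 * 24 * 60             # 43,200 minutes
allowed_downtime = window_minutes * (1 - 0.999)
print(f"{allowed_downtime:.1f} minutes")  # 43.2 minutes
```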
Setting SLA Thresholds
SLAs should reflect business requirements, not infrastructure convenience:
- Ask: "What is the business impact if an order takes 5 minutes to reach the warehouse?" If the warehouse picks in batches every 30 minutes, a 5-minute latency SLA is meaningless. If the warehouse picks continuously, 30-second latency may be critical.
- Agree SLAs with system owners and business stakeholders before go-live — not after an incident.
Measuring SLA Compliance
Track SLA compliance over time using your metrics platform:
- Availability: percentage of time-windows where the integration was processing messages
- Latency: percentage of transactions meeting the latency target (e.g., 99% under 30 seconds)
- Success rate: percentage of messages processed successfully
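A minimal sketch of computing success rate and latency compliance for one reporting window, assuming each transaction record carries a success flag and an end-to-end latency in milliseconds (field names are illustrative):

```python
LATENCY_TARGET_MS = 30_000   # e.g. "orders confirmed within 30 seconds"

def sla_report(transactions: list[dict]) -> dict:
    """Compute success rate and latency compliance for one reporting window."""
    total = len(transactions)
    ok = [t for t in transactions if t["success"]]
    within_target = [t for t in ok if t["latencyMs"] <= LATENCY_TARGET_MS]
    return {
        "successRatePct": 100 * len(ok) / total,
        "latencyCompliancePct": 100 * len(within_target) / total,
    }

sample = [
    {"success": True,  "latencyMs": 812},
    {"success": True,  "latencyMs": 45_000},   # too slow
    {"success": False, "latencyMs": 0},        # failed
    {"success": True,  "latencyMs": 2_300},
]
print(sla_report(sample))  # {'successRatePct': 75.0, 'latencyCompliancePct': 50.0}
```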
Report SLA metrics to stakeholders weekly or monthly via a dashboard.
Integration Dashboards
Every integration in production should have a monitoring dashboard. A good integration dashboard shows:
Summary view (at-a-glance health):
- Current status: green (healthy) / amber (warning) / red (critical)
- Messages processed in last 24 hours
- Current error rate
- DLQ count
Throughput chart:
- Messages per minute over time
- Highlights spikes, drops, and dead periods
Latency chart:
- p50, p95, p99 latency over time
- Highlights degradation trends before they become incidents
Error breakdown:
- Error count by error type
- Which messages are in the DLQ (and why)
Recent events log:
- Last 20 log entries with level, stage, and status
Tools: Azure Monitor + Workbooks, Grafana + Prometheus, Datadog, Elastic + Kibana, Splunk.
Integration Operations: Runbooks
A runbook is a documented procedure for responding to a specific operational event. Every alert should have a corresponding runbook.
Example Runbook: DLQ Messages Detected
Alert: DLQ depth > 0 on integration: order-to-warehouse
Severity: P2
Steps:
1. Read DLQ messages — note the error reason and original message ID
2. Categorise: transient failure, data quality issue, code bug, or config issue
3. Check recent deployments — was a code change deployed in the last 2 hours?
4. If transient: check whether the root cause (target system) is now healthy
5. If data quality: identify the source record causing the failure
6. Fix the root cause BEFORE resubmitting
7. Resubmit DLQ messages in batches of 10, watching for re-failure
8. Update this runbook if a new failure pattern was discovered
Example Runbook: Integration Not Processing Messages
Alert: no messages processed by order-sync in 2 hours
Severity: P1
Steps:
1. Check integration service health (pods running? logs showing errors?)
2. Check upstream system — is the source system publishing messages?
3. Check message broker — is the queue receiving messages?
4. Check network connectivity between integration service and broker
5. Check for deployment in the last 4 hours
6. Escalate to integration architect if root cause not found in 20 minutes
Common Monitoring Anti-Patterns
Monitoring only infrastructure, not business flows
CPU at 30%, memory normal, pods healthy — but no messages have processed in 3 hours. Infrastructure metrics cannot tell you whether the business process is working.
Alert fatigue
Too many low-priority alerts that fire constantly cause on-call engineers to ignore all alerts. Every alert must be actionable — if an alert does not require human response, turn it off.
Missing correlation IDs
When an incident occurs, tracing a transaction across systems is impossible without correlation IDs. Add them from day one; retrofitting is painful.
No runbooks
An alert with no runbook wakes up an on-call engineer who does not know what to do. Write runbooks before go-live.
No DLQ monitoring
A DLQ that fills silently represents lost data or unprocessed business transactions. Always alert on DLQ depth.
Lecture 5 Summary
- Integration monitoring must track business flows, not just infrastructure. Infrastructure can be healthy while the business process is completely broken.
- The three pillars of observability — logs, metrics, and traces — provide different views of integration health. Use all three.
- Essential metrics: throughput, latency (p95/p99), error rate, queue depth, DLQ depth, and consumer lag.
- Structured logs with correlation IDs enable end-to-end transaction tracing across all systems.
- SLAs must be defined with business stakeholders and measured continuously.
- Every alert must have a runbook. Write runbooks before go-live, not during an incident.
- Common anti-patterns: monitoring only infrastructure, alert fatigue, missing correlation IDs, and no DLQ monitoring.
Next: Lecture 6 — Local and Cloud Integration Platforms