Lecture 5: System Integration Monitoring
Learn how to monitor live integration systems: observability pillars, key metrics, structured logging, alerting strategies, SLA management, DLQ operations, and building dashboards that make integration health visible.
An integration that nobody is watching is an integration that will eventually fail silently. Messages stop flowing, data falls out of sync, and business processes stall — all before anyone notices. Monitoring is not optional; it is a core responsibility of anyone who runs integration systems in production. This lecture covers how to build the monitoring capability that lets you know about problems before users do.
Why Integration Monitoring Is Different
Monitoring application servers is relatively straightforward — if the server is up and responding to health checks, it is working. Integration monitoring is harder:
- Failures can be silent. An integration might be running but processing zero messages because the upstream system stopped sending. The integration infrastructure is healthy; the business process is broken.
- Failures can be partial. 95% of messages process correctly; 5% fail silently and go nowhere. From the infrastructure perspective, the integration is running fine.
- Failures span systems. A message published by System A might fail in the integration layer, or succeed there but fail in System B. Tracking the full journey requires correlation across multiple logs.
- Time matters. A 10-minute delay in a file transfer might be acceptable. A 10-minute delay in a payment confirmation is a business incident.
The Three Pillars of Observability
Pillar 1: Logs
Logs record events that happened. For integrations, every meaningful step of every transaction should produce a log entry.
What makes a good integration log entry:
{
"timestamp": "2026-04-18T10:30:05.123Z",
"level": "INFO",
"integration": "order-to-warehouse",
"correlationId": "user-req-abc-123",
"messageId": "msg-9876-xyz",
"sourceSystem": "ERP",
"targetSystem": "WMS",
"stage": "send",
"status": "success",
"durationMs": 142
}
Key fields:
- correlationId — the same ID used across all systems for one business transaction. This is how you trace a transaction end-to-end across multiple log sources.
- messageId — unique ID of this specific message
- integration — which flow produced this log
- stage — where in the flow (receive, validate, transform, route, send, acknowledge)
- status — success, failure, retry, dead-lettered
- durationMs — how long this step took
Structured logging (JSON format) is critical. It allows log aggregation tools (Elasticsearch, Azure Monitor, Splunk) to parse and query across fields — impossible with free-text log strings.
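A minimal sketch of emitting such an entry from Python, using only the standard logging module with a JSON formatter (the formatter class and the "fields" plumbing are illustrative, not a specific library's API):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line that aggregation tools can parse."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            **getattr(record, "fields", {}),  # integration, correlationId, stage, ...
        }
        return json.dumps(entry)

logger = logging.getLogger("integration")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured entry per processing stage
logger.info("send", extra={"fields": {
    "integration": "order-to-warehouse",
    "correlationId": "user-req-abc-123",
    "messageId": "msg-9876-xyz",
    "stage": "send",
    "status": "success",
    "durationMs": 142,
}})
```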
Pillar 2: Metrics
Metrics are numerical measurements over time. Unlike logs (one entry per event), metrics are aggregated at regular intervals.
Essential integration metrics:
| Metric | Description | Unit |
|--------|-------------|------|
| Message throughput | Messages processed per time period | messages/second or messages/hour |
| Processing latency | Time from message receipt to completion | milliseconds (p50, p95, p99) |
| Error rate | Percentage of messages that failed | % |
| Queue depth | Number of messages waiting to be processed | count |
| DLQ depth | Number of messages in the dead letter queue | count |
| Consumer lag (Kafka) | How far behind consumers are | message count or seconds |
| Integration availability | Is the integration processing messages? | boolean / % uptime |
Percentiles matter for latency: the average latency is misleading. The p95 (95th percentile) tells you what 95% of your users experience. The p99 tells you about the worst 1%.
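A quick standard-library example of why the average hides the slow tail (sample values are made up):

```python
import statistics

# Latency samples in milliseconds: most requests are fast, a few are very slow
latencies = [120] * 95 + [2400] * 5

quantiles = statistics.quantiles(latencies, n=100)  # 99 cut points
p50, p95, p99 = quantiles[49], quantiles[94], quantiles[98]

print(f"avg={statistics.mean(latencies):.0f}ms p50={p50:.0f}ms "
      f"p95={p95:.0f}ms p99={p99:.0f}ms")
# The average (~234ms) looks acceptable; p95 and p99 expose the slow tail.
```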
Pillar 3: Traces
A distributed trace tracks the full journey of a single business transaction across all systems and services. Each step is a span with a start time, duration, and success/failure status.
trace: order-1234 (total: 342ms)
├── span: ERP publish 12ms ✓
├── span: Integration receive 3ms ✓
├── span: Transform 28ms ✓
├── span: Validate 5ms ✓
├── span: WMS send 290ms ✓
└── span: Acknowledge 4ms ✓
Traces are the most powerful diagnostic tool for finding bottlenecks and pinpointing exactly where a transaction failed.
Implementation: propagate a trace ID in message headers across all systems. Use OpenTelemetry — a vendor-neutral tracing standard supported by all major cloud platforms.
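A sketch of how that propagation typically looks with the OpenTelemetry Python SDK: the producer injects the current trace context into the message headers, and the consumer extracts it so its spans join the same trace. Here broker_send and process are placeholders for your broker client and processing logic:

```python
from opentelemetry import trace, propagate

tracer = trace.get_tracer("order-to-warehouse")

def broker_send(message, headers):   # placeholder for your broker client
    ...

def process(message):                # placeholder for the actual processing
    ...

def publish(message: dict) -> None:
    """Producer side: open a span and inject its context into the message headers."""
    with tracer.start_as_current_span("ERP publish"):
        headers: dict[str, str] = {}
        propagate.inject(headers)    # writes e.g. the W3C traceparent header
        broker_send(message, headers)

def consume(message: dict, headers: dict) -> None:
    """Consumer side: extract the context so this span joins the same trace."""
    ctx = propagate.extract(headers)
    with tracer.start_as_current_span("Integration receive", context=ctx):
        process(message)
```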
What to Monitor: Integration-Specific Checks
Beyond the standard infrastructure metrics (CPU, memory, disk), integrations need domain-specific monitoring:
Heartbeat / Activity Monitoring
The most important check: is the integration actually processing messages?
Configure a monitor that checks:
- Timestamp of the last successful message processed
- Number of messages processed in the last hour
If a scheduled daily integration has not run by its expected time, alert immediately — even if all infrastructure metrics look healthy.
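In code, the check itself is simple: compare the last successful processing timestamp against an allowed idle window (where that timestamp comes from, a metrics store or a status table, depends on your platform). The alert rule below expresses the same idle condition declaratively.

```python
from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(hours=2)

def integration_is_idle(last_success: datetime) -> bool:
    """True if the integration has not processed a message within the allowed window."""
    return datetime.now(timezone.utc) - last_success > MAX_IDLE

# Example: last message processed 3 hours ago, so raise a P2 alert
last_success = datetime.now(timezone.utc) - timedelta(hours=3)
if integration_is_idle(last_success):
    print("ALERT P2: order-sync idle for more than 2h")
```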
# Alert: integration has been idle for too long
condition: time_since_last_message(integration="order-sync") > 2h
severity: P2
notify: operations-team
Queue Depth Alerting
A growing queue depth means consumers are falling behind producers. This is an early warning of:
- Consumer performance degradation
- Consumer crash
- Downstream system slowdown or failure
- Traffic spike
Alert at two thresholds:
- Warning: queue depth growing steadily for 10 minutes
- Critical: queue depth exceeds a configured maximum
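A sketch of how those two thresholds might be evaluated against recent queue-depth samples (the threshold value and the one-minute sampling interval are illustrative):

```python
QUEUE_DEPTH_MAX = 10_000   # critical threshold (illustrative)
GROWTH_WINDOW = 10         # number of one-minute samples to inspect

def queue_alert_level(samples: list[int]) -> str | None:
    """Return 'critical', 'warning', or None for the latest queue-depth samples."""
    if samples[-1] > QUEUE_DEPTH_MAX:
        return "critical"
    recent = samples[-GROWTH_WINDOW:]
    # Warning: depth has increased at every sample for the whole window
    if len(recent) == GROWTH_WINDOW and all(b > a for a, b in zip(recent, recent[1:])):
        return "warning"
    return None

print(queue_alert_level([40, 55, 80, 120, 170, 240, 330, 450, 600, 800, 1100]))  # warning
```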
Error Rate Alerting
A normal integration has a near-zero error rate for business-rule errors. A spike indicates:
- Data quality degradation in the source system
- Schema change in the source that was not communicated
- Bug introduced in a recent deployment
# Alert: error rate spike
condition: error_rate(integration="order-sync", window=5m) > 1%
severity: P2
notify: dev-team
DLQ Depth Alerting
Any message in a Dead Letter Queue requires investigation. Alert the moment the DLQ depth exceeds zero:
condition: dlq_depth(integration="order-sync") > 0
severity: P2
notify: on-call-engineer
SLA Management
An SLA (Service Level Agreement) defines the performance commitments for an integration:
- What percentage of messages must be processed successfully (e.g., 99.5%)
- What is the maximum end-to-end latency (e.g., 95% of orders confirmed to warehouse within 30 seconds)
- What is the availability target (e.g., 99.9% uptime over a rolling 30 days)
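Those percentages translate directly into budgets; for example, a 99.9% uptime target over a rolling 30 days allows roughly 43 minutes of downtime:

```python
# Downtime budget for a 99.9% availability target over a rolling 30-day window
window_minutes = 30 * 24 * 60             # 43,200 minutes
allowed_downtime = window_minutes * (1 - 0.999)
print(f"{allowed_downtime:.1f} minutes")  # 43.2 minutes
```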
Setting SLA Thresholds
SLAs should reflect business requirements, not infrastructure convenience:
- Ask: "What is the business impact if an order takes 5 minutes to reach the warehouse?" If the warehouse picks in batches every 30 minutes, a 5-minute latency SLA is meaningless. If the warehouse picks continuously, 30-second latency may be critical.
- Agree SLAs with system owners and business stakeholders before go-live — not after an incident.
Measuring SLA Compliance
Track SLA compliance over time using your metrics platform:
- Availability: percentage of time-windows where the integration was processing messages
- Latency: percentage of transactions meeting the latency target (e.g., 99% under 30 seconds)
- Success rate: percentage of messages processed successfully
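A minimal sketch of computing success rate and latency compliance for one reporting window, assuming each transaction record carries a success flag and an end-to-end latency in milliseconds (field names are illustrative):

```python
LATENCY_TARGET_MS = 30_000   # e.g. "orders confirmed within 30 seconds"

def sla_report(transactions: list[dict]) -> dict:
    """Compute success rate and latency compliance for one reporting window."""
    total = len(transactions)
    ok = [t for t in transactions if t["success"]]
    within_target = [t for t in ok if t["latencyMs"] <= LATENCY_TARGET_MS]
    return {
        "successRatePct": 100 * len(ok) / total,
        "latencyCompliancePct": 100 * len(within_target) / total,
    }

sample = [
    {"success": True,  "latencyMs": 812},
    {"success": True,  "latencyMs": 45_000},   # too slow
    {"success": False, "latencyMs": 0},        # failed
    {"success": True,  "latencyMs": 2_300},
]
print(sla_report(sample))  # {'successRatePct': 75.0, 'latencyCompliancePct': 50.0}
```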
Report SLA metrics to stakeholders weekly or monthly via a dashboard.
Integration Dashboards
Every integration in production should have a monitoring dashboard. A good integration dashboard shows:
Summary view (at-a-glance health):
- Current status: green (healthy) / amber (warning) / red (critical)
- Messages processed in last 24 hours
- Current error rate
- DLQ count
Throughput chart:
- Messages per minute over time
- Highlights spikes, drops, and dead periods
Latency chart:
- p50, p95, p99 latency over time
- Highlights degradation trends before they become incidents
Error breakdown:
- Error count by error type
- Which messages are in the DLQ (and why)
Recent events log:
- Last 20 log entries with level, stage, and status
Tools: Azure Monitor + Workbooks, Grafana + Prometheus, Datadog, Elastic + Kibana, Splunk.
Integration Operations: Runbooks
A runbook is a documented procedure for responding to a specific operational event. Every alert should have a corresponding runbook.
Example Runbook: DLQ Messages Detected
Alert: DLQ depth > 0 on integration: order-to-warehouse
Severity: P2
Steps:
1. Read DLQ messages — note the error reason and original message ID
2. Categorise: transient failure, data quality issue, code bug, or config issue
3. Check recent deployments — was a code change deployed in the last 2 hours?
4. If transient: check whether the root cause (target system) is now healthy
5. If data quality: identify the source record causing the failure
6. Fix the root cause BEFORE resubmitting
7. Resubmit DLQ messages in batches of 10, watching for re-failure
8. Update this runbook if a new failure pattern was discovered
Example Runbook: Integration Not Processing Messages
Alert: no messages processed by order-sync in 2 hours
Severity: P1
Steps:
1. Check integration service health (pods running? logs showing errors?)
2. Check upstream system — is the source system publishing messages?
3. Check message broker — is the queue receiving messages?
4. Check network connectivity between integration service and broker
5. Check for deployment in the last 4 hours
6. Escalate to integration architect if root cause not found in 20 minutes
Common Monitoring Anti-Patterns
Monitoring only infrastructure, not business flows
CPU at 30%, memory normal, pods healthy — but no messages have processed in 3 hours. Infrastructure metrics cannot tell you whether the business process is working.
Alert fatigue
Too many low-priority alerts that fire constantly cause on-call engineers to ignore all alerts. Every alert must be actionable — if an alert does not require human response, turn it off.
Missing correlation IDs
When an incident occurs, tracing a transaction across systems is impossible without correlation IDs. Add them from day one; retrofitting is painful.
No runbooks
An alert with no runbook wakes up an on-call engineer who does not know what to do. Write runbooks before go-live.
No DLQ monitoring
A DLQ that fills silently represents lost data or unprocessed business transactions. Always alert on DLQ depth.
Lecture 5 Summary
- Integration monitoring must track business flows, not just infrastructure. Infrastructure can be healthy while the business process is completely broken.
- The three pillars of observability — logs, metrics, and traces — provide different views of integration health. Use all three.
- Essential metrics: throughput, latency (p95/p99), error rate, queue depth, DLQ depth, and consumer lag.
- Structured logs with correlation IDs enable end-to-end transaction tracing across all systems.
- SLAs must be defined with business stakeholders and measured continuously.
- Every alert must have a runbook. Write runbooks before go-live, not during an incident.
- Common anti-patterns: monitoring only infrastructure, alert fatigue, missing correlation IDs, and no DLQ monitoring.
Next: Lecture 6 — Local and Cloud Integration Platforms