Module 7: Integration Maintenance and Optimisation
Monitor and troubleshoot live integration processes, apply performance tuning and optimisation techniques, manage and resolve integration errors, scale integration solutions, and maintain a healthy long-term integration architecture.
Building an integration is not the end of the work — it is the beginning of a long operational relationship. Integration solutions must be monitored, maintained, and optimised over their entire lifetime. Systems change, data volumes grow, and failure modes emerge that were not anticipated during design. This module equips you to keep integrations healthy, diagnose problems quickly, and evolve the integration architecture as business and technical requirements change.
Monitoring and Troubleshooting Integration Processes
The Integration Operations Mindset
An integration that nobody is watching is an integration that will fail silently. The first principle of integration operations is: assume things will go wrong, and instrument everything so you know immediately when they do.
Key Metrics to Monitor
Health metrics (detect that something is wrong):
| Metric | What it indicates | Alert threshold |
|---|---|---|
| Integration heartbeat | Is the integration running? | No activity within expected window |
| Message throughput | Are messages flowing? | > 30% deviation from baseline |
| Error rate | How many messages are failing? | > 1% error rate |
| DLQ depth | Are unprocessable messages accumulating? | Any message in DLQ |
| Processing latency | Is the integration slowing down? | > 2× baseline |
| Queue depth | Are consumers keeping up with producers? | Sustained growth over 10 minutes |
Capacity metrics (detect that you are approaching a limit):
| Metric | Alert threshold |
|---|---|
| CPU utilisation of integration service | > 80% sustained |
| Memory utilisation | > 85% |
| Database connection pool utilisation | > 70% |
| Message broker disk usage | > 70% |
| Network bandwidth | > 60% of provisioned capacity |
Structured Logging
Structured logging is the foundation of effective integration observability. Every integration log entry should be a machine-readable record, not a prose sentence.
Example of useful structured log:
```json
{
  "timestamp": "2026-04-18T10:30:05.123Z",
  "level": "ERROR",
  "integration": "order-to-warehouse",
  "correlationId": "abc-123-xyz",
  "messageId": "msg-9876",
  "sourceSystem": "ERP",
  "targetSystem": "WMS",
  "stage": "transformation",
  "error": "Missing required field: warehouseCode",
  "payload": { "orderId": "ORD-1234" }
}
```

Consistent fields across all integrations:
- `correlationId` — links all log entries for a single business transaction across all systems
- `messageId` — the unique ID of the specific message being processed
- `integration` — which integration flow generated this log entry
- `stage` — where in the flow (receive, transform, route, send, acknowledge)
- `status` — success, failure, retry, dead-lettered
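A minimal sketch of emitting such records with Python's standard `logging` and `json` modules follows; the `log_event` helper and the field values are illustrative, not a prescribed API:

```python
import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("order-to-warehouse")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

def log_event(level: int, correlation_id: str, message_id: str,
              stage: str, status: str, **extra) -> None:
    """Emit one machine-readable record carrying the consistent fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": logging.getLevelName(level),
        "integration": "order-to-warehouse",
        "correlationId": correlation_id,
        "messageId": message_id,
        "stage": stage,
        "status": status,
        **extra,
    }
    logger.log(level, json.dumps(record))

# One entry per processing stage, all sharing the same correlation ID.
log_event(logging.ERROR, "abc-123-xyz", "msg-9876",
          stage="transformation", status="failure",
          error="Missing required field: warehouseCode")
```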
Distributed Tracing
For complex multi-step integration flows (especially event-driven choreography), distributed tracing provides end-to-end visibility:
- A trace spans the entire journey of a business transaction across systems
- Each step (publish, consume, transform, forward) is a span in the trace
- Trace IDs are propagated in message headers across all systems
Tools: OpenTelemetry (vendor-neutral), Azure Application Insights, AWS X-Ray, Jaeger, Zipkin.
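The sketch below shows the propagation pattern with the OpenTelemetry Python API, assuming an SDK and exporter are already configured; `broker`, `message`, and `process` are placeholders for your messaging client and handler:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-to-warehouse")

def publish(broker, payload: bytes) -> None:
    # Producer side: open a span and inject its context into message headers.
    with tracer.start_as_current_span("publish order.created"):
        headers: dict = {}
        inject(headers)  # writes W3C traceparent/tracestate into the carrier
        broker.send("order.created", payload, headers=headers)

def consume(message) -> None:
    # Consumer side: continue the same trace from the incoming headers.
    ctx = extract(message.headers)
    with tracer.start_as_current_span("consume order.created", context=ctx):
        process(message)  # placeholder for the actual handler
```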
Troubleshooting Methodology
When an integration issue is reported, follow a systematic approach:
Step 1: Establish scope
- Is this one message or all messages failing?
- Is this one integration or all integrations?
- Did anything change recently (deployment, config change, upstream system change)?
Step 2: Trace the message
- Use the correlation ID to find all log entries for the affected transaction
- Identify the step where the message stopped or was rejected
- Check whether the message is in the DLQ
Step 3: Diagnose the cause
- Transient failure (network timeout, service temporarily unavailable) → check retry logs; intervene manually only if automatic retries did not clear the failure
- Data quality failure (invalid field, missing required data) → identify source of bad data, fix upstream if possible
- Configuration failure (wrong endpoint URL, expired credentials) → fix configuration, redeploy
- Code bug (transformation logic error, incorrect routing rule) → fix code, deploy fix, resubmit DLQ messages
Step 4: Recover
- Fix the root cause before resubmitting DLQ messages — resubmitting without fixing will refill the DLQ
- Resubmit DLQ messages in batches, monitoring for re-failure
- Verify the full flow completes correctly after resubmission
Step 5: Prevent recurrence
- Add a test that would have caught the bug
- Add a monitoring rule that would have alerted earlier
- Document the incident and resolution
Performance Tuning and Optimisation Techniques
Identify Bottlenecks First
Do not optimise blindly. Use profiling and monitoring data to identify the actual bottleneck before changing anything:
- Message broker throughput — is the broker itself the limit? (rare for modern managed services)
- Consumer processing time — is each message taking too long to process?
- Database or API latency — is the integration waiting on a downstream call?
- Serialisation/deserialisation — is JSON parsing or schema validation adding meaningful overhead?
- Network latency — is cross-region or cross-datacenter communication adding latency?
Optimisation Techniques
Increase consumer parallelism
Add more consumer instances to process messages in parallel. This is often the most impactful lever for improving throughput; a sketch of the required idempotency check follows the list below.
Requirements for safe parallelism:
- Consumers must be idempotent — processing a message twice must not cause errors or data corruption
- If message ordering matters, use partitioned topics (Kafka partitions, Service Bus sessions) to ensure ordered processing within each partition while allowing parallel processing across partitions
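A minimal sketch of the idempotency check in Python; the in-memory set stands in for what would be a shared, durable store (such as a database table with a unique constraint) in production:

```python
import threading

_processed: set[str] = set()
_lock = threading.Lock()

def handle(message) -> None:
    """Idempotent handler: a redelivered message is detected and skipped."""
    with _lock:
        if message.id in _processed:
            return  # duplicate delivery from a retry or rebalance: skip safely
    apply_business_logic(message)   # placeholder for the real work
    with _lock:
        _processed.add(message.id)  # record only after success, so a failed
                                    # attempt is retried rather than skipped
```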
Batching
Instead of processing one message at a time, consume and process messages in batches:
- Reduces per-message overhead (fewer roundtrips to the broker, fewer database transactions)
- Particularly effective for bulk database writes (batch insert instead of insert-per-message)
- Trade-off: increases latency for individual messages; increases impact if a batch fails
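A sketch of the batched shape in Python; `consumer`, `db`, `transform`, and `insert_orders` are stand-ins for your broker client and data layer:

```python
def consume_batch(consumer, db, batch_size: int = 100) -> None:
    """Consume up to batch_size messages and write them in one transaction."""
    messages = consumer.poll(max_messages=batch_size, timeout_seconds=5)
    if not messages:
        return
    rows = [transform(m) for m in messages]  # per-message transformation
    with db.begin():                         # one transaction, not one per message
        db.execute(insert_orders, rows)      # single batched INSERT
    consumer.acknowledge(messages)           # ack only after the commit succeeds
```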
Connection pooling
Integration services that make database or HTTP calls must use connection pooling:
- Establish a pool of connections at startup; reuse them for each message
- Never open and close a connection per message — the overhead is significant at scale
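For example, in Python with `requests` and SQLAlchemy (the endpoint and DSN are illustrative), the session and engine are created once at startup and reused for every message:

```python
import requests
from sqlalchemy import create_engine, text

http = requests.Session()  # keep-alive connection reuse for HTTP calls
engine = create_engine(
    "postgresql://integration:secret@db-host/orders",  # illustrative DSN
    pool_size=10,        # connections held open in the pool
    max_overflow=5,      # short-lived extras allowed under burst load
    pool_pre_ping=True,  # validate a connection before handing it out
)

def handle(order: dict) -> None:
    resp = http.post("https://wms.example.com/api/stock", json=order)  # hypothetical endpoint
    resp.raise_for_status()
    with engine.begin() as conn:  # borrows from the pool, auto-returns on exit
        conn.execute(
            text("UPDATE orders SET status = 'SENT' WHERE id = :id"),
            {"id": order["orderId"]},
        )
```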
Caching
Integration flows often perform lookups against reference data (currency codes, product codes, customer classifications). Cache these lookups:
- Use in-process caching for frequently accessed, slowly changing reference data
- Refresh the cache on a schedule (not per message)
- Be aware that a stale cache can produce incorrect transformation results
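A hand-rolled TTL cache in Python illustrates the refresh-on-schedule pattern; `load_warehouse_codes` is a hypothetical bulk loader:

```python
import time

class ReferenceCache:
    """In-process cache refreshed on a schedule, not per message."""

    def __init__(self, loader, ttl_seconds: int = 300):
        self._loader = loader        # callable that fetches the full table
        self._ttl = ttl_seconds
        self._data: dict = {}
        self._loaded_at = 0.0

    def get(self, key):
        if time.monotonic() - self._loaded_at > self._ttl:
            self._data = self._loader()       # one bulk refresh
            self._loaded_at = time.monotonic()
        return self._data.get(key)            # stale by at most ttl_seconds

# Usage: warehouse_codes = ReferenceCache(load_warehouse_codes, ttl_seconds=600)
```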
Async patterns for slow downstream systems
If the target system is slow, do not block the consumer waiting for it:
- Write to a queue and let a separate consumer call the slow system at its own pace
- Use the request-reply pattern with a callback when the result is eventually available
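A minimal in-process sketch of the decoupling in Python; in production the buffer would be a durable broker queue rather than a `queue.Queue`, and `call_slow_target` is a placeholder:

```python
import queue
import threading

# In production this buffer would be a durable broker queue; the shape
# of the decoupling is the same.
outbound: queue.Queue = queue.Queue()

def fast_consumer(message: dict) -> None:
    outbound.put(message)  # acknowledge quickly; do not wait for the target

def slow_system_worker() -> None:
    while True:
        msg = outbound.get()
        call_slow_target(msg)  # placeholder for the slow downstream call
        outbound.task_done()

threading.Thread(target=slow_system_worker, daemon=True).start()
```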
Compression
For high-volume messaging with large payloads, enable message compression (gzip, LZ4, Snappy). Most message brokers support this natively. Effective for text-based payloads (JSON, XML); minimal benefit for already-compressed binary formats.
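For example, with the kafka-python client this is a single producer setting (other brokers expose an equivalent option; the broker address and topic are illustrative):

```python
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="broker:9092",   # illustrative address
    compression_type="gzip",           # "lz4" and "snappy" are also supported
)
producer.send("order.created", b'{"orderId": "ORD-1234"}')
```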
Managing and Resolving Integration Errors
Error Classification and Response Matrix
| Error Class | Examples | Immediate Action | Resolution Owner |
|---|---|---|---|
| Transient | Network timeout, 503 | Retry (exponential backoff) | Automatic |
| Data quality | Missing field, invalid format | DLQ, alert data owner | Source system team |
| Business rule | Duplicate key, unknown reference | DLQ, alert business | Business / data steward |
| Configuration | Wrong endpoint, expired cert | Alert ops, pause flow | Operations team |
| Code defect | NullPointerException in transform | DLQ, alert dev team | Development team |
| Capacity | Consumer too slow, queue growing | Scale out, alert ops | Operations / architecture |
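A sketch of how the matrix can drive automated handling in Python; the exception-to-class mapping shown is illustrative and should reflect the exceptions your own client libraries actually raise:

```python
import requests

def classify(exc: Exception) -> str:
    """Map an exception to the immediate action from the matrix above."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "retry"             # transient: retry with exponential backoff
    if isinstance(exc, requests.HTTPError) and exc.response is not None \
            and exc.response.status_code == 503:
        return "retry"
    if isinstance(exc, (KeyError, ValueError)):
        return "dead_letter"       # data quality: park in DLQ, alert data owner
    return "dead_letter_and_page"  # likely a code defect: DLQ plus dev alert
```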
SLA Management for Integration Errors
Define and agree SLAs for error resolution with the business:
| Severity | Definition | Response Time | Resolution Time |
|---|---|---|---|
| P1 | Integration completely down; business process halted | 15 minutes | 4 hours |
| P2 | Integration degraded; some transactions failing | 1 hour | 8 hours |
| P3 | Non-critical errors; DLQ messages requiring manual review | 4 hours | 2 business days |
| P4 | Cosmetic / low impact | Next business day | Next sprint |
DLQ Review Process
Establish a regular DLQ review cadence (at minimum, weekly for non-production; daily or real-time alerting for production):
- Categorise messages by error type
- Identify root cause for each category
- Fix root cause before resubmitting
- Resubmit messages in controlled batches
- Monitor post-resubmission to confirm success
- Update runbook with new failure patterns and resolutions
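A sketch of controlled batch resubmission in Python; `dlq`, `target_queue`, and `check_error_rate` are stand-ins for your broker client and monitoring query:

```python
def resubmit_dlq(dlq, target_queue, batch_size: int = 50) -> None:
    """Resubmit DLQ messages in controlled batches; stop on re-failure."""
    while True:
        batch = dlq.receive(max_messages=batch_size)
        if not batch:
            break
        for msg in batch:
            target_queue.send(msg.body, headers=msg.headers)
            dlq.complete(msg)        # remove from the DLQ only after resend
        if check_error_rate() > 0:   # placeholder: query your monitoring
            raise RuntimeError("Re-failures detected: stop and re-diagnose")
```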
Upgrading and Scaling Integration Solutions
Scaling Patterns
Vertical scaling — increase CPU and memory of integration service instances. Simple but limited; reaches a ceiling.
Horizontal scaling — add more instances of the integration service. Scales much further, but requires instances to be stateless or to externalise shared state.
Partitioned consumers — for ordered processing, partition the message stream by a key (customer ID, order ID) and assign each partition to a dedicated consumer instance.
Autoscaling — configure cloud platforms to scale consumer instances automatically based on queue depth or CPU utilisation. Azure Container Apps, Kubernetes KEDA, and AWS ECS with queue-based autoscaling support this natively.
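With Kafka, for example, partitioned ordering is achieved by producing with a key; a sketch using the kafka-python client (broker address and topic are illustrative):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")

# Messages sharing a key always land on the same partition, so ordering
# is preserved per customer while different customers process in parallel.
producer.send(
    "order.events",
    key=b"CUST-42",                      # partition key: the customer ID
    value=b'{"orderId": "ORD-1234"}',
)
```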
Upgrading Integration Components
Integration upgrades require careful coordination because multiple systems depend on each component:
Message broker upgrades:
- Test the upgrade in a lower environment first
- Check for breaking API changes in the broker SDK
- Schedule during a low-traffic window
- Plan for a rollback if the upgrade causes issues
Platform / middleware upgrades:
- Review release notes for deprecations that affect your integrations
- Update SDK dependencies before upgrading the platform
- Run the full integration test suite after upgrade
Schema migrations:
- Follow the versioning strategy from Module 6
- Deploy the new schema version alongside the old version
- Migrate consumers before retiring the old version
Maintaining a Healthy Integration Architecture
Technical Debt in Integration
Integration technical debt accumulates faster than most teams expect:
- Undocumented integrations added without going through the governance process
- Hard-coded values that should be configuration
- Transformation logic that has grown far beyond its original design
- Point-to-point connections added to avoid the ESB
- Deprecated integrations still running because nobody is confident enough to remove them
Quarterly integration health review:
- Review the integration registry for accuracy — are all live integrations listed? Are deprecated ones removed?
- Check for unused or underused integrations (no messages in 90 days)
- Review error rates and DLQ trends — are any integrations chronically failing?
- Review performance metrics — are any integrations approaching capacity limits?
- Review security: are all credentials rotated? Are all certificates valid for the next 90 days?
Integration Roadmap Maintenance
Keep a rolling 12-month integration roadmap that captures:
- Planned new integrations (with business case and timeline)
- Planned upgrades (platform versions, schema migrations)
- Planned deprecations (what will be retired and when)
- Capacity investments (scaling plans based on projected growth)
The roadmap is reviewed and updated quarterly in alignment with the business and technology roadmap.
Documentation as a First-Class Concern
Integration documentation degrades faster than application documentation because:
- Integration flows are more complex and span more systems
- System owners change, and new owners do not inherit knowledge of integration dependencies
- Undocumented integrations are the ones most likely to be accidentally broken
Maintain as living documentation:
- The integration registry (always current)
- Integration Design Documents (updated whenever a change is made)
- Runbooks for common failure scenarios
- Architecture diagrams (updated at minimum quarterly)
Module Summary
- Monitor availability, throughput, latency, error rate, and DLQ depth for every integration. Alert on deviations — do not wait for users to report failures.
- Structured logging with consistent fields (correlationId, messageId, stage, status) is the foundation of effective troubleshooting. Distributed tracing gives end-to-end visibility for complex flows.
- Optimise based on measured bottlenecks: increase consumer parallelism, enable batching, use connection pooling, cache reference data, and compress large payloads.
- Classify errors by root cause (transient, data quality, configuration, code defect, capacity) and respond accordingly. Establish P1–P4 SLAs and a DLQ review cadence.
- Maintain the integration architecture actively: quarterly health reviews, a rolling roadmap, up-to-date documentation, and proactive credential and certificate rotation.
You have completed the Integration Architect Training. Return to the overview to review the capstone project brief and design your end-to-end integration architecture.