Module 7: Integration Maintenance and Optimisation
Monitor and troubleshoot live integration processes, apply performance tuning and optimisation techniques, manage and resolve integration errors, scale integration solutions, and maintain a healthy long-term integration architecture.
Building an integration is not the end of the work — it is the beginning of a long operational relationship. Integration solutions must be monitored, maintained, and optimised over their entire lifetime. Systems change, data volumes grow, and failure modes emerge that were not anticipated during design. This module equips you to keep integrations healthy, diagnose problems quickly, and evolve the integration architecture as business and technical requirements change.
Monitoring and Troubleshooting Integration Processes
The Integration Operations Mindset
An integration that nobody is watching is an integration that will fail silently. The first principle of integration operations is: assume things will go wrong, and instrument everything so you know immediately when they do.
Key Metrics to Monitor
Health metrics (detect that something is wrong):
| Metric | What it indicates | Alert threshold |
|---|---|---|
| Integration heartbeat | Is the integration running? | No activity within expected window |
| Message throughput | Are messages flowing? | > 30% deviation from baseline |
| Error rate | How many messages are failing? | > 1% error rate |
| DLQ depth | Are unprocessable messages accumulating? | Any message in DLQ |
| Processing latency | Is the integration slowing down? | > 2× baseline |
| Queue depth | Are consumers keeping up with producers? | Sustained growth over 10 minutes |
Capacity metrics (detect that you are approaching a limit):
| Metric | Alert threshold |
|---|---|
| CPU utilisation of integration service | > 80% sustained |
| Memory utilisation | > 85% |
| Database connection pool utilisation | > 70% |
| Message broker disk usage | > 70% |
| Network bandwidth | > 60% of provisioned capacity |
Structured Logging
Structured logging is the foundation of effective integration observability. Every integration log entry should be a machine-readable record, not a prose sentence.
Example of useful structured log:
```json
{
  "timestamp": "2026-04-18T10:30:05.123Z",
  "level": "ERROR",
  "integration": "order-to-warehouse",
  "correlationId": "abc-123-xyz",
  "messageId": "msg-9876",
  "sourceSystem": "ERP",
  "targetSystem": "WMS",
  "stage": "transformation",
  "error": "Missing required field: warehouseCode",
  "payload": { "orderId": "ORD-1234" }
}
```

Consistent fields across all integrations:
- `correlationId` — links all log entries for a single business transaction across all systems
- `messageId` — the unique ID of the specific message being processed
- `integration` — which integration flow generated this log entry
- `stage` — where in the flow (receive, transform, route, send, acknowledge)
- `status` — success, failure, retry, dead-lettered
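A minimal sketch of emitting such records with Python's standard `logging` and `json` modules follows; the `log_event` helper and the field values are illustrative, not a prescribed API:

```python
import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("order-to-warehouse")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

def log_event(level: int, correlation_id: str, message_id: str,
              stage: str, status: str, **extra) -> None:
    """Emit one machine-readable record carrying the consistent fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": logging.getLevelName(level),
        "integration": "order-to-warehouse",
        "correlationId": correlation_id,
        "messageId": message_id,
        "stage": stage,
        "status": status,
        **extra,
    }
    logger.log(level, json.dumps(record))

# One entry per processing stage, all sharing the same correlation ID.
log_event(logging.ERROR, "abc-123-xyz", "msg-9876",
          stage="transformation", status="failure",
          error="Missing required field: warehouseCode")
```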
Distributed Tracing
For complex multi-step integration flows (especially event-driven choreography), distributed tracing provides end-to-end visibility:
- A trace spans the entire journey of a business transaction across systems
- Each step (publish, consume, transform, forward) is a span in the trace
- Trace IDs are propagated in message headers across all systems
Tools: OpenTelemetry (vendor-neutral), Azure Application Insights, AWS X-Ray, Jaeger, Zipkin.
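The sketch below shows the propagation pattern with the OpenTelemetry Python API, assuming an SDK and exporter are already configured; `broker`, `message`, and `process` are placeholders for your messaging client and handler:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-to-warehouse")

def publish(broker, payload: bytes) -> None:
    # Producer side: open a span and inject its context into message headers.
    with tracer.start_as_current_span("publish order.created"):
        headers: dict = {}
        inject(headers)  # writes W3C traceparent/tracestate into the carrier
        broker.send("order.created", payload, headers=headers)

def consume(message) -> None:
    # Consumer side: continue the same trace from the incoming headers.
    ctx = extract(message.headers)
    with tracer.start_as_current_span("consume order.created", context=ctx):
        process(message)  # placeholder for the actual handler
```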
Troubleshooting Methodology
When an integration issue is reported, follow a systematic approach:
Step 1: Establish scope
- Is this one message or all messages failing?
- Is this one integration or all integrations?
- Did anything change recently (deployment, config change, upstream system change)?
Step 2: Trace the message
- Use the correlation ID to find all log entries for the affected transaction
- Identify the step where the message stopped or was rejected
- Check whether the message is in the DLQ
Step 3: Diagnose the cause
- Transient failure (network timeout, service temporarily unavailable) → check retry logs; intervene manually only if automatic retries did not clear the failure
- Data quality failure (invalid field, missing required data) → identify source of bad data, fix upstream if possible
- Configuration failure (wrong endpoint URL, expired credentials) → fix configuration, redeploy
- Code bug (transformation logic error, incorrect routing rule) → fix code, deploy fix, resubmit DLQ messages
Step 4: Recover
- Fix the root cause before resubmitting DLQ messages — resubmitting without fixing will refill the DLQ
- Resubmit DLQ messages in batches, monitoring for re-failure
- Verify the full flow completes correctly after resubmission
Step 5: Prevent recurrence
- Add a test that would have caught the bug
- Add a monitoring rule that would have alerted earlier
- Document the incident and resolution
Performance Tuning and Optimisation Techniques
Identify Bottlenecks First
Do not optimise blindly. Use profiling and monitoring data to identify the actual bottleneck before changing anything:
- Message broker throughput — is the broker itself the limit? (rare for modern managed services)
- Consumer processing time — is each message taking too long to process?
- Database or API latency — is the integration waiting on a downstream call?
- Serialisation/deserialisation — is JSON parsing or schema validation adding meaningful overhead?
- Network latency — is cross-region or cross-datacenter communication adding latency?
Optimisation Techniques
Increase consumer parallelism
Add more consumer instances to process messages in parallel. This is often the most impactful lever for improving throughput; a sketch of the required idempotency check follows the list below.
Requirements for safe parallelism:
- Consumers must be idempotent — processing a message twice must not cause errors or data corruption
- If message ordering matters, use partitioned topics (Kafka partitions, Service Bus sessions) to ensure ordered processing within each partition while allowing parallel processing across partitions
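A minimal sketch of the idempotency check in Python; the in-memory set stands in for what would be a shared, durable store (such as a database table with a unique constraint) in production:

```python
import threading

_processed: set[str] = set()
_lock = threading.Lock()

def handle(message) -> None:
    """Idempotent handler: a redelivered message is detected and skipped."""
    with _lock:
        if message.id in _processed:
            return  # duplicate delivery from a retry or rebalance: skip safely
    apply_business_logic(message)   # placeholder for the real work
    with _lock:
        _processed.add(message.id)  # record only after success, so a failed
                                    # attempt is retried rather than skipped
```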
Batching
Instead of processing one message at a time, consume and process messages in batches:
- Reduces per-message overhead (fewer roundtrips to the broker, fewer database transactions)
- Particularly effective for bulk database writes (batch insert instead of insert-per-message)
- Trade-off: increases latency for individual messages; increases impact if a batch fails
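A sketch of the batched shape in Python; `consumer`, `db`, `transform`, and `insert_orders` are stand-ins for your broker client and data layer:

```python
def consume_batch(consumer, db, batch_size: int = 100) -> None:
    """Consume up to batch_size messages and write them in one transaction."""
    messages = consumer.poll(max_messages=batch_size, timeout_seconds=5)
    if not messages:
        return
    rows = [transform(m) for m in messages]  # per-message transformation
    with db.begin():                         # one transaction, not one per message
        db.execute(insert_orders, rows)      # single batched INSERT
    consumer.acknowledge(messages)           # ack only after the commit succeeds
```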
Connection pooling
Integration services that make database or HTTP calls must use connection pooling:
- Establish a pool of connections at startup; reuse them for each message
- Never open and close a connection per message — the overhead is significant at scale
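For example, in Python with `requests` and SQLAlchemy (the endpoint and DSN are illustrative), the session and engine are created once at startup and reused for every message:

```python
import requests
from sqlalchemy import create_engine, text

http = requests.Session()  # keep-alive connection reuse for HTTP calls
engine = create_engine(
    "postgresql://integration:secret@db-host/orders",  # illustrative DSN
    pool_size=10,        # connections held open in the pool
    max_overflow=5,      # short-lived extras allowed under burst load
    pool_pre_ping=True,  # validate a connection before handing it out
)

def handle(order: dict) -> None:
    resp = http.post("https://wms.example.com/api/stock", json=order)  # hypothetical endpoint
    resp.raise_for_status()
    with engine.begin() as conn:  # borrows from the pool, auto-returns on exit
        conn.execute(
            text("UPDATE orders SET status = 'SENT' WHERE id = :id"),
            {"id": order["orderId"]},
        )
```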
Caching
Integration flows often perform lookups against reference data (currency codes, product codes, customer classifications). Cache these lookups:
- Use in-process caching for frequently accessed, slowly changing reference data
- Refresh the cache on a schedule (not per message)
- Be aware that a stale cache can produce incorrect transformation results
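A hand-rolled TTL cache in Python illustrates the refresh-on-schedule pattern; `load_warehouse_codes` is a hypothetical bulk loader:

```python
import time

class ReferenceCache:
    """In-process cache refreshed on a schedule, not per message."""

    def __init__(self, loader, ttl_seconds: int = 300):
        self._loader = loader        # callable that fetches the full table
        self._ttl = ttl_seconds
        self._data: dict = {}
        self._loaded_at = 0.0

    def get(self, key):
        if time.monotonic() - self._loaded_at > self._ttl:
            self._data = self._loader()       # one bulk refresh
            self._loaded_at = time.monotonic()
        return self._data.get(key)            # stale by at most ttl_seconds

# Usage: warehouse_codes = ReferenceCache(load_warehouse_codes, ttl_seconds=600)
```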
Async patterns for slow downstream systems
If the target system is slow, do not block the consumer waiting for it:
- Write to a queue and let a separate consumer call the slow system at its own pace
- Use the request-reply pattern with a callback when the result is eventually available
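A minimal in-process sketch of the decoupling in Python; in production the buffer would be a durable broker queue rather than a `queue.Queue`, and `call_slow_target` is a placeholder:

```python
import queue
import threading

# In production this buffer would be a durable broker queue; the shape
# of the decoupling is the same.
outbound: queue.Queue = queue.Queue()

def fast_consumer(message: dict) -> None:
    outbound.put(message)  # acknowledge quickly; do not wait for the target

def slow_system_worker() -> None:
    while True:
        msg = outbound.get()
        call_slow_target(msg)  # placeholder for the slow downstream call
        outbound.task_done()

threading.Thread(target=slow_system_worker, daemon=True).start()
```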
Compression
For high-volume messaging with large payloads, enable message compression (gzip, LZ4, Snappy). Most message brokers support this natively. Effective for text-based payloads (JSON, XML); minimal benefit for already-compressed binary formats.
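For example, with the kafka-python client this is a single producer setting (other brokers expose an equivalent option; the broker address and topic are illustrative):

```python
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="broker:9092",   # illustrative address
    compression_type="gzip",           # "lz4" and "snappy" are also supported
)
producer.send("order.created", b'{"orderId": "ORD-1234"}')
```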
Managing and Resolving Integration Errors
Error Classification and Response Matrix
| Error Class | Examples | Immediate Action | Resolution Owner |
|---|---|---|---|
| Transient | Network timeout, 503 | Retry (exponential backoff) | Automatic |
| Data quality | Missing field, invalid format | DLQ, alert data owner | Source system team |
| Business rule | Duplicate key, unknown reference | DLQ, alert business | Business / data steward |
| Configuration | Wrong endpoint, expired cert | Alert ops, pause flow | Operations team |
| Code defect | NullPointerException in transform | DLQ, alert dev team | Development team |
| Capacity | Consumer too slow, queue growing | Scale out, alert ops | Operations / architecture |
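A sketch of how the matrix can drive automated handling in Python; the exception-to-class mapping shown is illustrative and should reflect the exceptions your own client libraries actually raise:

```python
import requests

def classify(exc: Exception) -> str:
    """Map an exception to the immediate action from the matrix above."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "retry"             # transient: retry with exponential backoff
    if isinstance(exc, requests.HTTPError) and exc.response is not None \
            and exc.response.status_code == 503:
        return "retry"
    if isinstance(exc, (KeyError, ValueError)):
        return "dead_letter"       # data quality: park in DLQ, alert data owner
    return "dead_letter_and_page"  # likely a code defect: DLQ plus dev alert
```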
SLA Management for Integration Errors
Define and agree SLAs for error resolution with the business:
| Severity | Definition | Response Time | Resolution Time |
|---|---|---|---|
| P1 | Integration completely down; business process halted | 15 minutes | 4 hours |
| P2 | Integration degraded; some transactions failing | 1 hour | 8 hours |
| P3 | Non-critical errors; DLQ messages requiring manual review | 4 hours | 2 business days |
| P4 | Cosmetic / low impact | Next business day | Next sprint |
DLQ Review Process
Establish a regular DLQ review cadence (at minimum, weekly for non-production; daily or real-time alerting for production):
- Categorise messages by error type
- Identify root cause for each category
- Fix root cause before resubmitting
- Resubmit messages in controlled batches
- Monitor post-resubmission to confirm success
- Update runbook with new failure patterns and resolutions
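A sketch of controlled batch resubmission in Python; `dlq`, `target_queue`, and `check_error_rate` are stand-ins for your broker client and monitoring query:

```python
def resubmit_dlq(dlq, target_queue, batch_size: int = 50) -> None:
    """Resubmit DLQ messages in controlled batches; stop on re-failure."""
    while True:
        batch = dlq.receive(max_messages=batch_size)
        if not batch:
            break
        for msg in batch:
            target_queue.send(msg.body, headers=msg.headers)
            dlq.complete(msg)        # remove from the DLQ only after resend
        if check_error_rate() > 0:   # placeholder: query your monitoring
            raise RuntimeError("Re-failures detected: stop and re-diagnose")
```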
Upgrading and Scaling Integration Solutions
Scaling Patterns
Vertical scaling — increase CPU and memory of integration service instances. Simple but limited; reaches a ceiling.
Horizontal scaling — add more instances of the integration service. Scales much further, but requires instances to be stateless or to externalise shared state.
Partitioned consumers — for ordered processing, partition the message stream by a key (customer ID, order ID) and assign each partition to a dedicated consumer instance.
Autoscaling — configure cloud platforms to scale consumer instances automatically based on queue depth or CPU utilisation. Azure Container Apps, Kubernetes KEDA, and AWS ECS with queue-based autoscaling support this natively.
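With Kafka, for example, partitioned ordering is achieved by producing with a key; a sketch using the kafka-python client (broker address and topic are illustrative):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")

# Messages sharing a key always land on the same partition, so ordering
# is preserved per customer while different customers process in parallel.
producer.send(
    "order.events",
    key=b"CUST-42",                      # partition key: the customer ID
    value=b'{"orderId": "ORD-1234"}',
)
```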
Upgrading Integration Components
Integration upgrades require careful coordination because multiple systems depend on each component:
Message broker upgrades:
- Test the upgrade in a lower environment first
- Check for breaking API changes in the broker SDK
- Schedule during a low-traffic window
- Plan for a rollback if the upgrade causes issues
Platform / middleware upgrades:
- Review release notes for deprecations that affect your integrations
- Update SDK dependencies before upgrading the platform
- Run the full integration test suite after upgrade
Schema migrations:
- Follow the versioning strategy from Module 6
- Deploy the new schema version alongside the old version
- Migrate consumers before retiring the old version
Maintaining a Healthy Integration Architecture
Technical Debt in Integration
Integration technical debt accumulates faster than most teams expect:
- Undocumented integrations added without going through the governance process
- Hard-coded values that should be configuration
- Transformation logic that has grown far beyond its original design
- Point-to-point connections added to avoid the ESB
- Deprecated integrations still running because nobody is confident enough to remove them
Quarterly integration health review:
- Review the integration registry for accuracy — are all live integrations listed? Are deprecated ones removed?
- Check for unused or underused integrations (no messages in 90 days)
- Review error rates and DLQ trends — are any integrations chronically failing?
- Review performance metrics — are any integrations approaching capacity limits?
- Review security: are all credentials rotated? Are all certificates valid for the next 90 days?
Integration Roadmap Maintenance
Keep a rolling 12-month integration roadmap that captures:
- Planned new integrations (with business case and timeline)
- Planned upgrades (platform versions, schema migrations)
- Planned deprecations (what will be retired and when)
- Capacity investments (scaling plans based on projected growth)
The roadmap is reviewed and updated quarterly in alignment with the business and technology roadmap.
Documentation as a First-Class Concern
Integration documentation degrades faster than application documentation because:
- Integration flows are more complex and span more systems
- System owners change, and new owners do not inherit knowledge of integration dependencies
- Undocumented integrations are the ones most likely to be accidentally broken
Maintain as living documentation:
- The integration registry (always current)
- Integration Design Documents (updated whenever a change is made)
- Runbooks for common failure scenarios
- Architecture diagrams (updated at minimum quarterly)
Module Summary
- Monitor availability, throughput, latency, error rate, and DLQ depth for every integration. Alert on deviations — do not wait for users to report failures.
- Structured logging with consistent fields (correlationId, messageId, stage, status) is the foundation of effective troubleshooting. Distributed tracing gives end-to-end visibility for complex flows.
- Optimise based on measured bottlenecks: increase consumer parallelism, enable batching, use connection pooling, cache reference data, and compress large payloads.
- Classify errors by root cause (transient, data quality, configuration, code defect, capacity) and respond accordingly. Establish P1–P4 SLAs and a DLQ review cadence.
- Maintain the integration architecture actively: quarterly health reviews, a rolling roadmap, up-to-date documentation, and proactive credential and certificate rotation.
You have completed the Integration Architect Training. Return to the overview to review the capstone project brief and design your end-to-end integration architecture.