Integration Engineering · Intermediate

Lecture 7: Data Integration Strategy

Understand the questions a data integration strategy must answer: master data management, canonical data models, data quality, data lineage, governance, and how integrated data sources support business decision-making.

SystemForge · April 18, 2026 · 11 min read
Data Integration · MDM · Data Quality · Data Lineage · Canonical Data Model · FITech

Data integration is not just about moving data from one system to another — it is about making data reliable, consistent, and useful for decision-making across the organisation. A data integration strategy answers fundamental questions: What data is authoritative? How do we ensure quality? Where does data come from and where does it go? Who is responsible for it? This lecture addresses each of these questions.


Why Data Integration Strategy Matters

Without a data integration strategy, organisations accumulate a collection of data problems:

  • Multiple versions of truth — the same customer has different email addresses in the CRM, the ERP, and the marketing platform
  • Unknown lineage — nobody knows where a key metric comes from or whether to trust it
  • Data quality degradation — errors introduced in one system propagate to all downstream systems through integrations
  • Compliance risk — personal data flows across systems without documentation or consent tracking
  • Decision paralysis — leaders distrust their data and make decisions on gut feeling instead

A data integration strategy provides the framework to prevent these problems by defining standards, ownership, and governance for data that flows across system boundaries.


Question 1: What Is the Authoritative Source?

The most important question in data integration: for any given data element, which system is the system of record?

The system of record (or golden source) is the authoritative version of a data entity. All other systems derive or replicate their copy from it.

| Data Entity | System of Record | Other systems that hold a copy |
|-------------|------------------|--------------------------------|
| Customer | CRM | ERP, Marketing Platform, Support |
| Product | ERP | E-commerce, Warehouse, Analytics |
| Employee | HR System | Identity Provider, Payroll, Finance |
| Order | Order Management | Warehouse, Finance, Analytics, CRM |

Rules for the system of record:

  • Only the system of record creates or updates this entity
  • All other systems receive updates through integration flows from the system of record
  • Conflicts are resolved in favour of the system of record

When there is no clear system of record for an entity, there is no single source of truth — and the organisation will have data inconsistency problems.
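
Conflict resolution in an integration flow can then be reduced to a single rule: the system-of-record value always wins. Below is a minimal Python sketch of that rule; the entity names, record shapes, and system labels are illustrative assumptions, not part of any specific product.

    # Hypothetical sketch: resolve a conflicting entity in favour of the
    # system of record. All names and record shapes are assumptions.
    SYSTEM_OF_RECORD = {
        "customer": "CRM",
        "product": "ERP",
        "employee": "HR",
        "order": "OrderManagement",
    }

    def resolve_conflict(entity_type: str, candidates: dict) -> dict:
        """candidates maps system name -> that system's copy of the record.
        The copy from the system of record is returned; every other copy
        is treated as a derived replica and never wins."""
        sor = SYSTEM_OF_RECORD[entity_type]
        if sor not in candidates:
            raise ValueError(f"no record from system of record '{sor}'")
        return candidates[sor]

    # The CRM and ERP disagree on a customer's email address:
    crm = {"id": "12345", "email": "anna@example.com"}
    erp = {"id": "C-99", "email": "anna.old@example.com"}
    winner = resolve_conflict("customer", {"CRM": crm, "ERP": erp})
    # winner["email"] == "anna@example.com" — the CRM value wins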


Question 2: How Is Master Data Managed?

Master Data Management (MDM) is the practice of ensuring that the organisation's most critical shared data entities — customers, products, locations, employees — are accurate, consistent, and available across all systems.

The Master Data Problem

In large organisations, the same customer might exist as:

  • 3 different records in the CRM (created by different sales reps who did not search first)
  • A different record in the ERP with a slightly different spelling
  • A different record in the support system with an older email address

Without MDM, integrations replicate this mess across every system they connect.

MDM Approaches

Registry MDM
A central registry stores a cross-reference (the "golden record") of how a customer maps across systems, without replacing data in those systems.

MDM Registry:
  Customer X → CRM ID: 12345, ERP ID: C-99, Support ID: USR-7890

Consolidation MDM
Data from multiple systems is pulled into a central MDM hub, de-duplicated, and a golden record is created. The hub is read-only — source systems keep their data.

Centralised MDM
The MDM hub is the system of record. All systems create and update master data through the hub, which distributes updates to subscribing systems.

Coexistence MDM
Both the MDM hub and source systems can update master data, with synchronisation and conflict resolution between them.

De-duplication

MDM systems use matching algorithms to identify duplicate records (a sketch follows the list):

  • Deterministic matching — exact match on one or more fields (e.g., same email address = same person)
  • Probabilistic matching — fuzzy matching using weighted field comparisons (e.g., similar name + same postcode = likely same person)
  • Survivorship rules — when merging duplicates, which field value "survives" (e.g., use the most recently updated record, or the longest version of the name)
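
The three ideas combine naturally in code. The following is a minimal, hypothetical Python sketch using only the standard library; the matching threshold, field weights, and survivorship rule are illustrative assumptions rather than recommended values.

    from difflib import SequenceMatcher

    def deterministic_match(a: dict, b: dict) -> bool:
        # Exact match on one identifying field: same email = same person.
        return bool(a.get("email")) and a["email"].lower() == b.get("email", "").lower()

    def probabilistic_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
        # Fuzzy, weighted comparison: similar name + same postcode = likely same person.
        name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        same_postcode = 1.0 if a.get("postcode") == b.get("postcode") else 0.0
        score = 0.7 * name_sim + 0.3 * same_postcode   # weights are assumptions
        return score >= threshold

    def merge(a: dict, b: dict) -> dict:
        # Survivorship rule (assumed): the most recently updated record wins
        # field by field; missing values fall back to the older record.
        newer, older = (a, b) if a["updated_at"] >= b["updated_at"] else (b, a)
        return {**older, **{k: v for k, v in newer.items() if v}}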

Question 3: How Is Data Quality Ensured?

Data quality is the fitness of data for its intended use. Poor-quality data does not stay in one place: integrations propagate it to every downstream system and use case.

Data Quality Dimensions

| Dimension | Definition | Example of failure |
|-----------|------------|--------------------|
| Completeness | All required fields are present | Customer record missing email address |
| Accuracy | Data correctly represents the real-world entity | Wrong phone number, outdated address |
| Consistency | Same entity has the same values across systems | Customer has different birth dates in HR and CRM |
| Timeliness | Data is available when needed | Stock levels updated 24 hours after actual change |
| Validity | Data conforms to defined formats and rules | Date field contains "N/A" instead of a date |
| Uniqueness | No unintended duplicates | Three customer records for the same person |

Data Quality in Integration Flows

Integration is both a potential source of data quality degradation and an enforcement point:

As a risk: transformations that map fields incorrectly, truncate values, or lose precision can corrupt data as it flows between systems.

As an enforcement point: validation at the integration layer can catch quality issues before they reach downstream systems:

Source data → [Validation layer] → Target system              (valid messages)
                      ↓
          Reject to DLQ + alert data owner                    (invalid messages)

Validation rules to implement at integration boundaries (a minimal sketch follows the list):

  • Required field checks — reject messages missing mandatory fields
  • Format validation — dates in correct format, numeric fields are numeric, email addresses are valid
  • Business rule validation — product codes exist in the product catalogue, customer IDs are known
  • Referential integrity — order items reference existing products
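
A minimal Python sketch of such a validation layer follows. The message shape, the in-memory product catalogue, and the email pattern are assumptions; a production system would typically use a schema validator and a real catalogue lookup.

    import re
    from datetime import date

    KNOWN_PRODUCTS = {"SKU-001", "SKU-002"}   # stands in for a catalogue lookup
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def validate_order(msg: dict) -> list[str]:
        """Return a list of validation errors. An empty list means the message
        may pass to the target system; otherwise it is routed to the DLQ."""
        errors = []
        # Required field checks
        for field in ("order_id", "customer_email", "order_date", "items"):
            if not msg.get(field):
                errors.append(f"missing required field: {field}")
        # Format validation
        if msg.get("customer_email") and not EMAIL_RE.match(msg["customer_email"]):
            errors.append("customer_email is not a valid email address")
        try:
            date.fromisoformat(str(msg.get("order_date") or ""))
        except ValueError:
            errors.append("order_date is not an ISO 8601 date")
        # Business rule + referential integrity: items must reference known products
        for item in msg.get("items", []):
            if item.get("sku") not in KNOWN_PRODUCTS:
                errors.append(f"unknown product code: {item.get('sku')}")
        return errors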

Data Quality Monitoring

Track data quality metrics over time (a counter sketch follows the list):

  • Validation failure rate per source system
  • Most common validation error types
  • Records rejected to DLQ per day
  • Time to resolve data quality incidents
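
As a small illustration, the first metric can be derived directly from the validation layer's outcomes. The counters below are a hypothetical sketch; a real deployment would export these values to a monitoring system.

    from collections import Counter

    accepted = Counter()   # messages that passed validation, per source system
    rejected = Counter()   # messages routed to the DLQ, per source system

    def record_outcome(source_system: str, errors: list) -> None:
        (rejected if errors else accepted)[source_system] += 1

    def validation_failure_rate(source_system: str) -> float:
        total = accepted[source_system] + rejected[source_system]
        return rejected[source_system] / total if total else 0.0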

Question 4: How Is Data Lineage Tracked?

Data lineage documents where data comes from, how it is transformed, and where it goes. It answers:

  • "This report shows X — where does X come from?"
  • "We changed System A — which downstream reports will be affected?"
  • "A GDPR deletion request came in for Customer Y — in which systems does their data exist?"

Why Lineage Matters

Regulatory compliance: GDPR requires knowing where personal data flows. Without lineage, you cannot answer a data subject access request or erasure request reliably.

Impact analysis: before changing a data source or transformation, you need to know what depends on it. Lineage maps these dependencies.

Trust: data consumers — analysts, decision-makers — need to trust their data. Documented lineage lets them trace a metric back to its source and verify its integrity.

Debugging: when data looks wrong in a report, lineage tells you where to look for the root cause.

Capturing Lineage

Manual lineage documentation: maintain a data catalogue or integration registry that records, for each integration (an example entry follows the list):

  • Source system and data entity
  • Transformations applied
  • Target system and data entity
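
One entry in such a registry might look like the following; the field names are illustrative assumptions, shown as a Python literal for concreteness.

    # Hypothetical shape of one manually maintained lineage entry.
    lineage_entry = {
        "integration": "crm-to-warehouse-customers",
        "source": {"system": "CRM", "entity": "Customer"},
        "transformations": [
            "split full_name into first_name / last_name",
            "normalise country to ISO 3166-1 alpha-2",
        ],
        "target": {"system": "Data Warehouse", "entity": "dim_customer"},
        "owner": "customer-data-team",
        "contains_personal_data": True,   # flags the flow for GDPR erasure handling
    }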

Automated lineage: modern data platforms (Azure Purview, AWS Glue Data Catalog, Apache Atlas) can automatically capture lineage as data moves through integration pipelines.


Question 5: What Data Exchange Formats Are Used?

The choice of data format affects interoperability, performance, schema evolution, and tooling.

Common Data Exchange Formats

CSV (Comma-Separated Values)
Simple, universal, human-readable. No schema enforcement. Good for bulk data transfer and reporting exports. Fragile — any variation in delimiter, quoting, or encoding can cause parsing failures.

JSON (JavaScript Object Notation)
Dominant format for REST APIs and modern integration. Human-readable, flexible. No built-in schema enforcement (use JSON Schema for validation). Verbose compared to binary formats.
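
For instance, validation against a JSON Schema can be added at the integration boundary with the widely used Python jsonschema package; the schema below is a minimal assumed example, not a canonical customer model.

    from jsonschema import validate, ValidationError   # pip install jsonschema

    customer_schema = {
        "type": "object",
        "required": ["id", "email"],
        "properties": {
            "id": {"type": "string"},
            "email": {"type": "string"},
        },
    }

    try:
        validate(instance={"id": "12345"}, schema=customer_schema)
    except ValidationError as e:
        print(f"invalid message: {e.message}")   # 'email' is a required property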

XML (Extensible Markup Language)
The older standard, still dominant in SOAP web services, EDI, healthcare (HL7), and financial systems (FIX, SWIFT). Verbose but schema enforcement via XSD is mature and widely supported.

Apache Avro
Binary serialisation format defined by a schema. Compact, fast, with excellent schema evolution. The de facto standard in Kafka-based data pipelines, typically used alongside a schema registry.

Protocol Buffers (Protobuf)
Google's binary serialisation format. Very compact, very fast, strongly typed. Used with gRPC.

EDI (Electronic Data Interchange)
Legacy standard for B2B data exchange. ANSI X12 (North America) and EDIFACT (international) are common flavours. Still dominant in retail, logistics, and healthcare B2B integration.

Parquet / ORC
Columnar storage formats for analytical data (data lakes, data warehouses). Not used for transactional integration — used for bulk analytical loads.

Choosing a Format

| Scenario | Recommended Format |
|----------|--------------------|
| REST API | JSON |
| High-throughput Kafka pipeline | Avro |
| gRPC service | Protobuf |
| SOAP web service | XML |
| B2B EDI | ANSI X12 or EDIFACT |
| Legacy batch file | CSV or XML |
| Data warehouse bulk load | Parquet |


Question 6: What Affects Protocol Choice?

The protocol is the transport mechanism — the "how" of data delivery, not the "what."

Common Integration Protocols

| Protocol | Characteristics | Common uses |
|----------|-----------------|-------------|
| HTTP/S | Synchronous, stateless, universal | REST APIs, webhooks |
| AMQP | Async messaging, reliable | RabbitMQ, Azure Service Bus |
| MQTT | Lightweight, low-bandwidth | IoT devices, sensors |
| SFTP | File transfer over SSH | Batch file exchange, B2B |
| FTP/S | File transfer | Legacy batch; avoid where possible |
| JMS | Java messaging standard | Java-based enterprise systems |
| Kafka protocol | High-throughput streaming | Kafka clients |
| AS2 | Secure B2B messaging | EDI over internet (AS2 + EDI) |

Protocol selection factors:

  • Reliability required? → AMQP or Kafka (not HTTP without retry logic)
  • Real-time? → HTTP or AMQP
  • Legacy system constraint? → check what the system supports
  • B2B partner? → check their supported protocols (often AS2, SFTP)
  • Volume? → Kafka for millions of messages per second; plain HTTP for hundreds of requests per second
  • Security requirement? → SFTP, AS2, or a TLS-wrapped protocol (HTTPS, FTPS, AMQPS)

Question 7: How Is Data Governance Applied to Integration?

Data governance defines the policies, standards, and accountabilities for managing data as a strategic asset.

In the context of integration, governance covers:

Data ownership: who is accountable for the quality and completeness of each data entity?

Access control: which systems are permitted to access which data? Integration flows must not move data to systems without authorisation.

Data classification: what is the sensitivity of the data (public, internal, confidential, restricted)? Classification determines encryption, access, and retention requirements.

Retention and deletion: how long can data be retained in integration queues and logs? When data is deleted in the system of record, it must also be deleted from all systems that received it through integration.

Audit trail: every data movement through an integration must be logged for compliance purposes.
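
A single audit-trail entry could be as simple as the structure below; the fields are illustrative assumptions aligned with the governance concerns above, not a standard schema.

    import json
    import uuid
    from datetime import datetime, timezone

    def audit_record(flow: str, source: str, target: str,
                     entity_id: str, classification: str) -> str:
        # Hypothetical audit-trail entry for one data movement.
        return json.dumps({
            "event_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "integration_flow": flow,
            "source_system": source,
            "target_system": target,
            "entity_id": entity_id,            # which record moved
            "classification": classification,  # e.g. "confidential"
        })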


Integrated Data Sources in Decision-Making

One of the course's core themes: integrated data sources play an important role in the decision-making process.

From Data Silos to Data Products

When data exists only in individual systems, decisions are made with incomplete information. When data is integrated:

  • Finance can see customer purchase data alongside payment history
  • Marketing can see support tickets alongside campaign engagement
  • Operations can see real-time inventory alongside demand forecasts

Integration turns siloed data into data products — curated, trustworthy datasets that decision-makers can rely on.

The Data Warehouse and Data Lake

A data warehouse consolidates data from operational systems into a structure optimised for reporting and analysis. Integration (ETL/ELT) is what feeds it.

A data lake stores raw data from all sources in its native format, allowing flexible analysis. Integration pipelines (often event-driven or CDC-based) keep the lake current.

The quality of decision-making analytics is directly limited by the quality of the integration that feeds the warehouse or lake.


Lecture 7 Summary

  • Every data entity needs a defined system of record. Ambiguity leads to inconsistency and conflict.
  • MDM ensures that critical shared entities (customers, products, employees) are accurate and consistent across all systems. Choose the MDM approach (registry, consolidation, centralised, coexistence) based on your organisational constraints.
  • Data quality must be enforced at integration boundaries — validate, reject bad data with clear error messages, and monitor quality metrics over time.
  • Data lineage documents where data comes from and where it goes — essential for GDPR compliance, impact analysis, and building trust in analytics.
  • Format choice (JSON, XML, Avro, EDI) and protocol choice (HTTP, AMQP, SFTP, Kafka) should be driven by reliability, performance, legacy system constraints, and partner requirements.
  • Data governance defines the policies for ownership, access, classification, retention, and auditing of data as it flows through integrations.

Next: Lecture 8 — Enterprise Integration Patterns
