Data Modelling · Lesson 3 of 3
Data Vault 2.0: Hubs, Links, Satellites, and Loading
What Is Data Vault?
Data Vault is a modelling methodology designed for enterprise data warehouses that must:
- Integrate data from many sources (10, 50, 100+ systems)
- Handle schema changes without rebuilding pipelines
- Maintain a full audit trail of every record
- Load data in parallel without dependencies between tables
- Separate raw data storage from business logic
Invented by Dan Linstedt, the current version is Data Vault 2.0 (DV2.0).
1. Kimball vs Data Vault
| | Kimball (Dimensional) | Data Vault | |--|----------------------|-----------| | Design for | BI / analytics | Enterprise integration | | Sources | Few (3-10) | Many (10-100+) | | History | SCDs (Type 2) | Automatic — all data is append-only | | Schema changes | Hard (rebuild dims) | Easy (add satellites) | | Query complexity | Simple (star joins) | Complex (hub+sat joins) | | Loading | Dependent (dims before facts) | Parallel (all tables independent) | | Business logic | In ETL or models | Separated into Business Vault | | Best for | Analytics teams | Large enterprise DWH |
When to use Data Vault: enterprise with many sources, strong audit requirements, frequent schema changes, large team.
When to use Kimball: analytics-focused team, few sources, analysts query directly, simpler ops.
2. The Three Core Components
Hub — Business Keys
A Hub stores unique business keys for a business concept. Nothing else.
-- hub_customers
CREATE TABLE raw_vault.hub_customers (
customer_hk CHAR(32) NOT NULL, -- hash key = MD5(customer_id)
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL, -- which system this came from
customer_id VARCHAR NOT NULL, -- the business key
PRIMARY KEY (customer_hk)
);-- hub_orders
CREATE TABLE raw_vault.hub_orders (
order_hk CHAR(32) NOT NULL,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
order_id VARCHAR NOT NULL,
PRIMARY KEY (order_hk)
);Link — Relationships
A Link captures the relationship between two or more Hubs.
-- link_order_customer (an order belongs to a customer)
CREATE TABLE raw_vault.link_order_customer (
order_customer_hk CHAR(32) NOT NULL, -- hash key of the link
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
-- FKs to the hubs it connects
order_hk CHAR(32) NOT NULL REFERENCES hub_orders(order_hk),
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers(customer_hk),
PRIMARY KEY (order_customer_hk)
);-- link_order_product (many-to-many: an order has many products)
CREATE TABLE raw_vault.link_order_product (
order_product_hk CHAR(32) NOT NULL,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
order_hk CHAR(32) NOT NULL,
product_hk CHAR(32) NOT NULL,
PRIMARY KEY (order_product_hk)
);Satellite — Descriptive Attributes
A Satellite stores descriptive context (attributes) that change over time. Every change creates a new row — full history automatically.
-- sat_customers_crm (attributes from the CRM system)
CREATE TABLE raw_vault.sat_customers_crm (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
load_end_date TIMESTAMP_TZ, -- NULL = current record
record_source VARCHAR NOT NULL,
hash_diff CHAR(32) NOT NULL, -- hash of all attribute values (detect changes)
-- Attributes from CRM
email VARCHAR,
full_name VARCHAR,
city VARCHAR,
country_code CHAR(3),
phone_number VARCHAR,
PRIMARY KEY (customer_hk, load_date)
);
-- sat_customers_erp (attributes from the ERP system — separate satellite!)
CREATE TABLE raw_vault.sat_customers_erp (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
load_end_date TIMESTAMP_TZ,
record_source VARCHAR NOT NULL,
hash_diff CHAR(32) NOT NULL,
-- Attributes from ERP (different fields)
credit_limit DECIMAL(12, 2),
payment_terms INTEGER,
account_manager VARCHAR,
PRIMARY KEY (customer_hk, load_date)
);One Hub can have many Satellites — one per source system, or one per subject area.
3. Hash Keys
The hash key is the core identity mechanism:
-- Generate hash key in SQL (Snowflake)
MD5(UPPER(TRIM(customer_id))) AS customer_hk
-- For composite keys (multiple columns)
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk
-- Hash diff (detect attribute changes)
MD5(
COALESCE(UPPER(TRIM(email)), 'NULL') || '|' ||
COALESCE(UPPER(TRIM(full_name)), 'NULL') || '|' ||
COALESCE(UPPER(TRIM(city)), 'NULL')
) AS hash_diff-- In dbt using dbt_vault or AutomateDV
{{ automate_dv.hash('customer_id') }} AS customer_hk
{{ automate_dv.hash(['order_id', 'customer_id']) }} AS order_customer_hk
{{ automate_dv.hash_diff(['email', 'full_name', 'city']) }} AS hash_diff4. Loading Data Vault Tables
All Data Vault loads are INSERT-ONLY — no updates, no deletes. History is preserved automatically.
Loading a Hub (insert new keys only)
INSERT INTO raw_vault.hub_customers (customer_hk, load_date, record_source, customer_id)
SELECT DISTINCT
MD5(UPPER(TRIM(customer_id))) AS customer_hk,
CURRENT_TIMESTAMP AS load_date,
'orders_api' AS record_source,
customer_id
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.hub_customers h
WHERE h.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
);Loading a Link (insert new relationships only)
INSERT INTO raw_vault.link_order_customer
SELECT DISTINCT
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk,
CURRENT_TIMESTAMP AS load_date,
'orders_api' AS record_source,
MD5(UPPER(TRIM(order_id))) AS order_hk,
MD5(UPPER(TRIM(customer_id))) AS customer_hk
FROM staging_orders s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.link_order_customer l
WHERE l.order_customer_hk = MD5(UPPER(TRIM(s.order_id)) || '|' || UPPER(TRIM(s.customer_id)))
);Loading a Satellite (insert only if changed)
INSERT INTO raw_vault.sat_customers_crm (
customer_hk, load_date, record_source, hash_diff,
email, full_name, city, country_code
)
SELECT
MD5(UPPER(TRIM(customer_id))) AS customer_hk,
CURRENT_TIMESTAMP AS load_date,
'crm_system' AS record_source,
MD5(COALESCE(email,'') || '|' || COALESCE(full_name,'') || '|' || COALESCE(city,''))
AS hash_diff,
email, full_name, city, country_code
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.sat_customers_crm cur
WHERE cur.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
AND cur.hash_diff = MD5(COALESCE(s.email,'') || '|' || ...)
AND cur.load_end_date IS NULL -- current record
);5. Business Vault
The Business Vault sits between the Raw Vault and the presentation layer. It applies business rules without modifying the raw vault.
-- Business vault: customer with derived segment
CREATE TABLE business_vault.biz_customers AS
SELECT
h.customer_hk,
h.customer_id,
s.email,
s.full_name,
s.city,
s.country_code,
-- Business rule applied here (not in raw vault)
CASE
WHEN total_revenue > 10000 THEN 'Gold'
WHEN total_revenue > 1000 THEN 'Silver'
ELSE 'Bronze'
END AS customer_segment,
s.load_date AS valid_from
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_end_date IS NULL -- current records only
LEFT JOIN (
SELECT customer_hk, SUM(amount_usd) AS total_revenue
FROM business_vault.biz_order_summary
GROUP BY 1
) rev ON rev.customer_hk = h.customer_hk;6. Information Mart (Presentation Layer)
The final layer that analysts query. Looks like dimensional modelling (star schema) built from the vault.
-- information_mart: dim_customers (built from business vault)
CREATE TABLE information_mart.dim_customers AS
SELECT
customer_hk AS customer_sk,
customer_id,
email,
full_name,
city,
country_code,
customer_segment,
valid_from
FROM business_vault.biz_customers
WHERE valid_from <= CURRENT_TIMESTAMP;7. Data Vault with dbt (AutomateDV / dbtvault)
pip install dbt-snowflake
# packages.yml
packages:
- package: Datavault-UK/automate_dv
version: [">=0.10.0"]# models/raw_vault/hubs/hub_customers.yml
version: 2
models:
- name: hub_customers
config:
materialized: incremental
unique_key: customer_hk-- models/raw_vault/hubs/hub_customers.sql
{{
automate_dv.hub(
src_pk='customer_hk',
src_nk='customer_id',
src_ldts='load_date',
src_source='record_source',
source_model='stg_customers'
)
}}-- models/raw_vault/satellites/sat_customers_crm.sql
{{
automate_dv.sat(
src_pk='customer_hk',
src_hashdiff='hash_diff',
src_payload=['email', 'full_name', 'city', 'country_code'],
src_ldts='load_date',
src_source='record_source',
source_model='stg_customers_crm'
)
}}8. Query Patterns
-- Current state of a customer (latest satellite record)
SELECT h.customer_id, s.email, s.city, s.country_code
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_end_date IS NULL;
-- Full history for a customer
SELECT h.customer_id, s.email, s.city, s.load_date, s.load_end_date
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s ON h.customer_hk = s.customer_hk
WHERE h.customer_id = 'CUST-001'
ORDER BY s.load_date;
-- Point-in-time: customer attributes at a specific time
SELECT h.customer_id, s.email, s.city
FROM hub_customers h
JOIN sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_date <= '2026-03-01 00:00:00'
AND (s.load_end_date IS NULL OR s.load_end_date > '2026-03-01 00:00:00')
WHERE h.customer_id = 'CUST-001';
-- Cross-system: combine CRM and ERP data for a customer
SELECT
h.customer_id,
crm.email,
crm.city,
erp.credit_limit,
erp.payment_terms
FROM hub_customers h
JOIN sat_customers_crm crm ON h.customer_hk = crm.customer_hk AND crm.load_end_date IS NULL
JOIN sat_customers_erp erp ON h.customer_hk = erp.customer_hk AND erp.load_end_date IS NULL;Summary
| Component | Purpose | |-----------|---------| | Hub | Stores unique business keys — one hub per business concept | | Link | Stores relationships between hubs | | Satellite | Stores descriptive attributes — one per source system | | Hash key | MD5 of business key — the identity mechanism | | Hash diff | MD5 of attributes — detects changes for satellite loading | | Raw Vault | No business rules — exactly what came from the source | | Business Vault | Business rules applied on top of raw vault | | Information Mart | Star schema presentation layer for analysts | | Insert-only | All loads are append — full audit trail automatically |
Data Vault vs Kimball decision:
- Choose Kimball for analytics-first, few sources, simpler stack
- Choose Data Vault for enterprise, many sources, audit requirements, frequent schema changes
You now have a complete data modelling foundation. Next: Git workflow and CI/CD pipelines for data engineering teams.