Data Vault: Hubs, Links, Satellites, and Scalable Warehouse Architecture
Master Data Vault 2.0 — the enterprise data warehouse methodology that scales across sources and time. Hubs, Links, Satellites, loading patterns, business vault, and when to choose Data Vault over Kimball.
What Is Data Vault?
Data Vault is a modelling methodology designed for enterprise data warehouses that must:
- Integrate data from many sources (10, 50, 100+ systems)
- Handle schema changes without rebuilding pipelines
- Maintain a full audit trail of every record
- Load data in parallel without dependencies between tables
- Separate raw data storage from business logic
Invented by Dan Linstedt, the current version is Data Vault 2.0 (DV2.0).
1. Kimball vs Data Vault
| | Kimball (Dimensional) | Data Vault | |--|----------------------|-----------| | Design for | BI / analytics | Enterprise integration | | Sources | Few (3-10) | Many (10-100+) | | History | SCDs (Type 2) | Automatic — all data is append-only | | Schema changes | Hard (rebuild dims) | Easy (add satellites) | | Query complexity | Simple (star joins) | Complex (hub+sat joins) | | Loading | Dependent (dims before facts) | Parallel (all tables independent) | | Business logic | In ETL or models | Separated into Business Vault | | Best for | Analytics teams | Large enterprise DWH |
When to use Data Vault: enterprise with many sources, strong audit requirements, frequent schema changes, large team.
When to use Kimball: analytics-focused team, few sources, analysts query directly, simpler ops.
2. The Three Core Components
Hub — Business Keys
A Hub stores unique business keys for a business concept. Nothing else.
-- hub_customers
CREATE TABLE raw_vault.hub_customers (
customer_hk CHAR(32) NOT NULL, -- hash key = MD5(customer_id)
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL, -- which system this came from
customer_id VARCHAR NOT NULL, -- the business key
PRIMARY KEY (customer_hk)
);-- hub_orders
CREATE TABLE raw_vault.hub_orders (
order_hk CHAR(32) NOT NULL,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
order_id VARCHAR NOT NULL,
PRIMARY KEY (order_hk)
);Link — Relationships
A Link captures the relationship between two or more Hubs.
-- link_order_customer (an order belongs to a customer)
CREATE TABLE raw_vault.link_order_customer (
order_customer_hk CHAR(32) NOT NULL, -- hash key of the link
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
-- FKs to the hubs it connects
order_hk CHAR(32) NOT NULL REFERENCES hub_orders(order_hk),
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers(customer_hk),
PRIMARY KEY (order_customer_hk)
);-- link_order_product (many-to-many: an order has many products)
CREATE TABLE raw_vault.link_order_product (
order_product_hk CHAR(32) NOT NULL,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR NOT NULL,
order_hk CHAR(32) NOT NULL,
product_hk CHAR(32) NOT NULL,
PRIMARY KEY (order_product_hk)
);Satellite — Descriptive Attributes
A Satellite stores descriptive context (attributes) that change over time. Every change creates a new row — full history automatically.
-- sat_customers_crm (attributes from the CRM system)
CREATE TABLE raw_vault.sat_customers_crm (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
load_end_date TIMESTAMP_TZ, -- NULL = current record
record_source VARCHAR NOT NULL,
hash_diff CHAR(32) NOT NULL, -- hash of all attribute values (detect changes)
-- Attributes from CRM
email VARCHAR,
full_name VARCHAR,
city VARCHAR,
country_code CHAR(3),
phone_number VARCHAR,
PRIMARY KEY (customer_hk, load_date)
);
-- sat_customers_erp (attributes from the ERP system — separate satellite!)
CREATE TABLE raw_vault.sat_customers_erp (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customers,
load_date TIMESTAMP_TZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
load_end_date TIMESTAMP_TZ,
record_source VARCHAR NOT NULL,
hash_diff CHAR(32) NOT NULL,
-- Attributes from ERP (different fields)
credit_limit DECIMAL(12, 2),
payment_terms INTEGER,
account_manager VARCHAR,
PRIMARY KEY (customer_hk, load_date)
);One Hub can have many Satellites — one per source system, or one per subject area.
3. Hash Keys
The hash key is the core identity mechanism:
-- Generate hash key in SQL (Snowflake)
MD5(UPPER(TRIM(customer_id))) AS customer_hk
-- For composite keys (multiple columns)
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk
-- Hash diff (detect attribute changes)
MD5(
COALESCE(UPPER(TRIM(email)), 'NULL') || '|' ||
COALESCE(UPPER(TRIM(full_name)), 'NULL') || '|' ||
COALESCE(UPPER(TRIM(city)), 'NULL')
) AS hash_diff-- In dbt using dbt_vault or AutomateDV
{{ automate_dv.hash('customer_id') }} AS customer_hk
{{ automate_dv.hash(['order_id', 'customer_id']) }} AS order_customer_hk
{{ automate_dv.hash_diff(['email', 'full_name', 'city']) }} AS hash_diff4. Loading Data Vault Tables
All Data Vault loads are INSERT-ONLY — no updates, no deletes. History is preserved automatically.
Loading a Hub (insert new keys only)
INSERT INTO raw_vault.hub_customers (customer_hk, load_date, record_source, customer_id)
SELECT DISTINCT
MD5(UPPER(TRIM(customer_id))) AS customer_hk,
CURRENT_TIMESTAMP AS load_date,
'orders_api' AS record_source,
customer_id
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.hub_customers h
WHERE h.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
);Loading a Link (insert new relationships only)
INSERT INTO raw_vault.link_order_customer
SELECT DISTINCT
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk,
CURRENT_TIMESTAMP AS load_date,
'orders_api' AS record_source,
MD5(UPPER(TRIM(order_id))) AS order_hk,
MD5(UPPER(TRIM(customer_id))) AS customer_hk
FROM staging_orders s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.link_order_customer l
WHERE l.order_customer_hk = MD5(UPPER(TRIM(s.order_id)) || '|' || UPPER(TRIM(s.customer_id)))
);Loading a Satellite (insert only if changed)
INSERT INTO raw_vault.sat_customers_crm (
customer_hk, load_date, record_source, hash_diff,
email, full_name, city, country_code
)
SELECT
MD5(UPPER(TRIM(customer_id))) AS customer_hk,
CURRENT_TIMESTAMP AS load_date,
'crm_system' AS record_source,
MD5(COALESCE(email,'') || '|' || COALESCE(full_name,'') || '|' || COALESCE(city,''))
AS hash_diff,
email, full_name, city, country_code
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM raw_vault.sat_customers_crm cur
WHERE cur.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
AND cur.hash_diff = MD5(COALESCE(s.email,'') || '|' || ...)
AND cur.load_end_date IS NULL -- current record
);5. Business Vault
The Business Vault sits between the Raw Vault and the presentation layer. It applies business rules without modifying the raw vault.
-- Business vault: customer with derived segment
CREATE TABLE business_vault.biz_customers AS
SELECT
h.customer_hk,
h.customer_id,
s.email,
s.full_name,
s.city,
s.country_code,
-- Business rule applied here (not in raw vault)
CASE
WHEN total_revenue > 10000 THEN 'Gold'
WHEN total_revenue > 1000 THEN 'Silver'
ELSE 'Bronze'
END AS customer_segment,
s.load_date AS valid_from
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_end_date IS NULL -- current records only
LEFT JOIN (
SELECT customer_hk, SUM(amount_usd) AS total_revenue
FROM business_vault.biz_order_summary
GROUP BY 1
) rev ON rev.customer_hk = h.customer_hk;6. Information Mart (Presentation Layer)
The final layer that analysts query. Looks like dimensional modelling (star schema) built from the vault.
-- information_mart: dim_customers (built from business vault)
CREATE TABLE information_mart.dim_customers AS
SELECT
customer_hk AS customer_sk,
customer_id,
email,
full_name,
city,
country_code,
customer_segment,
valid_from
FROM business_vault.biz_customers
WHERE valid_from <= CURRENT_TIMESTAMP;7. Data Vault with dbt (AutomateDV / dbtvault)
pip install dbt-snowflake
# packages.yml
packages:
- package: Datavault-UK/automate_dv
version: [">=0.10.0"]# models/raw_vault/hubs/hub_customers.yml
version: 2
models:
- name: hub_customers
config:
materialized: incremental
unique_key: customer_hk-- models/raw_vault/hubs/hub_customers.sql
{{
automate_dv.hub(
src_pk='customer_hk',
src_nk='customer_id',
src_ldts='load_date',
src_source='record_source',
source_model='stg_customers'
)
}}-- models/raw_vault/satellites/sat_customers_crm.sql
{{
automate_dv.sat(
src_pk='customer_hk',
src_hashdiff='hash_diff',
src_payload=['email', 'full_name', 'city', 'country_code'],
src_ldts='load_date',
src_source='record_source',
source_model='stg_customers_crm'
)
}}8. Query Patterns
-- Current state of a customer (latest satellite record)
SELECT h.customer_id, s.email, s.city, s.country_code
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_end_date IS NULL;
-- Full history for a customer
SELECT h.customer_id, s.email, s.city, s.load_date, s.load_end_date
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s ON h.customer_hk = s.customer_hk
WHERE h.customer_id = 'CUST-001'
ORDER BY s.load_date;
-- Point-in-time: customer attributes at a specific time
SELECT h.customer_id, s.email, s.city
FROM hub_customers h
JOIN sat_customers_crm s
ON h.customer_hk = s.customer_hk
AND s.load_date <= '2026-03-01 00:00:00'
AND (s.load_end_date IS NULL OR s.load_end_date > '2026-03-01 00:00:00')
WHERE h.customer_id = 'CUST-001';
-- Cross-system: combine CRM and ERP data for a customer
SELECT
h.customer_id,
crm.email,
crm.city,
erp.credit_limit,
erp.payment_terms
FROM hub_customers h
JOIN sat_customers_crm crm ON h.customer_hk = crm.customer_hk AND crm.load_end_date IS NULL
JOIN sat_customers_erp erp ON h.customer_hk = erp.customer_hk AND erp.load_end_date IS NULL;Summary
| Component | Purpose | |-----------|---------| | Hub | Stores unique business keys — one hub per business concept | | Link | Stores relationships between hubs | | Satellite | Stores descriptive attributes — one per source system | | Hash key | MD5 of business key — the identity mechanism | | Hash diff | MD5 of attributes — detects changes for satellite loading | | Raw Vault | No business rules — exactly what came from the source | | Business Vault | Business rules applied on top of raw vault | | Information Mart | Star schema presentation layer for analysts | | Insert-only | All loads are append — full audit trail automatically |
Data Vault vs Kimball decision:
- Choose Kimball for analytics-first, few sources, simpler stack
- Choose Data Vault for enterprise, many sources, audit requirements, frequent schema changes
You now have a complete data modelling foundation. Next: Git workflow and CI/CD pipelines for data engineering teams.
Enjoyed this article?
Explore the Data Engineering learning path for more.
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.