Back to blog
Data Engineeringintermediate

Data Vault: Hubs, Links, Satellites, and Scalable Warehouse Architecture

Master Data Vault 2.0 — the enterprise data warehouse methodology that scales across sources and time. Hubs, Links, Satellites, loading patterns, business vault, and when to choose Data Vault over Kimball.

LearnixoMay 7, 20268 min read
data vaultdata modellingDV2.0hubslinkssatellitesdata warehouse
Share:𝕏

What Is Data Vault?

Data Vault is a modelling methodology designed for enterprise data warehouses that must:

  • Integrate data from many sources (10, 50, 100+ systems)
  • Handle schema changes without rebuilding pipelines
  • Maintain a full audit trail of every record
  • Load data in parallel without dependencies between tables
  • Separate raw data storage from business logic

Invented by Dan Linstedt, the current version is Data Vault 2.0 (DV2.0).


1. Kimball vs Data Vault

| | Kimball (Dimensional) | Data Vault | |--|----------------------|-----------| | Design for | BI / analytics | Enterprise integration | | Sources | Few (3-10) | Many (10-100+) | | History | SCDs (Type 2) | Automatic — all data is append-only | | Schema changes | Hard (rebuild dims) | Easy (add satellites) | | Query complexity | Simple (star joins) | Complex (hub+sat joins) | | Loading | Dependent (dims before facts) | Parallel (all tables independent) | | Business logic | In ETL or models | Separated into Business Vault | | Best for | Analytics teams | Large enterprise DWH |

When to use Data Vault: enterprise with many sources, strong audit requirements, frequent schema changes, large team.

When to use Kimball: analytics-focused team, few sources, analysts query directly, simpler ops.


2. The Three Core Components

Hub — Business Keys

A Hub stores unique business keys for a business concept. Nothing else.

SQL
-- hub_customers
CREATE TABLE raw_vault.hub_customers (
    customer_hk      CHAR(32)        NOT NULL,   -- hash key = MD5(customer_id)
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source    VARCHAR         NOT NULL,   -- which system this came from
    customer_id      VARCHAR         NOT NULL,   -- the business key

    PRIMARY KEY (customer_hk)
);
SQL
-- hub_orders
CREATE TABLE raw_vault.hub_orders (
    order_hk        CHAR(32)        NOT NULL,
    load_date       TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source   VARCHAR         NOT NULL,
    order_id        VARCHAR         NOT NULL,

    PRIMARY KEY (order_hk)
);

Link — Relationships

A Link captures the relationship between two or more Hubs.

SQL
-- link_order_customer (an order belongs to a customer)
CREATE TABLE raw_vault.link_order_customer (
    order_customer_hk   CHAR(32)        NOT NULL,   -- hash key of the link
    load_date           TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source       VARCHAR         NOT NULL,

    -- FKs to the hubs it connects
    order_hk            CHAR(32)        NOT NULL   REFERENCES hub_orders(order_hk),
    customer_hk         CHAR(32)        NOT NULL   REFERENCES hub_customers(customer_hk),

    PRIMARY KEY (order_customer_hk)
);
SQL
-- link_order_product (many-to-many: an order has many products)
CREATE TABLE raw_vault.link_order_product (
    order_product_hk    CHAR(32)        NOT NULL,
    load_date           TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source       VARCHAR         NOT NULL,
    order_hk            CHAR(32)        NOT NULL,
    product_hk          CHAR(32)        NOT NULL,
    PRIMARY KEY (order_product_hk)
);

Satellite — Descriptive Attributes

A Satellite stores descriptive context (attributes) that change over time. Every change creates a new row — full history automatically.

SQL
-- sat_customers_crm (attributes from the CRM system)
CREATE TABLE raw_vault.sat_customers_crm (
    customer_hk      CHAR(32)        NOT NULL   REFERENCES hub_customers,
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    load_end_date    TIMESTAMP_TZ,              -- NULL = current record
    record_source    VARCHAR         NOT NULL,
    hash_diff        CHAR(32)        NOT NULL,   -- hash of all attribute values (detect changes)

    -- Attributes from CRM
    email            VARCHAR,
    full_name        VARCHAR,
    city             VARCHAR,
    country_code     CHAR(3),
    phone_number     VARCHAR,

    PRIMARY KEY (customer_hk, load_date)
);

-- sat_customers_erp (attributes from the ERP system  separate satellite!)
CREATE TABLE raw_vault.sat_customers_erp (
    customer_hk      CHAR(32)        NOT NULL   REFERENCES hub_customers,
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    load_end_date    TIMESTAMP_TZ,
    record_source    VARCHAR         NOT NULL,
    hash_diff        CHAR(32)        NOT NULL,

    -- Attributes from ERP (different fields)
    credit_limit     DECIMAL(12, 2),
    payment_terms    INTEGER,
    account_manager  VARCHAR,

    PRIMARY KEY (customer_hk, load_date)
);

One Hub can have many Satellites — one per source system, or one per subject area.


3. Hash Keys

The hash key is the core identity mechanism:

SQL
-- Generate hash key in SQL (Snowflake)
MD5(UPPER(TRIM(customer_id)))   AS customer_hk

-- For composite keys (multiple columns)
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id)))   AS order_customer_hk

-- Hash diff (detect attribute changes)
MD5(
    COALESCE(UPPER(TRIM(email)), 'NULL')    || '|' ||
    COALESCE(UPPER(TRIM(full_name)), 'NULL') || '|' ||
    COALESCE(UPPER(TRIM(city)), 'NULL')
)   AS hash_diff
SQL
-- In dbt using dbt_vault or AutomateDV
{{ automate_dv.hash('customer_id') }}                         AS customer_hk
{{ automate_dv.hash(['order_id', 'customer_id']) }}           AS order_customer_hk
{{ automate_dv.hash_diff(['email', 'full_name', 'city']) }}   AS hash_diff

4. Loading Data Vault Tables

All Data Vault loads are INSERT-ONLY — no updates, no deletes. History is preserved automatically.

Loading a Hub (insert new keys only)

SQL
INSERT INTO raw_vault.hub_customers (customer_hk, load_date, record_source, customer_id)
SELECT DISTINCT
    MD5(UPPER(TRIM(customer_id)))    AS customer_hk,
    CURRENT_TIMESTAMP                AS load_date,
    'orders_api'                     AS record_source,
    customer_id
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.hub_customers h
    WHERE h.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
);

Loading a Link (insert new relationships only)

SQL
INSERT INTO raw_vault.link_order_customer
SELECT DISTINCT
    MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk,
    CURRENT_TIMESTAMP   AS load_date,
    'orders_api'        AS record_source,
    MD5(UPPER(TRIM(order_id)))      AS order_hk,
    MD5(UPPER(TRIM(customer_id)))   AS customer_hk
FROM staging_orders s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.link_order_customer l
    WHERE l.order_customer_hk = MD5(UPPER(TRIM(s.order_id)) || '|' || UPPER(TRIM(s.customer_id)))
);

Loading a Satellite (insert only if changed)

SQL
INSERT INTO raw_vault.sat_customers_crm (
    customer_hk, load_date, record_source, hash_diff,
    email, full_name, city, country_code
)
SELECT
    MD5(UPPER(TRIM(customer_id)))                                AS customer_hk,
    CURRENT_TIMESTAMP                                            AS load_date,
    'crm_system'                                                 AS record_source,
    MD5(COALESCE(email,'') || '|' || COALESCE(full_name,'') || '|' || COALESCE(city,''))
                                                                 AS hash_diff,
    email, full_name, city, country_code
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.sat_customers_crm cur
    WHERE cur.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
      AND cur.hash_diff = MD5(COALESCE(s.email,'') || '|' || ...)
      AND cur.load_end_date IS NULL   -- current record
);

5. Business Vault

The Business Vault sits between the Raw Vault and the presentation layer. It applies business rules without modifying the raw vault.

SQL
-- Business vault: customer with derived segment
CREATE TABLE business_vault.biz_customers AS
SELECT
    h.customer_hk,
    h.customer_id,
    s.email,
    s.full_name,
    s.city,
    s.country_code,
    -- Business rule applied here (not in raw vault)
    CASE
        WHEN total_revenue > 10000 THEN 'Gold'
        WHEN total_revenue > 1000  THEN 'Silver'
        ELSE 'Bronze'
    END AS customer_segment,
    s.load_date  AS valid_from
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_end_date IS NULL   -- current records only
LEFT JOIN (
    SELECT customer_hk, SUM(amount_usd) AS total_revenue
    FROM business_vault.biz_order_summary
    GROUP BY 1
) rev ON rev.customer_hk = h.customer_hk;

6. Information Mart (Presentation Layer)

The final layer that analysts query. Looks like dimensional modelling (star schema) built from the vault.

SQL
-- information_mart: dim_customers (built from business vault)
CREATE TABLE information_mart.dim_customers AS
SELECT
    customer_hk    AS customer_sk,
    customer_id,
    email,
    full_name,
    city,
    country_code,
    customer_segment,
    valid_from
FROM business_vault.biz_customers
WHERE valid_from <= CURRENT_TIMESTAMP;

7. Data Vault with dbt (AutomateDV / dbtvault)

Bash
pip install dbt-snowflake
# packages.yml
packages:
  - package: Datavault-UK/automate_dv
    version: [">=0.10.0"]
YAML
# models/raw_vault/hubs/hub_customers.yml
version: 2
models:
  - name: hub_customers
    config:
      materialized: incremental
      unique_key: customer_hk
SQL
-- models/raw_vault/hubs/hub_customers.sql
{{
    automate_dv.hub(
        src_pk='customer_hk',
        src_nk='customer_id',
        src_ldts='load_date',
        src_source='record_source',
        source_model='stg_customers'
    )
}}
SQL
-- models/raw_vault/satellites/sat_customers_crm.sql
{{
    automate_dv.sat(
        src_pk='customer_hk',
        src_hashdiff='hash_diff',
        src_payload=['email', 'full_name', 'city', 'country_code'],
        src_ldts='load_date',
        src_source='record_source',
        source_model='stg_customers_crm'
    )
}}

8. Query Patterns

SQL
-- Current state of a customer (latest satellite record)
SELECT h.customer_id, s.email, s.city, s.country_code
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_end_date IS NULL;

-- Full history for a customer
SELECT h.customer_id, s.email, s.city, s.load_date, s.load_end_date
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s ON h.customer_hk = s.customer_hk
WHERE h.customer_id = 'CUST-001'
ORDER BY s.load_date;

-- Point-in-time: customer attributes at a specific time
SELECT h.customer_id, s.email, s.city
FROM hub_customers h
JOIN sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_date   <= '2026-03-01 00:00:00'
    AND (s.load_end_date IS NULL OR s.load_end_date > '2026-03-01 00:00:00')
WHERE h.customer_id = 'CUST-001';

-- Cross-system: combine CRM and ERP data for a customer
SELECT
    h.customer_id,
    crm.email,
    crm.city,
    erp.credit_limit,
    erp.payment_terms
FROM hub_customers h
JOIN sat_customers_crm crm ON h.customer_hk = crm.customer_hk AND crm.load_end_date IS NULL
JOIN sat_customers_erp erp ON h.customer_hk = erp.customer_hk AND erp.load_end_date IS NULL;

Summary

| Component | Purpose | |-----------|---------| | Hub | Stores unique business keys — one hub per business concept | | Link | Stores relationships between hubs | | Satellite | Stores descriptive attributes — one per source system | | Hash key | MD5 of business key — the identity mechanism | | Hash diff | MD5 of attributes — detects changes for satellite loading | | Raw Vault | No business rules — exactly what came from the source | | Business Vault | Business rules applied on top of raw vault | | Information Mart | Star schema presentation layer for analysts | | Insert-only | All loads are append — full audit trail automatically |

Data Vault vs Kimball decision:

  • Choose Kimball for analytics-first, few sources, simpler stack
  • Choose Data Vault for enterprise, many sources, audit requirements, frequent schema changes

You now have a complete data modelling foundation. Next: Git workflow and CI/CD pipelines for data engineering teams.

Enjoyed this article?

Explore the Data Engineering learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.