Learnixo

Data Modelling · Lesson 3 of 3

Data Vault 2.0: Hubs, Links, Satellites, and Loading

What Is Data Vault?

Data Vault is a modelling methodology designed for enterprise data warehouses that must:

  • Integrate data from many sources (10, 50, 100+ systems)
  • Handle schema changes without rebuilding pipelines
  • Maintain a full audit trail of every record
  • Load data in parallel without dependencies between tables
  • Separate raw data storage from business logic

Invented by Dan Linstedt, the current version is Data Vault 2.0 (DV2.0).


1. Kimball vs Data Vault

| | Kimball (Dimensional) | Data Vault | |--|----------------------|-----------| | Design for | BI / analytics | Enterprise integration | | Sources | Few (3-10) | Many (10-100+) | | History | SCDs (Type 2) | Automatic — all data is append-only | | Schema changes | Hard (rebuild dims) | Easy (add satellites) | | Query complexity | Simple (star joins) | Complex (hub+sat joins) | | Loading | Dependent (dims before facts) | Parallel (all tables independent) | | Business logic | In ETL or models | Separated into Business Vault | | Best for | Analytics teams | Large enterprise DWH |

When to use Data Vault: enterprise with many sources, strong audit requirements, frequent schema changes, large team.

When to use Kimball: analytics-focused team, few sources, analysts query directly, simpler ops.


2. The Three Core Components

Hub — Business Keys

A Hub stores unique business keys for a business concept. Nothing else.

SQL
-- hub_customers
CREATE TABLE raw_vault.hub_customers (
    customer_hk      CHAR(32)        NOT NULL,   -- hash key = MD5(customer_id)
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source    VARCHAR         NOT NULL,   -- which system this came from
    customer_id      VARCHAR         NOT NULL,   -- the business key

    PRIMARY KEY (customer_hk)
);
SQL
-- hub_orders
CREATE TABLE raw_vault.hub_orders (
    order_hk        CHAR(32)        NOT NULL,
    load_date       TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source   VARCHAR         NOT NULL,
    order_id        VARCHAR         NOT NULL,

    PRIMARY KEY (order_hk)
);

Link — Relationships

A Link captures the relationship between two or more Hubs.

SQL
-- link_order_customer (an order belongs to a customer)
CREATE TABLE raw_vault.link_order_customer (
    order_customer_hk   CHAR(32)        NOT NULL,   -- hash key of the link
    load_date           TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source       VARCHAR         NOT NULL,

    -- FKs to the hubs it connects
    order_hk            CHAR(32)        NOT NULL   REFERENCES hub_orders(order_hk),
    customer_hk         CHAR(32)        NOT NULL   REFERENCES hub_customers(customer_hk),

    PRIMARY KEY (order_customer_hk)
);
SQL
-- link_order_product (many-to-many: an order has many products)
CREATE TABLE raw_vault.link_order_product (
    order_product_hk    CHAR(32)        NOT NULL,
    load_date           TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    record_source       VARCHAR         NOT NULL,
    order_hk            CHAR(32)        NOT NULL,
    product_hk          CHAR(32)        NOT NULL,
    PRIMARY KEY (order_product_hk)
);

Satellite — Descriptive Attributes

A Satellite stores descriptive context (attributes) that change over time. Every change creates a new row — full history automatically.

SQL
-- sat_customers_crm (attributes from the CRM system)
CREATE TABLE raw_vault.sat_customers_crm (
    customer_hk      CHAR(32)        NOT NULL   REFERENCES hub_customers,
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    load_end_date    TIMESTAMP_TZ,              -- NULL = current record
    record_source    VARCHAR         NOT NULL,
    hash_diff        CHAR(32)        NOT NULL,   -- hash of all attribute values (detect changes)

    -- Attributes from CRM
    email            VARCHAR,
    full_name        VARCHAR,
    city             VARCHAR,
    country_code     CHAR(3),
    phone_number     VARCHAR,

    PRIMARY KEY (customer_hk, load_date)
);

-- sat_customers_erp (attributes from the ERP system  separate satellite!)
CREATE TABLE raw_vault.sat_customers_erp (
    customer_hk      CHAR(32)        NOT NULL   REFERENCES hub_customers,
    load_date        TIMESTAMP_TZ    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    load_end_date    TIMESTAMP_TZ,
    record_source    VARCHAR         NOT NULL,
    hash_diff        CHAR(32)        NOT NULL,

    -- Attributes from ERP (different fields)
    credit_limit     DECIMAL(12, 2),
    payment_terms    INTEGER,
    account_manager  VARCHAR,

    PRIMARY KEY (customer_hk, load_date)
);

One Hub can have many Satellites — one per source system, or one per subject area.


3. Hash Keys

The hash key is the core identity mechanism:

SQL
-- Generate hash key in SQL (Snowflake)
MD5(UPPER(TRIM(customer_id)))   AS customer_hk

-- For composite keys (multiple columns)
MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id)))   AS order_customer_hk

-- Hash diff (detect attribute changes)
MD5(
    COALESCE(UPPER(TRIM(email)), 'NULL')    || '|' ||
    COALESCE(UPPER(TRIM(full_name)), 'NULL') || '|' ||
    COALESCE(UPPER(TRIM(city)), 'NULL')
)   AS hash_diff
SQL
-- In dbt using dbt_vault or AutomateDV
{{ automate_dv.hash('customer_id') }}                         AS customer_hk
{{ automate_dv.hash(['order_id', 'customer_id']) }}           AS order_customer_hk
{{ automate_dv.hash_diff(['email', 'full_name', 'city']) }}   AS hash_diff

4. Loading Data Vault Tables

All Data Vault loads are INSERT-ONLY — no updates, no deletes. History is preserved automatically.

Loading a Hub (insert new keys only)

SQL
INSERT INTO raw_vault.hub_customers (customer_hk, load_date, record_source, customer_id)
SELECT DISTINCT
    MD5(UPPER(TRIM(customer_id)))    AS customer_hk,
    CURRENT_TIMESTAMP                AS load_date,
    'orders_api'                     AS record_source,
    customer_id
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.hub_customers h
    WHERE h.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
);

Loading a Link (insert new relationships only)

SQL
INSERT INTO raw_vault.link_order_customer
SELECT DISTINCT
    MD5(UPPER(TRIM(order_id)) || '|' || UPPER(TRIM(customer_id))) AS order_customer_hk,
    CURRENT_TIMESTAMP   AS load_date,
    'orders_api'        AS record_source,
    MD5(UPPER(TRIM(order_id)))      AS order_hk,
    MD5(UPPER(TRIM(customer_id)))   AS customer_hk
FROM staging_orders s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.link_order_customer l
    WHERE l.order_customer_hk = MD5(UPPER(TRIM(s.order_id)) || '|' || UPPER(TRIM(s.customer_id)))
);

Loading a Satellite (insert only if changed)

SQL
INSERT INTO raw_vault.sat_customers_crm (
    customer_hk, load_date, record_source, hash_diff,
    email, full_name, city, country_code
)
SELECT
    MD5(UPPER(TRIM(customer_id)))                                AS customer_hk,
    CURRENT_TIMESTAMP                                            AS load_date,
    'crm_system'                                                 AS record_source,
    MD5(COALESCE(email,'') || '|' || COALESCE(full_name,'') || '|' || COALESCE(city,''))
                                                                 AS hash_diff,
    email, full_name, city, country_code
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.sat_customers_crm cur
    WHERE cur.customer_hk = MD5(UPPER(TRIM(s.customer_id)))
      AND cur.hash_diff = MD5(COALESCE(s.email,'') || '|' || ...)
      AND cur.load_end_date IS NULL   -- current record
);

5. Business Vault

The Business Vault sits between the Raw Vault and the presentation layer. It applies business rules without modifying the raw vault.

SQL
-- Business vault: customer with derived segment
CREATE TABLE business_vault.biz_customers AS
SELECT
    h.customer_hk,
    h.customer_id,
    s.email,
    s.full_name,
    s.city,
    s.country_code,
    -- Business rule applied here (not in raw vault)
    CASE
        WHEN total_revenue > 10000 THEN 'Gold'
        WHEN total_revenue > 1000  THEN 'Silver'
        ELSE 'Bronze'
    END AS customer_segment,
    s.load_date  AS valid_from
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_end_date IS NULL   -- current records only
LEFT JOIN (
    SELECT customer_hk, SUM(amount_usd) AS total_revenue
    FROM business_vault.biz_order_summary
    GROUP BY 1
) rev ON rev.customer_hk = h.customer_hk;

6. Information Mart (Presentation Layer)

The final layer that analysts query. Looks like dimensional modelling (star schema) built from the vault.

SQL
-- information_mart: dim_customers (built from business vault)
CREATE TABLE information_mart.dim_customers AS
SELECT
    customer_hk    AS customer_sk,
    customer_id,
    email,
    full_name,
    city,
    country_code,
    customer_segment,
    valid_from
FROM business_vault.biz_customers
WHERE valid_from <= CURRENT_TIMESTAMP;

7. Data Vault with dbt (AutomateDV / dbtvault)

Bash
pip install dbt-snowflake
# packages.yml
packages:
  - package: Datavault-UK/automate_dv
    version: [">=0.10.0"]
YAML
# models/raw_vault/hubs/hub_customers.yml
version: 2
models:
  - name: hub_customers
    config:
      materialized: incremental
      unique_key: customer_hk
SQL
-- models/raw_vault/hubs/hub_customers.sql
{{
    automate_dv.hub(
        src_pk='customer_hk',
        src_nk='customer_id',
        src_ldts='load_date',
        src_source='record_source',
        source_model='stg_customers'
    )
}}
SQL
-- models/raw_vault/satellites/sat_customers_crm.sql
{{
    automate_dv.sat(
        src_pk='customer_hk',
        src_hashdiff='hash_diff',
        src_payload=['email', 'full_name', 'city', 'country_code'],
        src_ldts='load_date',
        src_source='record_source',
        source_model='stg_customers_crm'
    )
}}

8. Query Patterns

SQL
-- Current state of a customer (latest satellite record)
SELECT h.customer_id, s.email, s.city, s.country_code
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_end_date IS NULL;

-- Full history for a customer
SELECT h.customer_id, s.email, s.city, s.load_date, s.load_end_date
FROM raw_vault.hub_customers h
JOIN raw_vault.sat_customers_crm s ON h.customer_hk = s.customer_hk
WHERE h.customer_id = 'CUST-001'
ORDER BY s.load_date;

-- Point-in-time: customer attributes at a specific time
SELECT h.customer_id, s.email, s.city
FROM hub_customers h
JOIN sat_customers_crm s
    ON h.customer_hk = s.customer_hk
    AND s.load_date   <= '2026-03-01 00:00:00'
    AND (s.load_end_date IS NULL OR s.load_end_date > '2026-03-01 00:00:00')
WHERE h.customer_id = 'CUST-001';

-- Cross-system: combine CRM and ERP data for a customer
SELECT
    h.customer_id,
    crm.email,
    crm.city,
    erp.credit_limit,
    erp.payment_terms
FROM hub_customers h
JOIN sat_customers_crm crm ON h.customer_hk = crm.customer_hk AND crm.load_end_date IS NULL
JOIN sat_customers_erp erp ON h.customer_hk = erp.customer_hk AND erp.load_end_date IS NULL;

Summary

| Component | Purpose | |-----------|---------| | Hub | Stores unique business keys — one hub per business concept | | Link | Stores relationships between hubs | | Satellite | Stores descriptive attributes — one per source system | | Hash key | MD5 of business key — the identity mechanism | | Hash diff | MD5 of attributes — detects changes for satellite loading | | Raw Vault | No business rules — exactly what came from the source | | Business Vault | Business rules applied on top of raw vault | | Information Mart | Star schema presentation layer for analysts | | Insert-only | All loads are append — full audit trail automatically |

Data Vault vs Kimball decision:

  • Choose Kimball for analytics-first, few sources, simpler stack
  • Choose Data Vault for enterprise, many sources, audit requirements, frequent schema changes

You now have a complete data modelling foundation. Next: Git workflow and CI/CD pipelines for data engineering teams.