Back to blog
Data Engineeringintermediate

Dimensional Data Modelling: Star Schema, Facts, and Dimensions

Master dimensional modelling — star and snowflake schemas, fact tables, dimension tables, grain, surrogate keys, conformed dimensions, and the design principles used in real data warehouses.

LearnixoMay 7, 20268 min read
data modellingdimensional modellingstar schemafact tablesdata warehouseKimball
Share:š•

Why Data Modelling Separates Good Engineers from Great Ones

You can write perfect SQL and build flawless pipelines and still deliver a data warehouse that analysts hate. Dimensional modelling is the discipline of designing warehouse schemas that are:

  • Easy to query — analysts can write SQL without needing to understand ETL
  • Performant — simple joins, pre-aggregated where needed
  • Consistent — same definitions used everywhere
  • Extensible — add new sources without breaking existing reports

The Kimball methodology (Ralph Kimball) is the industry standard for analytical data modelling.


1. OLTP vs OLAP

OLTP (Online Transaction Processing) — optimised for writes:

SQL
-- Normalised 3NF schema — good for writes, bad for analytics
orders → order_items → products → product_categories
       → customers   → addresses → countries

OLAP (Online Analytical Processing) — optimised for reads:

SQL
-- Denormalised star schema — fewer joins, faster analytics
fct_orders → dim_customers
           → dim_products
           → dim_date
           → dim_geography

2. The Star Schema

A star schema has one central fact table surrounded by dimension tables:

                    dim_date
                       │
                       │
dim_customers ──── fct_orders ──── dim_products
                       │
                       │
                    dim_geography

It looks like a star. That's it. The fact table holds measurements; dimensions hold descriptive context.


3. Fact Tables

A fact table stores measurements (metrics, events) at a specific grain.

Grain — the most important decision

The grain is the exact meaning of one row in the fact table.

Bad grain definition: "Order information" Good grain definition: "One row per order line item — a single product on a single order"

SQL
-- fct_order_lines (grain: one row per product per order)
CREATE TABLE fct_order_lines (
    -- Surrogate keys (FK to dimensions)
    order_line_sk    INTEGER         NOT NULL,  -- surrogate key
    order_sk         INTEGER         NOT NULL,  -- FK → dim_orders
    product_sk       INTEGER         NOT NULL,  -- FK → dim_products
    customer_sk      INTEGER         NOT NULL,  -- FK → dim_customers
    date_sk          INTEGER         NOT NULL,  -- FK → dim_date
    geography_sk     INTEGER         NOT NULL,  -- FK → dim_geography

    -- Degenerate dimensions (no separate dim table needed)
    order_id         VARCHAR         NOT NULL,  -- source order ID
    line_number      INTEGER         NOT NULL,

    -- Facts (measures)
    quantity         INTEGER         NOT NULL,
    unit_price       DECIMAL(10, 2)  NOT NULL,
    discount_amount  DECIMAL(10, 2)  DEFAULT 0,
    line_total       DECIMAL(12, 2)  NOT NULL,  -- quantity * unit_price - discount

    -- Audit
    _loaded_at       TIMESTAMP_TZ    DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (order_line_sk)
);

Types of facts

| Type | Description | Example | |------|-------------|---------| | Additive | Sum across all dimensions | Revenue, quantity | | Semi-additive | Sum across some dimensions (not date) | Account balance (don't sum across dates) | | Non-additive | Never summed | Ratios, percentages, prices |

SQL
-- Additive: total revenue across any combination of dimensions
SELECT SUM(line_total) FROM fct_order_lines WHERE date_sk IN (...)

-- Semi-additive: account balance — latest value, not sum
SELECT AVG(balance) FROM fct_account_balances GROUP BY account_id  -- OK
SELECT SUM(balance) FROM fct_account_balances  -- WRONG: sums history

-- Non-additive: margin % — calculate from additive components
SELECT SUM(revenue) / SUM(cost) - 1 AS margin_pct  -- CORRECT
SELECT AVG(margin_pct)  -- WRONG: average of percentages

Fact table types

| Type | Grain | Example | |------|-------|---------| | Transaction | One event | Each order line | | Periodic snapshot | State at regular intervals | Daily inventory count | | Accumulating snapshot | One row tracks lifecycle | Order from placed → shipped → delivered |


4. Dimension Tables

Dimension tables hold the descriptive context for facts. They have wide rows (many columns) and fewer rows than fact tables.

SQL
-- dim_customers
CREATE TABLE dim_customers (
    customer_sk     INTEGER         NOT NULL,   -- surrogate key (PK)
    customer_id     VARCHAR         NOT NULL,   -- natural/business key
    full_name       VARCHAR,
    email           VARCHAR,
    city            VARCHAR,
    country_code    CHAR(3),
    country_name    VARCHAR,
    customer_segment VARCHAR,                   -- Bronze / Silver / Gold
    acquisition_channel VARCHAR,
    first_order_date DATE,
    is_active       BOOLEAN         DEFAULT TRUE,

    -- SCD Type 2 fields (if tracking history)
    valid_from      TIMESTAMP_TZ,
    valid_to        TIMESTAMP_TZ,
    is_current      BOOLEAN         DEFAULT TRUE,

    PRIMARY KEY (customer_sk)
);
SQL
-- dim_products
CREATE TABLE dim_products (
    product_sk          INTEGER         NOT NULL,
    product_id          VARCHAR         NOT NULL,
    product_name        VARCHAR         NOT NULL,
    sku                 VARCHAR,
    category            VARCHAR,
    subcategory         VARCHAR,
    brand               VARCHAR,
    unit_cost           DECIMAL(10, 2),
    unit_price          DECIMAL(10, 2),
    is_active           BOOLEAN,
    PRIMARY KEY (product_sk)
);

5. The Date Dimension

The date dimension is the most important dimension. Never join on DATE_TRUNC — pre-build it.

SQL
-- dim_date (populated once, covers years ahead)
CREATE TABLE dim_date (
    date_sk         INTEGER         NOT NULL,   -- surrogate: YYYYMMDD
    date_actual     DATE            NOT NULL,
    day_of_week     INTEGER,        -- 1=Monday, 7=Sunday
    day_name        VARCHAR(9),     -- 'Monday'
    day_of_month    INTEGER,
    day_of_year     INTEGER,
    week_number     INTEGER,
    month_number    INTEGER,
    month_name      VARCHAR(9),
    quarter         INTEGER,        -- 1-4
    year            INTEGER,
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN,        -- populate manually per country
    fiscal_year     INTEGER,        -- if fiscal year ≠ calendar year
    fiscal_quarter  INTEGER,

    PRIMARY KEY (date_sk)
);

-- Populate with a recursive CTE or date spine
INSERT INTO dim_date
WITH RECURSIVE dates AS (
    SELECT '2020-01-01'::DATE AS d
    UNION ALL
    SELECT DATEADD('day', 1, d) FROM dates WHERE d < '2030-12-31'
)
SELECT
    TO_NUMBER(TO_VARCHAR(d, 'YYYYMMDD'))    AS date_sk,
    d                                        AS date_actual,
    DAYOFWEEKISO(d)                          AS day_of_week,
    DAYNAME(d)                               AS day_name,
    DAY(d)                                   AS day_of_month,
    DAYOFYEAR(d)                             AS day_of_year,
    WEEKOFYEAR(d)                            AS week_number,
    MONTH(d)                                 AS month_number,
    MONTHNAME(d)                             AS month_name,
    QUARTER(d)                               AS quarter,
    YEAR(d)                                  AS year,
    DAYOFWEEKISO(d) >= 6                     AS is_weekend,
    FALSE                                    AS is_holiday
FROM dates;

6. Surrogate Keys vs Natural Keys

Natural key: the business identifier (order_id = ORD-00123, customer_id = CUST-456) Surrogate key: a system-generated integer (1, 2, 3, ...) used as the warehouse PK

| | Natural Key | Surrogate Key | |--|-------------|--------------| | Size | Variable (string) | Small (integer) | | Stability | Can change in source | Never changes | | SCD support | Hard (key is business identity) | Easy (one SK per version) | | Join performance | Slower (string comparison) | Faster (integer comparison) | | Independence from source | No | Yes |

Always use surrogate keys in dimension tables. Keep the natural key as an attribute.

SQL
-- Generate surrogate key in dbt
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_sk

-- Or: use an incrementing sequence (Snowflake)
customer_id.nextval AS customer_sk

-- Or: hash the business key (most portable)
MD5(customer_id) AS customer_sk

7. Conformed Dimensions

A conformed dimension is shared across multiple fact tables. This is what allows you to "drill across" — joining two fact tables through a shared dimension.

fct_orders ──── dim_customers ──── fct_support_tickets

If dim_customers means the same thing in orders and support tickets, analysts can combine:

SQL
-- Drill across: revenue and ticket count per customer segment
SELECT
    c.customer_segment,
    SUM(o.line_total)     AS total_revenue,
    COUNT(t.ticket_id)    AS ticket_count
FROM dim_customers      c
LEFT JOIN fct_order_lines  o ON o.customer_sk = c.customer_sk
LEFT JOIN fct_support_tickets t ON t.customer_sk = c.customer_sk
WHERE c.is_current = TRUE
GROUP BY 1
ORDER BY 2 DESC

8. Snowflake Schema vs Star Schema

A snowflake schema normalises dimensions further:

fct_orders → dim_products → dim_product_categories → dim_departments

vs star schema:

fct_orders → dim_products (includes category, department as columns)

Prefer star schemas for analytics:

  • Fewer joins = faster queries
  • Simpler for analysts
  • Only snowflake if dimensions are very large and heavily duplicated

9. Junk Dimensions

When you have many low-cardinality flags/codes that don't belong in any specific dimension:

SQL
-- Instead of many flag columns in fact table:
-- is_mobile, is_weekend_order, is_first_order, is_promotion_applied, payment_type

-- Create a junk dimension:
CREATE TABLE dim_order_flags (
    flags_sk            INTEGER         NOT NULL,
    is_mobile           BOOLEAN,
    is_weekend_order    BOOLEAN,
    is_first_order      BOOLEAN,
    is_promotion        BOOLEAN,
    payment_type        VARCHAR,        -- credit_card, bank_transfer, etc.
    PRIMARY KEY (flags_sk)
);
-- Fact table then has: flags_sk → FK to dim_order_flags

10. Complete Star Schema Example

SQL
-- Full star schema for e-commerce analytics

-- Dimensions
CREATE TABLE dim_date       (...);   -- date dimension
CREATE TABLE dim_customers  (...);   -- customer attributes
CREATE TABLE dim_products   (...);   -- product catalog
CREATE TABLE dim_geography  (        -- shipping destination
    geography_sk    INTEGER         NOT NULL,
    city            VARCHAR,
    region          VARCHAR,
    country_code    CHAR(3),
    country_name    VARCHAR,
    continent       VARCHAR,
    PRIMARY KEY (geography_sk)
);
CREATE TABLE dim_promotions  (
    promotion_sk    INTEGER         NOT NULL,
    promotion_id    VARCHAR,
    promotion_name  VARCHAR,
    discount_type   VARCHAR,        -- percentage / fixed
    discount_value  DECIMAL(8, 2),
    start_date      DATE,
    end_date        DATE,
    PRIMARY KEY (promotion_sk)
);

-- Central fact table
CREATE TABLE fct_order_lines (
    order_line_sk   INTEGER         NOT NULL,
    date_sk         INTEGER         NOT NULL   REFERENCES dim_date,
    customer_sk     INTEGER         NOT NULL   REFERENCES dim_customers,
    product_sk      INTEGER         NOT NULL   REFERENCES dim_products,
    geography_sk    INTEGER                    REFERENCES dim_geography,
    promotion_sk    INTEGER                    REFERENCES dim_promotions,   -- NULL if no promo

    -- Degenerate dimensions
    order_id        VARCHAR         NOT NULL,
    line_number     INTEGER         NOT NULL,

    -- Measures
    quantity        INTEGER         NOT NULL,
    unit_price      DECIMAL(10, 2)  NOT NULL,
    discount_amount DECIMAL(10, 2)  DEFAULT 0,
    line_total      DECIMAL(12, 2)  NOT NULL,
    cost_price      DECIMAL(10, 2),
    gross_margin    DECIMAL(12, 2),

    PRIMARY KEY (order_line_sk)
);

Summary

| Concept | Key Point | |---------|----------| | Grain | Most important decision — one row means exactly what? | | Fact table | Measurements at the grain level | | Dimension table | Descriptive attributes for filtering/grouping | | Surrogate key | Integer PK, generated in warehouse, never from source | | Date dimension | Pre-build it — never join on DATE_TRUNC | | Conformed dimensions | Shared across fact tables — enables drill-across | | Star schema | Prefer over snowflake for analytics | | Additive facts | Sum across all dims — most common | | Semi-additive | Sum across some dims, not time (balances) | | Non-additive | Calculate from additive components |

Next: Slowly Changing Dimensions — tracking how data changes over time.

Enjoyed this article?

Explore the Data Engineering learning path for more.

Found this helpful?

Share:š•

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.