Learnixo

Data Modelling · Lesson 1 of 3

Dimensional Modelling: Star Schema, Facts and Dimensions

Why Data Modelling Separates Good Engineers from Great Ones

You can write perfect SQL and build flawless pipelines and still deliver a data warehouse that analysts hate. Dimensional modelling is the discipline of designing warehouse schemas that are:

  • Easy to query — analysts can write SQL without needing to understand ETL
  • Performant — simple joins, pre-aggregated where needed
  • Consistent — same definitions used everywhere
  • Extensible — add new sources without breaking existing reports

The Kimball methodology (Ralph Kimball) is the industry standard for analytical data modelling.


1. OLTP vs OLAP

OLTP (Online Transaction Processing) — optimised for writes:

SQL
-- Normalised 3NF schema  good for writes, bad for analytics
orders  order_items  products  product_categories
        customers    addresses  countries

OLAP (Online Analytical Processing) — optimised for reads:

SQL
-- Denormalised star schema  fewer joins, faster analytics
fct_orders  dim_customers
            dim_products
            dim_date
            dim_geography

2. The Star Schema

A star schema has one central fact table surrounded by dimension tables:

                    dim_date
                       │
                       │
dim_customers ──── fct_orders ──── dim_products
                       │
                       │
                    dim_geography

It looks like a star. That's it. The fact table holds measurements; dimensions hold descriptive context.


3. Fact Tables

A fact table stores measurements (metrics, events) at a specific grain.

Grain — the most important decision

The grain is the exact meaning of one row in the fact table.

Bad grain definition: "Order information" Good grain definition: "One row per order line item — a single product on a single order"

SQL
-- fct_order_lines (grain: one row per product per order)
CREATE TABLE fct_order_lines (
    -- Surrogate keys (FK to dimensions)
    order_line_sk    INTEGER         NOT NULL,  -- surrogate key
    order_sk         INTEGER         NOT NULL,  -- FK  dim_orders
    product_sk       INTEGER         NOT NULL,  -- FK  dim_products
    customer_sk      INTEGER         NOT NULL,  -- FK  dim_customers
    date_sk          INTEGER         NOT NULL,  -- FK  dim_date
    geography_sk     INTEGER         NOT NULL,  -- FK  dim_geography

    -- Degenerate dimensions (no separate dim table needed)
    order_id         VARCHAR         NOT NULL,  -- source order ID
    line_number      INTEGER         NOT NULL,

    -- Facts (measures)
    quantity         INTEGER         NOT NULL,
    unit_price       DECIMAL(10, 2)  NOT NULL,
    discount_amount  DECIMAL(10, 2)  DEFAULT 0,
    line_total       DECIMAL(12, 2)  NOT NULL,  -- quantity * unit_price - discount

    -- Audit
    _loaded_at       TIMESTAMP_TZ    DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (order_line_sk)
);

Types of facts

| Type | Description | Example | |------|-------------|---------| | Additive | Sum across all dimensions | Revenue, quantity | | Semi-additive | Sum across some dimensions (not date) | Account balance (don't sum across dates) | | Non-additive | Never summed | Ratios, percentages, prices |

SQL
-- Additive: total revenue across any combination of dimensions
SELECT SUM(line_total) FROM fct_order_lines WHERE date_sk IN (...)

-- Semi-additive: account balance  latest value, not sum
SELECT AVG(balance) FROM fct_account_balances GROUP BY account_id  -- OK
SELECT SUM(balance) FROM fct_account_balances  -- WRONG: sums history

-- Non-additive: margin %  calculate from additive components
SELECT SUM(revenue) / SUM(cost) - 1 AS margin_pct  -- CORRECT
SELECT AVG(margin_pct)  -- WRONG: average of percentages

Fact table types

| Type | Grain | Example | |------|-------|---------| | Transaction | One event | Each order line | | Periodic snapshot | State at regular intervals | Daily inventory count | | Accumulating snapshot | One row tracks lifecycle | Order from placed → shipped → delivered |


4. Dimension Tables

Dimension tables hold the descriptive context for facts. They have wide rows (many columns) and fewer rows than fact tables.

SQL
-- dim_customers
CREATE TABLE dim_customers (
    customer_sk     INTEGER         NOT NULL,   -- surrogate key (PK)
    customer_id     VARCHAR         NOT NULL,   -- natural/business key
    full_name       VARCHAR,
    email           VARCHAR,
    city            VARCHAR,
    country_code    CHAR(3),
    country_name    VARCHAR,
    customer_segment VARCHAR,                   -- Bronze / Silver / Gold
    acquisition_channel VARCHAR,
    first_order_date DATE,
    is_active       BOOLEAN         DEFAULT TRUE,

    -- SCD Type 2 fields (if tracking history)
    valid_from      TIMESTAMP_TZ,
    valid_to        TIMESTAMP_TZ,
    is_current      BOOLEAN         DEFAULT TRUE,

    PRIMARY KEY (customer_sk)
);
SQL
-- dim_products
CREATE TABLE dim_products (
    product_sk          INTEGER         NOT NULL,
    product_id          VARCHAR         NOT NULL,
    product_name        VARCHAR         NOT NULL,
    sku                 VARCHAR,
    category            VARCHAR,
    subcategory         VARCHAR,
    brand               VARCHAR,
    unit_cost           DECIMAL(10, 2),
    unit_price          DECIMAL(10, 2),
    is_active           BOOLEAN,
    PRIMARY KEY (product_sk)
);

5. The Date Dimension

The date dimension is the most important dimension. Never join on DATE_TRUNC — pre-build it.

SQL
-- dim_date (populated once, covers years ahead)
CREATE TABLE dim_date (
    date_sk         INTEGER         NOT NULL,   -- surrogate: YYYYMMDD
    date_actual     DATE            NOT NULL,
    day_of_week     INTEGER,        -- 1=Monday, 7=Sunday
    day_name        VARCHAR(9),     -- 'Monday'
    day_of_month    INTEGER,
    day_of_year     INTEGER,
    week_number     INTEGER,
    month_number    INTEGER,
    month_name      VARCHAR(9),
    quarter         INTEGER,        -- 1-4
    year            INTEGER,
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN,        -- populate manually per country
    fiscal_year     INTEGER,        -- if fiscal year  calendar year
    fiscal_quarter  INTEGER,

    PRIMARY KEY (date_sk)
);

-- Populate with a recursive CTE or date spine
INSERT INTO dim_date
WITH RECURSIVE dates AS (
    SELECT '2020-01-01'::DATE AS d
    UNION ALL
    SELECT DATEADD('day', 1, d) FROM dates WHERE d < '2030-12-31'
)
SELECT
    TO_NUMBER(TO_VARCHAR(d, 'YYYYMMDD'))    AS date_sk,
    d                                        AS date_actual,
    DAYOFWEEKISO(d)                          AS day_of_week,
    DAYNAME(d)                               AS day_name,
    DAY(d)                                   AS day_of_month,
    DAYOFYEAR(d)                             AS day_of_year,
    WEEKOFYEAR(d)                            AS week_number,
    MONTH(d)                                 AS month_number,
    MONTHNAME(d)                             AS month_name,
    QUARTER(d)                               AS quarter,
    YEAR(d)                                  AS year,
    DAYOFWEEKISO(d) >= 6                     AS is_weekend,
    FALSE                                    AS is_holiday
FROM dates;

6. Surrogate Keys vs Natural Keys

Natural key: the business identifier (order_id = ORD-00123, customer_id = CUST-456) Surrogate key: a system-generated integer (1, 2, 3, ...) used as the warehouse PK

| | Natural Key | Surrogate Key | |--|-------------|--------------| | Size | Variable (string) | Small (integer) | | Stability | Can change in source | Never changes | | SCD support | Hard (key is business identity) | Easy (one SK per version) | | Join performance | Slower (string comparison) | Faster (integer comparison) | | Independence from source | No | Yes |

Always use surrogate keys in dimension tables. Keep the natural key as an attribute.

SQL
-- Generate surrogate key in dbt
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_sk

-- Or: use an incrementing sequence (Snowflake)
customer_id.nextval AS customer_sk

-- Or: hash the business key (most portable)
MD5(customer_id) AS customer_sk

7. Conformed Dimensions

A conformed dimension is shared across multiple fact tables. This is what allows you to "drill across" — joining two fact tables through a shared dimension.

fct_orders ──── dim_customers ──── fct_support_tickets

If dim_customers means the same thing in orders and support tickets, analysts can combine:

SQL
-- Drill across: revenue and ticket count per customer segment
SELECT
    c.customer_segment,
    SUM(o.line_total)     AS total_revenue,
    COUNT(t.ticket_id)    AS ticket_count
FROM dim_customers      c
LEFT JOIN fct_order_lines  o ON o.customer_sk = c.customer_sk
LEFT JOIN fct_support_tickets t ON t.customer_sk = c.customer_sk
WHERE c.is_current = TRUE
GROUP BY 1
ORDER BY 2 DESC

8. Snowflake Schema vs Star Schema

A snowflake schema normalises dimensions further:

fct_orders → dim_products → dim_product_categories → dim_departments

vs star schema:

fct_orders → dim_products (includes category, department as columns)

Prefer star schemas for analytics:

  • Fewer joins = faster queries
  • Simpler for analysts
  • Only snowflake if dimensions are very large and heavily duplicated

9. Junk Dimensions

When you have many low-cardinality flags/codes that don't belong in any specific dimension:

SQL
-- Instead of many flag columns in fact table:
-- is_mobile, is_weekend_order, is_first_order, is_promotion_applied, payment_type

-- Create a junk dimension:
CREATE TABLE dim_order_flags (
    flags_sk            INTEGER         NOT NULL,
    is_mobile           BOOLEAN,
    is_weekend_order    BOOLEAN,
    is_first_order      BOOLEAN,
    is_promotion        BOOLEAN,
    payment_type        VARCHAR,        -- credit_card, bank_transfer, etc.
    PRIMARY KEY (flags_sk)
);
-- Fact table then has: flags_sk  FK to dim_order_flags

10. Complete Star Schema Example

SQL
-- Full star schema for e-commerce analytics

-- Dimensions
CREATE TABLE dim_date       (...);   -- date dimension
CREATE TABLE dim_customers  (...);   -- customer attributes
CREATE TABLE dim_products   (...);   -- product catalog
CREATE TABLE dim_geography  (        -- shipping destination
    geography_sk    INTEGER         NOT NULL,
    city            VARCHAR,
    region          VARCHAR,
    country_code    CHAR(3),
    country_name    VARCHAR,
    continent       VARCHAR,
    PRIMARY KEY (geography_sk)
);
CREATE TABLE dim_promotions  (
    promotion_sk    INTEGER         NOT NULL,
    promotion_id    VARCHAR,
    promotion_name  VARCHAR,
    discount_type   VARCHAR,        -- percentage / fixed
    discount_value  DECIMAL(8, 2),
    start_date      DATE,
    end_date        DATE,
    PRIMARY KEY (promotion_sk)
);

-- Central fact table
CREATE TABLE fct_order_lines (
    order_line_sk   INTEGER         NOT NULL,
    date_sk         INTEGER         NOT NULL   REFERENCES dim_date,
    customer_sk     INTEGER         NOT NULL   REFERENCES dim_customers,
    product_sk      INTEGER         NOT NULL   REFERENCES dim_products,
    geography_sk    INTEGER                    REFERENCES dim_geography,
    promotion_sk    INTEGER                    REFERENCES dim_promotions,   -- NULL if no promo

    -- Degenerate dimensions
    order_id        VARCHAR         NOT NULL,
    line_number     INTEGER         NOT NULL,

    -- Measures
    quantity        INTEGER         NOT NULL,
    unit_price      DECIMAL(10, 2)  NOT NULL,
    discount_amount DECIMAL(10, 2)  DEFAULT 0,
    line_total      DECIMAL(12, 2)  NOT NULL,
    cost_price      DECIMAL(10, 2),
    gross_margin    DECIMAL(12, 2),

    PRIMARY KEY (order_line_sk)
);

Summary

| Concept | Key Point | |---------|----------| | Grain | Most important decision — one row means exactly what? | | Fact table | Measurements at the grain level | | Dimension table | Descriptive attributes for filtering/grouping | | Surrogate key | Integer PK, generated in warehouse, never from source | | Date dimension | Pre-build it — never join on DATE_TRUNC | | Conformed dimensions | Shared across fact tables — enables drill-across | | Star schema | Prefer over snowflake for analytics | | Additive facts | Sum across all dims — most common | | Semi-additive | Sum across some dims, not time (balances) | | Non-additive | Calculate from additive components |

Next: Slowly Changing Dimensions — tracking how data changes over time.