Data Modelling · Lesson 1 of 3
Dimensional Modelling: Star Schema, Facts and Dimensions
Why Data Modelling Separates Good Engineers from Great Ones
You can write perfect SQL and build flawless pipelines and still deliver a data warehouse that analysts hate. Dimensional modelling is the discipline of designing warehouse schemas that are:
- Easy to query — analysts can write SQL without needing to understand ETL
- Performant — simple joins, pre-aggregated where needed
- Consistent — same definitions used everywhere
- Extensible — add new sources without breaking existing reports
The Kimball methodology (Ralph Kimball) is the industry standard for analytical data modelling.
1. OLTP vs OLAP
OLTP (Online Transaction Processing) — optimised for writes:
-- Normalised 3NF schema — good for writes, bad for analytics
orders → order_items → products → product_categories
→ customers → addresses → countriesOLAP (Online Analytical Processing) — optimised for reads:
-- Denormalised star schema — fewer joins, faster analytics
fct_orders → dim_customers
→ dim_products
→ dim_date
→ dim_geography2. The Star Schema
A star schema has one central fact table surrounded by dimension tables:
dim_date
│
│
dim_customers ──── fct_orders ──── dim_products
│
│
dim_geographyIt looks like a star. That's it. The fact table holds measurements; dimensions hold descriptive context.
3. Fact Tables
A fact table stores measurements (metrics, events) at a specific grain.
Grain — the most important decision
The grain is the exact meaning of one row in the fact table.
Bad grain definition: "Order information" Good grain definition: "One row per order line item — a single product on a single order"
-- fct_order_lines (grain: one row per product per order)
CREATE TABLE fct_order_lines (
-- Surrogate keys (FK to dimensions)
order_line_sk INTEGER NOT NULL, -- surrogate key
order_sk INTEGER NOT NULL, -- FK → dim_orders
product_sk INTEGER NOT NULL, -- FK → dim_products
customer_sk INTEGER NOT NULL, -- FK → dim_customers
date_sk INTEGER NOT NULL, -- FK → dim_date
geography_sk INTEGER NOT NULL, -- FK → dim_geography
-- Degenerate dimensions (no separate dim table needed)
order_id VARCHAR NOT NULL, -- source order ID
line_number INTEGER NOT NULL,
-- Facts (measures)
quantity INTEGER NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
line_total DECIMAL(12, 2) NOT NULL, -- quantity * unit_price - discount
-- Audit
_loaded_at TIMESTAMP_TZ DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (order_line_sk)
);Types of facts
| Type | Description | Example | |------|-------------|---------| | Additive | Sum across all dimensions | Revenue, quantity | | Semi-additive | Sum across some dimensions (not date) | Account balance (don't sum across dates) | | Non-additive | Never summed | Ratios, percentages, prices |
-- Additive: total revenue across any combination of dimensions
SELECT SUM(line_total) FROM fct_order_lines WHERE date_sk IN (...)
-- Semi-additive: account balance — latest value, not sum
SELECT AVG(balance) FROM fct_account_balances GROUP BY account_id -- OK
SELECT SUM(balance) FROM fct_account_balances -- WRONG: sums history
-- Non-additive: margin % — calculate from additive components
SELECT SUM(revenue) / SUM(cost) - 1 AS margin_pct -- CORRECT
SELECT AVG(margin_pct) -- WRONG: average of percentagesFact table types
| Type | Grain | Example | |------|-------|---------| | Transaction | One event | Each order line | | Periodic snapshot | State at regular intervals | Daily inventory count | | Accumulating snapshot | One row tracks lifecycle | Order from placed → shipped → delivered |
4. Dimension Tables
Dimension tables hold the descriptive context for facts. They have wide rows (many columns) and fewer rows than fact tables.
-- dim_customers
CREATE TABLE dim_customers (
customer_sk INTEGER NOT NULL, -- surrogate key (PK)
customer_id VARCHAR NOT NULL, -- natural/business key
full_name VARCHAR,
email VARCHAR,
city VARCHAR,
country_code CHAR(3),
country_name VARCHAR,
customer_segment VARCHAR, -- Bronze / Silver / Gold
acquisition_channel VARCHAR,
first_order_date DATE,
is_active BOOLEAN DEFAULT TRUE,
-- SCD Type 2 fields (if tracking history)
valid_from TIMESTAMP_TZ,
valid_to TIMESTAMP_TZ,
is_current BOOLEAN DEFAULT TRUE,
PRIMARY KEY (customer_sk)
);-- dim_products
CREATE TABLE dim_products (
product_sk INTEGER NOT NULL,
product_id VARCHAR NOT NULL,
product_name VARCHAR NOT NULL,
sku VARCHAR,
category VARCHAR,
subcategory VARCHAR,
brand VARCHAR,
unit_cost DECIMAL(10, 2),
unit_price DECIMAL(10, 2),
is_active BOOLEAN,
PRIMARY KEY (product_sk)
);5. The Date Dimension
The date dimension is the most important dimension. Never join on DATE_TRUNC — pre-build it.
-- dim_date (populated once, covers years ahead)
CREATE TABLE dim_date (
date_sk INTEGER NOT NULL, -- surrogate: YYYYMMDD
date_actual DATE NOT NULL,
day_of_week INTEGER, -- 1=Monday, 7=Sunday
day_name VARCHAR(9), -- 'Monday'
day_of_month INTEGER,
day_of_year INTEGER,
week_number INTEGER,
month_number INTEGER,
month_name VARCHAR(9),
quarter INTEGER, -- 1-4
year INTEGER,
is_weekend BOOLEAN,
is_holiday BOOLEAN, -- populate manually per country
fiscal_year INTEGER, -- if fiscal year ≠ calendar year
fiscal_quarter INTEGER,
PRIMARY KEY (date_sk)
);
-- Populate with a recursive CTE or date spine
INSERT INTO dim_date
WITH RECURSIVE dates AS (
SELECT '2020-01-01'::DATE AS d
UNION ALL
SELECT DATEADD('day', 1, d) FROM dates WHERE d < '2030-12-31'
)
SELECT
TO_NUMBER(TO_VARCHAR(d, 'YYYYMMDD')) AS date_sk,
d AS date_actual,
DAYOFWEEKISO(d) AS day_of_week,
DAYNAME(d) AS day_name,
DAY(d) AS day_of_month,
DAYOFYEAR(d) AS day_of_year,
WEEKOFYEAR(d) AS week_number,
MONTH(d) AS month_number,
MONTHNAME(d) AS month_name,
QUARTER(d) AS quarter,
YEAR(d) AS year,
DAYOFWEEKISO(d) >= 6 AS is_weekend,
FALSE AS is_holiday
FROM dates;6. Surrogate Keys vs Natural Keys
Natural key: the business identifier (order_id = ORD-00123, customer_id = CUST-456)
Surrogate key: a system-generated integer (1, 2, 3, ...) used as the warehouse PK
| | Natural Key | Surrogate Key | |--|-------------|--------------| | Size | Variable (string) | Small (integer) | | Stability | Can change in source | Never changes | | SCD support | Hard (key is business identity) | Easy (one SK per version) | | Join performance | Slower (string comparison) | Faster (integer comparison) | | Independence from source | No | Yes |
Always use surrogate keys in dimension tables. Keep the natural key as an attribute.
-- Generate surrogate key in dbt
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_sk
-- Or: use an incrementing sequence (Snowflake)
customer_id.nextval AS customer_sk
-- Or: hash the business key (most portable)
MD5(customer_id) AS customer_sk7. Conformed Dimensions
A conformed dimension is shared across multiple fact tables. This is what allows you to "drill across" — joining two fact tables through a shared dimension.
fct_orders ──── dim_customers ──── fct_support_ticketsIf dim_customers means the same thing in orders and support tickets, analysts can combine:
-- Drill across: revenue and ticket count per customer segment
SELECT
c.customer_segment,
SUM(o.line_total) AS total_revenue,
COUNT(t.ticket_id) AS ticket_count
FROM dim_customers c
LEFT JOIN fct_order_lines o ON o.customer_sk = c.customer_sk
LEFT JOIN fct_support_tickets t ON t.customer_sk = c.customer_sk
WHERE c.is_current = TRUE
GROUP BY 1
ORDER BY 2 DESC8. Snowflake Schema vs Star Schema
A snowflake schema normalises dimensions further:
fct_orders → dim_products → dim_product_categories → dim_departmentsvs star schema:
fct_orders → dim_products (includes category, department as columns)Prefer star schemas for analytics:
- Fewer joins = faster queries
- Simpler for analysts
- Only snowflake if dimensions are very large and heavily duplicated
9. Junk Dimensions
When you have many low-cardinality flags/codes that don't belong in any specific dimension:
-- Instead of many flag columns in fact table:
-- is_mobile, is_weekend_order, is_first_order, is_promotion_applied, payment_type
-- Create a junk dimension:
CREATE TABLE dim_order_flags (
flags_sk INTEGER NOT NULL,
is_mobile BOOLEAN,
is_weekend_order BOOLEAN,
is_first_order BOOLEAN,
is_promotion BOOLEAN,
payment_type VARCHAR, -- credit_card, bank_transfer, etc.
PRIMARY KEY (flags_sk)
);
-- Fact table then has: flags_sk → FK to dim_order_flags10. Complete Star Schema Example
-- Full star schema for e-commerce analytics
-- Dimensions
CREATE TABLE dim_date (...); -- date dimension
CREATE TABLE dim_customers (...); -- customer attributes
CREATE TABLE dim_products (...); -- product catalog
CREATE TABLE dim_geography ( -- shipping destination
geography_sk INTEGER NOT NULL,
city VARCHAR,
region VARCHAR,
country_code CHAR(3),
country_name VARCHAR,
continent VARCHAR,
PRIMARY KEY (geography_sk)
);
CREATE TABLE dim_promotions (
promotion_sk INTEGER NOT NULL,
promotion_id VARCHAR,
promotion_name VARCHAR,
discount_type VARCHAR, -- percentage / fixed
discount_value DECIMAL(8, 2),
start_date DATE,
end_date DATE,
PRIMARY KEY (promotion_sk)
);
-- Central fact table
CREATE TABLE fct_order_lines (
order_line_sk INTEGER NOT NULL,
date_sk INTEGER NOT NULL REFERENCES dim_date,
customer_sk INTEGER NOT NULL REFERENCES dim_customers,
product_sk INTEGER NOT NULL REFERENCES dim_products,
geography_sk INTEGER REFERENCES dim_geography,
promotion_sk INTEGER REFERENCES dim_promotions, -- NULL if no promo
-- Degenerate dimensions
order_id VARCHAR NOT NULL,
line_number INTEGER NOT NULL,
-- Measures
quantity INTEGER NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
line_total DECIMAL(12, 2) NOT NULL,
cost_price DECIMAL(10, 2),
gross_margin DECIMAL(12, 2),
PRIMARY KEY (order_line_sk)
);Summary
| Concept | Key Point |
|---------|----------|
| Grain | Most important decision — one row means exactly what? |
| Fact table | Measurements at the grain level |
| Dimension table | Descriptive attributes for filtering/grouping |
| Surrogate key | Integer PK, generated in warehouse, never from source |
| Date dimension | Pre-build it — never join on DATE_TRUNC |
| Conformed dimensions | Shared across fact tables — enables drill-across |
| Star schema | Prefer over snowflake for analytics |
| Additive facts | Sum across all dims — most common |
| Semi-additive | Sum across some dims, not time (balances) |
| Non-additive | Calculate from additive components |
Next: Slowly Changing Dimensions — tracking how data changes over time.