Dimensional Data Modelling: Star Schema, Facts, and Dimensions
Master dimensional modelling ā star and snowflake schemas, fact tables, dimension tables, grain, surrogate keys, conformed dimensions, and the design principles used in real data warehouses.
Why Data Modelling Separates Good Engineers from Great Ones
You can write perfect SQL and build flawless pipelines and still deliver a data warehouse that analysts hate. Dimensional modelling is the discipline of designing warehouse schemas that are:
- Easy to query ā analysts can write SQL without needing to understand ETL
- Performant ā simple joins, pre-aggregated where needed
- Consistent ā same definitions used everywhere
- Extensible ā add new sources without breaking existing reports
The Kimball methodology (Ralph Kimball) is the industry standard for analytical data modelling.
1. OLTP vs OLAP
OLTP (Online Transaction Processing) ā optimised for writes:
-- Normalised 3NF schema ā good for writes, bad for analytics
orders ā order_items ā products ā product_categories
ā customers ā addresses ā countriesOLAP (Online Analytical Processing) ā optimised for reads:
-- Denormalised star schema ā fewer joins, faster analytics
fct_orders ā dim_customers
ā dim_products
ā dim_date
ā dim_geography2. The Star Schema
A star schema has one central fact table surrounded by dimension tables:
dim_date
ā
ā
dim_customers āāāā fct_orders āāāā dim_products
ā
ā
dim_geographyIt looks like a star. That's it. The fact table holds measurements; dimensions hold descriptive context.
3. Fact Tables
A fact table stores measurements (metrics, events) at a specific grain.
Grain ā the most important decision
The grain is the exact meaning of one row in the fact table.
Bad grain definition: "Order information" Good grain definition: "One row per order line item ā a single product on a single order"
-- fct_order_lines (grain: one row per product per order)
CREATE TABLE fct_order_lines (
-- Surrogate keys (FK to dimensions)
order_line_sk INTEGER NOT NULL, -- surrogate key
order_sk INTEGER NOT NULL, -- FK ā dim_orders
product_sk INTEGER NOT NULL, -- FK ā dim_products
customer_sk INTEGER NOT NULL, -- FK ā dim_customers
date_sk INTEGER NOT NULL, -- FK ā dim_date
geography_sk INTEGER NOT NULL, -- FK ā dim_geography
-- Degenerate dimensions (no separate dim table needed)
order_id VARCHAR NOT NULL, -- source order ID
line_number INTEGER NOT NULL,
-- Facts (measures)
quantity INTEGER NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
line_total DECIMAL(12, 2) NOT NULL, -- quantity * unit_price - discount
-- Audit
_loaded_at TIMESTAMP_TZ DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (order_line_sk)
);Types of facts
| Type | Description | Example | |------|-------------|---------| | Additive | Sum across all dimensions | Revenue, quantity | | Semi-additive | Sum across some dimensions (not date) | Account balance (don't sum across dates) | | Non-additive | Never summed | Ratios, percentages, prices |
-- Additive: total revenue across any combination of dimensions
SELECT SUM(line_total) FROM fct_order_lines WHERE date_sk IN (...)
-- Semi-additive: account balance ā latest value, not sum
SELECT AVG(balance) FROM fct_account_balances GROUP BY account_id -- OK
SELECT SUM(balance) FROM fct_account_balances -- WRONG: sums history
-- Non-additive: margin % ā calculate from additive components
SELECT SUM(revenue) / SUM(cost) - 1 AS margin_pct -- CORRECT
SELECT AVG(margin_pct) -- WRONG: average of percentagesFact table types
| Type | Grain | Example | |------|-------|---------| | Transaction | One event | Each order line | | Periodic snapshot | State at regular intervals | Daily inventory count | | Accumulating snapshot | One row tracks lifecycle | Order from placed ā shipped ā delivered |
4. Dimension Tables
Dimension tables hold the descriptive context for facts. They have wide rows (many columns) and fewer rows than fact tables.
-- dim_customers
CREATE TABLE dim_customers (
customer_sk INTEGER NOT NULL, -- surrogate key (PK)
customer_id VARCHAR NOT NULL, -- natural/business key
full_name VARCHAR,
email VARCHAR,
city VARCHAR,
country_code CHAR(3),
country_name VARCHAR,
customer_segment VARCHAR, -- Bronze / Silver / Gold
acquisition_channel VARCHAR,
first_order_date DATE,
is_active BOOLEAN DEFAULT TRUE,
-- SCD Type 2 fields (if tracking history)
valid_from TIMESTAMP_TZ,
valid_to TIMESTAMP_TZ,
is_current BOOLEAN DEFAULT TRUE,
PRIMARY KEY (customer_sk)
);-- dim_products
CREATE TABLE dim_products (
product_sk INTEGER NOT NULL,
product_id VARCHAR NOT NULL,
product_name VARCHAR NOT NULL,
sku VARCHAR,
category VARCHAR,
subcategory VARCHAR,
brand VARCHAR,
unit_cost DECIMAL(10, 2),
unit_price DECIMAL(10, 2),
is_active BOOLEAN,
PRIMARY KEY (product_sk)
);5. The Date Dimension
The date dimension is the most important dimension. Never join on DATE_TRUNC ā pre-build it.
-- dim_date (populated once, covers years ahead)
CREATE TABLE dim_date (
date_sk INTEGER NOT NULL, -- surrogate: YYYYMMDD
date_actual DATE NOT NULL,
day_of_week INTEGER, -- 1=Monday, 7=Sunday
day_name VARCHAR(9), -- 'Monday'
day_of_month INTEGER,
day_of_year INTEGER,
week_number INTEGER,
month_number INTEGER,
month_name VARCHAR(9),
quarter INTEGER, -- 1-4
year INTEGER,
is_weekend BOOLEAN,
is_holiday BOOLEAN, -- populate manually per country
fiscal_year INTEGER, -- if fiscal year ā calendar year
fiscal_quarter INTEGER,
PRIMARY KEY (date_sk)
);
-- Populate with a recursive CTE or date spine
INSERT INTO dim_date
WITH RECURSIVE dates AS (
SELECT '2020-01-01'::DATE AS d
UNION ALL
SELECT DATEADD('day', 1, d) FROM dates WHERE d < '2030-12-31'
)
SELECT
TO_NUMBER(TO_VARCHAR(d, 'YYYYMMDD')) AS date_sk,
d AS date_actual,
DAYOFWEEKISO(d) AS day_of_week,
DAYNAME(d) AS day_name,
DAY(d) AS day_of_month,
DAYOFYEAR(d) AS day_of_year,
WEEKOFYEAR(d) AS week_number,
MONTH(d) AS month_number,
MONTHNAME(d) AS month_name,
QUARTER(d) AS quarter,
YEAR(d) AS year,
DAYOFWEEKISO(d) >= 6 AS is_weekend,
FALSE AS is_holiday
FROM dates;6. Surrogate Keys vs Natural Keys
Natural key: the business identifier (order_id = ORD-00123, customer_id = CUST-456)
Surrogate key: a system-generated integer (1, 2, 3, ...) used as the warehouse PK
| | Natural Key | Surrogate Key | |--|-------------|--------------| | Size | Variable (string) | Small (integer) | | Stability | Can change in source | Never changes | | SCD support | Hard (key is business identity) | Easy (one SK per version) | | Join performance | Slower (string comparison) | Faster (integer comparison) | | Independence from source | No | Yes |
Always use surrogate keys in dimension tables. Keep the natural key as an attribute.
-- Generate surrogate key in dbt
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_sk
-- Or: use an incrementing sequence (Snowflake)
customer_id.nextval AS customer_sk
-- Or: hash the business key (most portable)
MD5(customer_id) AS customer_sk7. Conformed Dimensions
A conformed dimension is shared across multiple fact tables. This is what allows you to "drill across" ā joining two fact tables through a shared dimension.
fct_orders āāāā dim_customers āāāā fct_support_ticketsIf dim_customers means the same thing in orders and support tickets, analysts can combine:
-- Drill across: revenue and ticket count per customer segment
SELECT
c.customer_segment,
SUM(o.line_total) AS total_revenue,
COUNT(t.ticket_id) AS ticket_count
FROM dim_customers c
LEFT JOIN fct_order_lines o ON o.customer_sk = c.customer_sk
LEFT JOIN fct_support_tickets t ON t.customer_sk = c.customer_sk
WHERE c.is_current = TRUE
GROUP BY 1
ORDER BY 2 DESC8. Snowflake Schema vs Star Schema
A snowflake schema normalises dimensions further:
fct_orders ā dim_products ā dim_product_categories ā dim_departmentsvs star schema:
fct_orders ā dim_products (includes category, department as columns)Prefer star schemas for analytics:
- Fewer joins = faster queries
- Simpler for analysts
- Only snowflake if dimensions are very large and heavily duplicated
9. Junk Dimensions
When you have many low-cardinality flags/codes that don't belong in any specific dimension:
-- Instead of many flag columns in fact table:
-- is_mobile, is_weekend_order, is_first_order, is_promotion_applied, payment_type
-- Create a junk dimension:
CREATE TABLE dim_order_flags (
flags_sk INTEGER NOT NULL,
is_mobile BOOLEAN,
is_weekend_order BOOLEAN,
is_first_order BOOLEAN,
is_promotion BOOLEAN,
payment_type VARCHAR, -- credit_card, bank_transfer, etc.
PRIMARY KEY (flags_sk)
);
-- Fact table then has: flags_sk ā FK to dim_order_flags10. Complete Star Schema Example
-- Full star schema for e-commerce analytics
-- Dimensions
CREATE TABLE dim_date (...); -- date dimension
CREATE TABLE dim_customers (...); -- customer attributes
CREATE TABLE dim_products (...); -- product catalog
CREATE TABLE dim_geography ( -- shipping destination
geography_sk INTEGER NOT NULL,
city VARCHAR,
region VARCHAR,
country_code CHAR(3),
country_name VARCHAR,
continent VARCHAR,
PRIMARY KEY (geography_sk)
);
CREATE TABLE dim_promotions (
promotion_sk INTEGER NOT NULL,
promotion_id VARCHAR,
promotion_name VARCHAR,
discount_type VARCHAR, -- percentage / fixed
discount_value DECIMAL(8, 2),
start_date DATE,
end_date DATE,
PRIMARY KEY (promotion_sk)
);
-- Central fact table
CREATE TABLE fct_order_lines (
order_line_sk INTEGER NOT NULL,
date_sk INTEGER NOT NULL REFERENCES dim_date,
customer_sk INTEGER NOT NULL REFERENCES dim_customers,
product_sk INTEGER NOT NULL REFERENCES dim_products,
geography_sk INTEGER REFERENCES dim_geography,
promotion_sk INTEGER REFERENCES dim_promotions, -- NULL if no promo
-- Degenerate dimensions
order_id VARCHAR NOT NULL,
line_number INTEGER NOT NULL,
-- Measures
quantity INTEGER NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
line_total DECIMAL(12, 2) NOT NULL,
cost_price DECIMAL(10, 2),
gross_margin DECIMAL(12, 2),
PRIMARY KEY (order_line_sk)
);Summary
| Concept | Key Point |
|---------|----------|
| Grain | Most important decision ā one row means exactly what? |
| Fact table | Measurements at the grain level |
| Dimension table | Descriptive attributes for filtering/grouping |
| Surrogate key | Integer PK, generated in warehouse, never from source |
| Date dimension | Pre-build it ā never join on DATE_TRUNC |
| Conformed dimensions | Shared across fact tables ā enables drill-across |
| Star schema | Prefer over snowflake for analytics |
| Additive facts | Sum across all dims ā most common |
| Semi-additive | Sum across some dims, not time (balances) |
| Non-additive | Calculate from additive components |
Next: Slowly Changing Dimensions ā tracking how data changes over time.
Enjoyed this article?
Explore the Data Engineering learning path for more.
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.