dbt Fundamentals: Models, Sources, Refs, and Materializations
Master dbt (data build tool) from scratch — project setup, SQL models, sources, refs, materializations, seeds, and the analytics engineering workflow used in modern data teams.
What Is dbt?
dbt (data build tool) is the transformation layer in modern ELT pipelines. It lets you write SQL SELECT statements and handles everything else: dependency resolution, running them in the right order, testing, documentation, and lineage.
The dbt mental model:
Source data in warehouse ──► dbt (SQL models) ──► Analytics-ready tables
dbt does NOT extract or load data. It transforms data that is already in your warehouse (Snowflake, BigQuery, Redshift, DuckDB).
1. Installation and Project Setup
pip install dbt-snowflake # or dbt-bigquery, dbt-redshift, dbt-duckdb
# Create a new project
dbt init my_project
cd my_project
Project structure
my_project/
├── dbt_project.yml # project config
├── profiles.yml # connection credentials (in ~/.dbt/)
├── models/
│ ├── staging/ # 1:1 with sources, light cleaning
│ │ ├── stg_orders.sql
│ │ └── schema.yml
│ ├── intermediate/ # joins, business logic (optional layer)
│ │ └── int_order_items.sql
│ └── marts/ # final business tables
│ ├── finance/
│ │ ├── fct_orders.sql
│ │ └── dim_customers.sql
│ └── schema.yml
├── seeds/ # static CSV files loaded as tables
│ └── country_codes.csv
├── snapshots/ # SCD Type 2 history tables
│ └── orders_snapshot.sql
├── tests/ # custom singular tests
│ └── assert_positive_revenue.sql
├── macros/ # reusable Jinja functions
│ └── generate_surrogate_key.sql
└── analyses/ # ad-hoc SQL (not materialized)
└── revenue_deep_dive.sql
dbt_project.yml
name: my_project
version: "1.0.0"
profile: my_profile
model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]
macro-paths: ["macros"]
models:
my_project:
staging:
+materialized: view # all staging models are views
+schema: staging
intermediate:
+materialized: ephemeral # no table created — inlined in queries
marts:
+materialized: table # marts are persisted as tables
finance:
+schema: finance
profiles.yml (~/.dbt/profiles.yml)
my_profile:
target: dev
outputs:
dev:
type: snowflake
account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
role: TRANSFORMER
database: ANALYTICS
warehouse: TRANSFORM_WH
schema: DEV_YOURNAME # each dev gets their own schema
threads: 4
prod:
type: snowflake
account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
role: TRANSFORMER
database: ANALYTICS
warehouse: TRANSFORM_WH
schema: MARTS
threads: 8
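Before building anything, confirm that dbt can load this profile and reach the warehouse:
dbt debug # validates dbt_project.yml, profiles.yml, and the connection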
2. Sources — Declaring Raw Data
Sources tell dbt where your raw data lives:
# models/staging/sources.yml
version: 2
sources:
- name: raw
database: ANALYTICS
schema: RAW
description: "Raw data from ingestion pipelines"
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Raw orders from the orders API"
loaded_at_field: _loaded_at # column dbt uses for freshness
columns:
- name: order_id
description: "Unique order identifier"
- name: amount
description: "Order amount (raw string from API)"
- name: status
- name: customers
description: "Raw customer records"
loaded_at_field: _loaded_at
- name: products
Check freshness:
dbt source freshness
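Freshness checks can be scoped to one source or table with the source: selector:
dbt source freshness --select source:raw.orders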
3. Your First Model
-- models/staging/stg_orders.sql
-- This model cleans raw orders: type casting, renaming, null handling
WITH source AS (
SELECT *
FROM {{ source('raw', 'orders') }} -- references the source defined in sources.yml
),
renamed AS (
SELECT
order_id AS order_id,
customer_id AS customer_id,
product_id AS product_id,
TRY_CAST(quantity AS INTEGER) AS quantity,
TRY_CAST(amount AS DECIMAL(12, 2)) AS amount_usd,
UPPER(TRIM(COALESCE(currency, 'USD'))) AS currency,
LOWER(TRIM(status)) AS status,
TRY_TO_TIMESTAMP_TZ(created_at) AS created_at,
_loaded_at
FROM source
WHERE order_id IS NOT NULL
AND amount IS NOT NULL
)
SELECT *
FROM renamed
WHERE amount_usd > 0
Run it:
dbt run --select stg_orders # run one model
dbt run # run all models
dbt run --select staging.* # run all staging models
dbt run --select +fct_orders # run fct_orders and all its upstream deps
dbt run --select fct_orders+ # run fct_orders and all downstream
4. ref() — The Core of dbt Dependency Management
ref() is how models reference other models. dbt builds a DAG from these references:
-- models/staging/stg_customers.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'customers') }}
)
SELECT
customer_id,
TRIM(LOWER(email)) AS email,
TRIM(full_name) AS full_name,
country_code,
created_at
FROM source
WHERE customer_id IS NOT NULL
-- models/marts/finance/fct_orders.sql
-- References stg_orders and stg_customers via ref()
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }} -- ref to staging model
),
customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
products AS (
SELECT * FROM {{ ref('stg_products') }}
)
SELECT
o.order_id,
o.customer_id,
c.email AS customer_email,
c.country_code,
o.product_id,
p.category AS product_category,
o.quantity,
o.amount_usd,
o.amount_usd * o.quantity AS line_total,
o.status,
o.created_at,
DATE_TRUNC('month', o.created_at) AS order_month
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
dbt knows fct_orders depends on stg_orders, stg_customers, and stg_products. It always builds them first.
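At compile time, each ref() resolves to the fully qualified relation name for the active target, so the same model runs against your dev schema and prod without edits. The compiled SQL looks roughly like this (illustrative; the actual names come from your profile's database and schema):
-- {{ ref('stg_orders') }} compiles to something like:
SELECT * FROM ANALYTICS.DEV_YOURNAME.stg_orders
You can inspect the exact compiled SQL under target/compiled/ after running dbt compile.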
5. Materializations
How dbt persists a model in the warehouse:
| Materialization | What it creates | When to use |
|----------------|----------------|-------------|
| view | SQL view — no data stored | Staging models, lightweight transforms |
| table | Physical table — full rebuild each run | Medium tables, final marts |
| incremental | Append/upsert only new rows | Large tables (millions of rows) |
| ephemeral | No table — inlined as CTE | Reusable logic, not queried directly |
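The project-level defaults shown earlier in dbt_project.yml can be overridden per model with a config block at the top of the file. For example, to force a single model to build as a table:
-- at the top of any model file
{{ config(materialized='table') }}
SELECT * FROM {{ source('raw', 'orders') }}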
Incremental models — the most important materialization
-- models/marts/finance/fct_orders.sql
-- Only processes new records on each run
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
on_schema_change='sync_all_columns',
)
}}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Only process rows newer than the latest record in the target
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
),
...
SELECT ...
FROM orders
JOIN ...
dbt run --select fct_orders # incremental run (normal)
dbt run --select fct_orders --full-refresh # rebuild from scratch
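One caveat: the strict MAX(created_at) filter above never picks up late-arriving rows whose created_at falls below the current high-water mark. A common variant (a sketch, assuming Snowflake's DATEADD) reprocesses a short lookback window and relies on the merge on unique_key to de-duplicate:
{% if is_incremental() %}
-- Reprocess the last 3 days to catch late-arriving data;
-- the merge strategy upserts on order_id, so the overlap is safe
WHERE created_at > (SELECT DATEADD('day', -3, MAX(created_at)) FROM {{ this }})
{% endif %}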
6. Seeds — Static CSV Data as Tables
# seeds/country_codes.csv
country_code,country_name,region
US,United States,Americas
GB,United Kingdom,Europe
NO,Norway,Europe
DE,Germany,Europe
dbt seed # loads seeds to warehouse as tables
dbt seed --select country_codes
Reference in models:
SELECT o.*, c.country_name, c.region
FROM {{ ref('fct_orders') }} o
JOIN {{ ref('country_codes') }} c ON o.country_code = c.country_code
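dbt infers seed column types from the CSV. If inference guesses wrong (for example, a code column that must stay text), pin the types in dbt_project.yml:
# dbt_project.yml
seeds:
  my_project:
    country_codes:
      +column_types:
        country_code: varchar(2)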
7. Schema.yml — Document and Test in One File
# models/marts/finance/schema.yml
version: 2
models:
- name: fct_orders
description: >
One row per completed order. Joins orders, customers, and products.
Primary grain is order_id. Updated incrementally.
columns:
- name: order_id
description: "Unique order identifier"
tests:
- not_null
- unique
- name: customer_id
description: "FK to dim_customers"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: amount_usd
description: "Order value in USD"
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: status
tests:
- accepted_values:
values: [pending, processing, shipped, delivered, cancelled]
- name: created_at
tests:
- not_null
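Note that dbt_utils.expression_is_true comes from the dbt_utils package, which must be declared in packages.yml (choose a version range compatible with your dbt version):
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
Run dbt deps to install it before dbt test or dbt build.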
8. Running and Key Commands
# Development workflow
dbt debug # test connection
dbt compile # compile SQL without running
dbt run # run all models
dbt test # run all tests
dbt build # run + test in one command (recommended)
dbt docs generate # generate documentation
dbt docs serve # serve docs at localhost:8080
# Selective runs
dbt run --select stg_orders
dbt run --select staging.*
dbt run --select +fct_orders # fct_orders + all upstream
dbt run --select fct_orders+ # fct_orders + all downstream
dbt run --select tag:finance # models with tag:finance
dbt run --exclude stg_products
# In production (CI/CD)
dbt build --target prod # use prod profile
dbt build --select state:modified+ --state path/to/prod-artifacts # only changed models + downstream; needs a previous manifest
9. Variables and Environments
# dbt_project.yml
vars:
start_date: "2024-01-01"
default_schema: "analytics"
payment_methods: ['credit_card', 'bank_transfer', 'paypal']
-- Use in models
WHERE created_at >= '{{ var("start_date") }}'
Pass at runtime:
dbt run --vars '{"start_date": "2026-01-01"}'
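var() also accepts a default as its second argument, so models still compile when no value is passed:
WHERE created_at >= '{{ var("start_date", "2024-01-01") }}'
For environment-specific logic, the built-in target variable exposes the active profile target:
{% if target.name != 'prod' %}
WHERE created_at >= DATEADD('day', -30, CURRENT_DATE) -- sample recent data in dev
{% endif %}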
Summary
| Concept | Purpose |
|---------|---------|
| source() | Reference raw tables with freshness tracking |
| ref() | Reference other models, builds dependency DAG |
| Staging models | 1:1 with source, light cleaning only |
| Mart models | Business logic, joins, aggregations |
| view | No storage — fast to build |
| table | Full rebuild — for smaller marts |
| incremental | Append/upsert only — for large tables |
| ephemeral | CTE only — no table created |
| Seeds | Static lookup tables from CSV |
| schema.yml | Tests + documentation in one file |
| dbt build | Run models + tests in correct order |
Next: dbt testing, documentation, and macros for reusable logic.