dbt: Analytics Engineering · Lesson 1 of 4
dbt Fundamentals: Models, Sources, Refs, and Materializations
What Is dbt?
dbt (data build tool) is the transformation layer in modern ELT pipelines. It lets you write SQL SELECT statements and handles everything else: dependency resolution, running them in the right order, testing, documentation, and lineage.
The dbt mental model:
Source data in warehouse ──► dbt (SQL models) ──► Analytics-ready tablesdbt does NOT extract or load data. It transforms data that is already in your warehouse (Snowflake, BigQuery, Redshift, DuckDB).
1. Installation and Project Setup
pip install dbt-snowflake # or dbt-bigquery, dbt-redshift, dbt-duckdb
# Create a new project
dbt init my_project
cd my_projectProject structure
my_project/
├── dbt_project.yml # project config
├── profiles.yml # connection credentials (in ~/.dbt/)
├── models/
│ ├── staging/ # 1:1 with sources, light cleaning
│ │ ├── stg_orders.sql
│ │ └── schema.yml
│ ├── intermediate/ # joins, business logic (optional layer)
│ │ └── int_order_items.sql
│ └── marts/ # final business tables
│ ├── finance/
│ │ ├── fct_orders.sql
│ │ └── dim_customers.sql
│ └── schema.yml
├── seeds/ # static CSV files loaded as tables
│ └── country_codes.csv
├── snapshots/ # SCD Type 2 history tables
│ └── orders_snapshot.sql
├── tests/ # custom singular tests
│ └── assert_positive_revenue.sql
├── macros/ # reusable Jinja functions
│ └── generate_surrogate_key.sql
└── analyses/ # ad-hoc SQL (not materialized)
└── revenue_deep_dive.sqldbt_project.yml
name: my_project
version: "1.0.0"
profile: my_profile
model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]
macro-paths: ["macros"]
models:
my_project:
staging:
+materialized: view # all staging models are views
+schema: staging
intermediate:
+materialized: ephemeral # no table created — inlined in queries
marts:
+materialized: table # marts are persisted as tables
finance:
+schema: financeprofiles.yml (~/.dbt/profiles.yml)
my_profile:
target: dev
outputs:
dev:
type: snowflake
account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
role: TRANSFORMER
database: ANALYTICS
warehouse: TRANSFORM_WH
schema: DEV_YOURNAME # each dev gets their own schema
threads: 4
prod:
type: snowflake
account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
role: TRANSFORMER
database: ANALYTICS
warehouse: TRANSFORM_WH
schema: MARTS
threads: 82. Sources — Declaring Raw Data
Sources tell dbt where your raw data lives:
# models/staging/sources.yml
version: 2
sources:
- name: raw
database: ANALYTICS
schema: RAW
description: "Raw data from ingestion pipelines"
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Raw orders from the orders API"
loaded_at_field: _loaded_at # column dbt uses for freshness
columns:
- name: order_id
description: "Unique order identifier"
- name: amount
description: "Order amount (raw string from API)"
- name: status
- name: customers
description: "Raw customer records"
loaded_at_field: _loaded_at
- name: productsCheck freshness:
dbt source freshness3. Your First Model
-- models/staging/stg_orders.sql
-- This model cleans raw orders: type casting, renaming, null handling
WITH source AS (
SELECT *
FROM {{ source('raw', 'orders') }} -- references the source defined in sources.yml
),
renamed AS (
SELECT
order_id AS order_id,
customer_id AS customer_id,
product_id AS product_id,
TRY_CAST(quantity AS INTEGER) AS quantity,
TRY_CAST(amount AS DECIMAL(12, 2)) AS amount_usd,
UPPER(TRIM(COALESCE(currency, 'USD'))) AS currency,
LOWER(TRIM(status)) AS status,
TRY_TO_TIMESTAMP_TZ(created_at) AS created_at,
_loaded_at
FROM source
WHERE order_id IS NOT NULL
AND amount IS NOT NULL
)
SELECT *
FROM renamed
WHERE amount_usd > 0Run it:
dbt run --select stg_orders # run one model
dbt run # run all models
dbt run --select staging.* # run all staging models
dbt run --select +fct_orders # run fct_orders and all its upstream deps
dbt run --select fct_orders+ # run fct_orders and all downstream4. ref() — The Core of dbt Dependency Management
ref() is how models reference other models. dbt builds a DAG from these references:
-- models/staging/stg_customers.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'customers') }}
)
SELECT
customer_id,
TRIM(LOWER(email)) AS email,
TRIM(full_name) AS full_name,
country_code,
created_at
FROM source
WHERE customer_id IS NOT NULL-- models/marts/finance/fct_orders.sql
-- References stg_orders and stg_customers via ref()
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }} -- ref to staging model
),
customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
products AS (
SELECT * FROM {{ ref('stg_products') }}
)
SELECT
o.order_id,
o.customer_id,
c.email AS customer_email,
c.country_code,
o.product_id,
p.category AS product_category,
o.quantity,
o.amount_usd,
o.amount_usd * o.quantity AS line_total,
o.status,
o.created_at,
DATE_TRUNC('month', o.created_at) AS order_month
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_iddbt knows fct_orders depends on stg_orders, stg_customers, and stg_products. It always builds them first.
5. Materializations
How dbt persists a model in the warehouse:
| Materialization | What it creates | When to use |
|----------------|----------------|-------------|
| view | SQL view — no data stored | Staging models, lightweight transforms |
| table | Physical table — full rebuild each run | Medium tables, final marts |
| incremental | Append/upsert only new rows | Large tables (millions of rows) |
| ephemeral | No table — inlined as CTE | Reusable logic, not queried directly |
Incremental models — the most important
-- models/marts/finance/fct_orders.sql
-- Only processes new records on each run
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
on_schema_change='sync_all_columns',
)
}}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Only process rows newer than the latest record in the target
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
),
...
SELECT ...
FROM orders
JOIN ...dbt run --select fct_orders # incremental run (normal)
dbt run --select fct_orders --full-refresh # rebuild from scratch6. Seeds — Static CSV Data as Tables
# seeds/country_codes.csv
country_code,country_name,region
US,United States,Americas
GB,United Kingdom,Europe
NO,Norway,Europe
DE,Germany,Europedbt seed # loads seeds to warehouse as tables
dbt seed --select country_codesReference in models:
SELECT o.*, c.country_name, c.region
FROM {{ ref('fct_orders') }} o
JOIN {{ ref('country_codes') }} c ON o.country_code = c.country_code7. Schema.yml — Document and Test in One File
# models/marts/finance/schema.yml
version: 2
models:
- name: fct_orders
description: >
One row per completed order. Joins orders, customers, and products.
Primary grain is order_id. Updated incrementally.
columns:
- name: order_id
description: "Unique order identifier"
tests:
- not_null
- unique
- name: customer_id
description: "FK to dim_customers"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: amount_usd
description: "Order value in USD"
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: status
tests:
- accepted_values:
values: [pending, processing, shipped, delivered, cancelled]
- name: created_at
tests:
- not_null8. Running and Key Commands
# Development workflow
dbt debug # test connection
dbt compile # compile SQL without running
dbt run # run all models
dbt test # run all tests
dbt build # run + test in one command (recommended)
dbt docs generate # generate documentation
dbt docs serve # serve docs at localhost:8080
# Selective runs
dbt run --select stg_orders
dbt run --select staging.*
dbt run --select +fct_orders # fct_orders + all upstream
dbt run --select fct_orders+ # fct_orders + all downstream
dbt run --select tag:finance # models with tag:finance
dbt run --exclude stg_products
# In production (CI/CD)
dbt build --target prod # use prod profile
dbt build --select state:modified+ # only changed models + their downstream9. Variables and Environments
# dbt_project.yml
vars:
start_date: "2024-01-01"
default_schema: "analytics"
payment_methods: ['credit_card', 'bank_transfer', 'paypal']-- Use in models
WHERE created_at >= '{{ var("start_date") }}'
-- Pass at runtime
dbt run --vars '{"start_date": "2026-01-01"}'Summary
| Concept | Purpose |
|---------|---------|
| source() | Reference raw tables with freshness tracking |
| ref() | Reference other models, builds dependency DAG |
| Staging models | 1:1 with source, light cleaning only |
| Mart models | Business logic, joins, aggregations |
| view | No storage — fast to build |
| table | Full rebuild — for smaller marts |
| incremental | Append/upsert only — for large tables |
| ephemeral | CTE only — no table created |
| Seeds | Static lookup tables from CSV |
| schema.yml | Tests + documentation in one file |
| dbt build | Run models + tests in correct order |
Next: dbt testing, documentation, and macros for reusable logic.