Data Engineering · Beginner

dbt Fundamentals: Models, Sources, Refs, and Materializations

Master dbt (data build tool) from scratch — project setup, SQL models, sources, refs, materializations, seeds, and the analytics engineering workflow used in modern data teams.

Learnixo · May 7, 2026 · 7 min read
dbt · analytics engineering · SQL · data transformation · Snowflake · ELT

What Is dbt?

dbt (data build tool) is the transformation layer in modern ELT pipelines. It lets you write SQL SELECT statements and handles everything else: dependency resolution, running them in the right order, testing, documentation, and lineage.

The dbt mental model:

Source data in warehouse ──► dbt (SQL models) ──► Analytics-ready tables

dbt does NOT extract or load data. It transforms data that is already in your warehouse (Snowflake, BigQuery, Redshift, DuckDB).


1. Installation and Project Setup

Bash
pip install dbt-snowflake   # or dbt-bigquery, dbt-redshift, dbt-duckdb

# Create a new project
dbt init my_project
cd my_project

Project structure

my_project/
├── dbt_project.yml          # project config
├── profiles.yml             # connection credentials (in ~/.dbt/)
├── models/
│   ├── staging/             # 1:1 with sources, light cleaning
│   │   ├── stg_orders.sql
│   │   └── schema.yml
│   ├── intermediate/        # joins, business logic (optional layer)
│   │   └── int_order_items.sql
│   └── marts/               # final business tables
│       ├── finance/
│       │   ├── fct_orders.sql
│       │   └── dim_customers.sql
│       └── schema.yml
├── seeds/                   # static CSV files loaded as tables
│   └── country_codes.csv
├── snapshots/               # SCD Type 2 history tables
│   └── orders_snapshot.sql
├── tests/                   # custom singular tests
│   └── assert_positive_revenue.sql
├── macros/                  # reusable Jinja functions
│   └── generate_surrogate_key.sql
└── analyses/                # ad-hoc SQL (not materialized)
    └── revenue_deep_dive.sql
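
The tree above lists macros/generate_surrogate_key.sql. A simplified sketch of such a macro is below; note that dbt_utils ships a production-grade generate_surrogate_key that handles more edge cases, so this is illustrative only:

```sql
-- macros/generate_surrogate_key.sql
-- Hashes a list of columns into one deterministic key (simplified sketch)
{% macro generate_surrogate_key(field_list) %}
    MD5(CONCAT(
        {%- for field in field_list %}
        COALESCE(CAST({{ field }} AS VARCHAR), '')
        {%- if not loop.last %}, '-',{% endif %}
        {%- endfor %}
    ))
{% endmacro %}
```

Used in a model as `{{ generate_surrogate_key(['order_id', 'product_id']) }}`.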

dbt_project.yml

YAML
name: my_project
version: "1.0.0"
profile: my_profile

model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]
macro-paths: ["macros"]

models:
  my_project:
    staging:
      +materialized: view         # all staging models are views
      +schema: staging
    intermediate:
      +materialized: ephemeral    # no table created; inlined into downstream queries
    marts:
      +materialized: table        # marts are persisted as tables
      finance:
        +schema: finance

profiles.yml (~/.dbt/profiles.yml)

YAML
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: DEV_YOURNAME     # each dev gets their own schema
      threads: 4

    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: MARTS
      threads: 8

2. Sources — Declaring Raw Data

Sources tell dbt where your raw data lives:

YAML
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: ANALYTICS
    schema: RAW
    description: "Raw data from ingestion pipelines"
    freshness:
      warn_after: {count: 6,  period: hour}
      error_after: {count: 24, period: hour}

    tables:
      - name: orders
        description: "Raw orders from the orders API"
        loaded_at_field: _loaded_at    # column dbt uses for freshness
        columns:
          - name: order_id
            description: "Unique order identifier"
          - name: amount
            description: "Order amount (raw string from API)"
          - name: status

      - name: customers
        description: "Raw customer records"
        loaded_at_field: _loaded_at

      - name: products

Check freshness:

Bash
dbt source freshness

3. Your First Model

SQL
-- models/staging/stg_orders.sql
-- This model cleans raw orders: type casting, renaming, null handling

WITH source AS (
    SELECT *
    FROM {{ source('raw', 'orders') }}   -- references the source defined in sources.yml
),

renamed AS (
    SELECT
        order_id                                    AS order_id,
        customer_id                                 AS customer_id,
        product_id                                  AS product_id,
        TRY_CAST(quantity AS INTEGER)               AS quantity,
        TRY_CAST(amount AS DECIMAL(12, 2))          AS amount_usd,
        UPPER(TRIM(COALESCE(currency, 'USD')))       AS currency,
        LOWER(TRIM(status))                         AS status,
        TRY_TO_TIMESTAMP_TZ(created_at)             AS created_at,
        _loaded_at
    FROM source
    WHERE order_id IS NOT NULL
      AND amount IS NOT NULL
)

SELECT *
FROM renamed
WHERE amount_usd > 0

Run it:

Bash
dbt run --select stg_orders     # run one model
dbt run                          # run all models
dbt run --select staging.*       # run all staging models
dbt run --select +fct_orders     # run fct_orders and all its upstream deps
dbt run --select fct_orders+     # run fct_orders and all downstream

4. ref() — The Core of dbt Dependency Management

ref() is how models reference other models. dbt builds a DAG from these references:

SQL
-- models/staging/stg_customers.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'customers') }}
)
SELECT
    customer_id,
    TRIM(LOWER(email))  AS email,
    TRIM(full_name)     AS full_name,
    country_code,
    created_at
FROM source
WHERE customer_id IS NOT NULL
SQL
-- models/marts/finance/fct_orders.sql
-- References stg_orders and stg_customers via ref()

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}         -- ref to staging model
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
)

SELECT
    o.order_id,
    o.customer_id,
    c.email                             AS customer_email,
    c.country_code,
    o.product_id,
    p.category                          AS product_category,
    o.quantity,
    o.amount_usd,
    o.amount_usd * o.quantity           AS line_total,
    o.status,
    o.created_at,
    DATE_TRUNC('month', o.created_at)   AS order_month
FROM orders         o
JOIN customers      c  ON o.customer_id = c.customer_id
JOIN products       p  ON o.product_id  = p.product_id

dbt knows fct_orders depends on stg_orders, stg_customers, and stg_products. It always builds them first.
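
The build order dbt derives from ref() calls is a topological sort of the model DAG. A minimal sketch in plain Python (using the standard library, not dbt's actual implementation):

```python
from graphlib import TopologicalSorter

# Upstream dependencies implied by each model's ref() calls
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "stg_products": set(),
    "fct_orders": {"stg_orders", "stg_customers", "stg_products"},
}

# static_order() yields each node only after all of its dependencies
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # staging models first, fct_orders last
```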


5. Materializations

How dbt persists a model in the warehouse:

| Materialization | What it creates | When to use |
|-----------------|-----------------|-------------|
| view | SQL view — no data stored | Staging models, lightweight transforms |
| table | Physical table — full rebuild each run | Medium tables, final marts |
| incremental | Append/upsert only new rows | Large tables (millions of rows) |
| ephemeral | No table — inlined as CTE | Reusable logic, not queried directly |
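
The folder-level defaults set in dbt_project.yml can be overridden per model with a config() block at the top of the file. A sketch:

```sql
-- Any model can override its folder's default materialization
{{ config(materialized='view') }}

SELECT * FROM {{ ref('stg_customers') }}
```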

Incremental models — the most important

SQL
-- models/marts/finance/fct_orders.sql
-- Only processes new records on each run

{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns',
    )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}

    {% if is_incremental() %}
        -- Only process rows newer than the latest record in the target
        WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
    {% endif %}
),

...

SELECT ...
FROM orders
JOIN ...
Bash
dbt run --select fct_orders            # incremental run (normal)
dbt run --select fct_orders --full-refresh  # rebuild from scratch
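
With incremental_strategy='merge' and a unique_key, dbt wraps the compiled model SQL in a warehouse MERGE statement. Conceptually it looks like this (illustrative shape only, not dbt's exact generated SQL; table and column names follow the examples above):

```sql
MERGE INTO analytics.marts.fct_orders AS tgt
USING (
    -- compiled model SQL, filtered by the is_incremental() block
    SELECT * FROM new_orders
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    status     = src.status,
    amount_usd = src.amount_usd
WHEN NOT MATCHED THEN INSERT (order_id, status, amount_usd)
    VALUES (src.order_id, src.status, src.amount_usd)
```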

6. Seeds — Static CSV Data as Tables

CSV
# seeds/country_codes.csv
country_code,country_name,region
US,United States,Americas
GB,United Kingdom,Europe
NO,Norway,Europe
DE,Germany,Europe
Bash
dbt seed                    # loads seeds to warehouse as tables
dbt seed --select country_codes

Reference in models:

SQL
SELECT o.*, c.country_name, c.region
FROM {{ ref('fct_orders') }} o
JOIN {{ ref('country_codes') }} c ON o.country_code = c.country_code

7. Schema.yml — Document and Test in One File

YAML
# models/marts/finance/schema.yml
version: 2

models:
  - name: fct_orders
    description: >
      One row per completed order. Joins orders, customers, and products.
      Primary grain is order_id. Updated incrementally.

    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - not_null
          - unique

      - name: customer_id
        description: "FK to dim_customers"
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: amount_usd
        description: "Order value in USD"
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

      - name: status
        tests:
          - accepted_values:
              values: [pending, processing, shipped, delivered, cancelled]

      - name: created_at
        tests:
          - not_null
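
Generic tests like not_null and unique cover most cases. The project tree also includes tests/assert_positive_revenue.sql, a singular test: any SELECT that returns failing rows, and the test fails if anything comes back. A sketch of what it might contain:

```sql
-- tests/assert_positive_revenue.sql
-- Fails if any order has non-positive revenue
SELECT
    order_id,
    amount_usd
FROM {{ ref('fct_orders') }}
WHERE amount_usd <= 0
```

Both generic and singular tests run with `dbt test`.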

8. Running and Key Commands

Bash
# Development workflow
dbt debug                    # test connection
dbt compile                  # compile SQL without running
dbt run                      # run all models
dbt test                     # run all tests
dbt build                    # run + test in one command (recommended)
dbt docs generate            # generate documentation
dbt docs serve               # serve docs at localhost:8080

# Selective runs
dbt run --select stg_orders
dbt run --select staging.*
dbt run --select +fct_orders        # fct_orders + all upstream
dbt run --select fct_orders+        # fct_orders + all downstream
dbt run --select tag:finance        # models with tag:finance
dbt run --exclude stg_products

# In production (CI/CD)
dbt build --target prod             # use prod profile
dbt build --select state:modified+  # only changed models + their downstream

9. Variables and Environments

YAML
# dbt_project.yml
vars:
  start_date: "2024-01-01"
  default_schema: "analytics"
  payment_methods: ['credit_card', 'bank_transfer', 'paypal']
SQL
-- Use in models
WHERE created_at >= '{{ var("start_date") }}'

Bash
# Override at runtime
dbt run --vars '{"start_date": "2026-01-01"}'
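
var() also accepts an optional default as its second argument, so models still compile when the variable is not passed at runtime:

```sql
WHERE created_at >= '{{ var("start_date", "2024-01-01") }}'
```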

Summary

| Concept | Purpose |
|---------|---------|
| source() | Reference raw tables with freshness tracking |
| ref() | Reference other models, builds dependency DAG |
| Staging models | 1:1 with source, light cleaning only |
| Mart models | Business logic, joins, aggregations |
| view | No storage — fast to build |
| table | Full rebuild — for smaller marts |
| incremental | Append/upsert only — for large tables |
| ephemeral | CTE only — no table created |
| Seeds | Static lookup tables from CSV |
| schema.yml | Tests + documentation in one file |
| dbt build | Run models + tests in correct order |

Next: dbt testing, documentation, and macros for reusable logic.
