BigQuery for Data Engineers · Lesson 2 of 3
BigQuery SQL: UNNEST, Analytics, Scripting & ML
BigQuery SQL: Not Quite Standard SQL
BigQuery uses GoogleSQL (formerly Standard SQL) — largely ANSI-compatible but with important extensions and a few behavioural differences. This lesson covers what makes BigQuery's dialect distinct and how to exploit it in production analytics.
UNNEST: Working with Arrays
In BigQuery, arrays are first-class citizens. UNNEST expands an array column into rows, enabling joins between an array and the rest of the row.
-- Basic UNNEST: one row per tag
SELECT
post_id,
title,
tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;
-- UNNEST with offset to track array position
SELECT
post_id,
tag,
tag_position
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag WITH OFFSET AS tag_position
ORDER BY post_id, tag_position;
-- Filter: only posts that have the tag 'bigquery'
SELECT post_id, title
FROM `myproject.blog.posts`
WHERE 'bigquery' IN UNNEST(tags);
-- Build a histogram of tag frequency across all posts
SELECT
tag,
COUNT(*) AS post_count
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag
GROUP BY tag
ORDER BY post_count DESC
LIMIT 20;UNNEST on a STRUCT ARRAY
-- Schema: line_items ARRAY<STRUCT<product_id STRING, qty INT64, unit_price NUMERIC>>
SELECT
order_id,
item.product_id,
item.qty,
item.unit_price,
item.qty * item.unit_price AS line_total
FROM `myproject.sales.orders`,
UNNEST(line_items) AS item;STRUCT Access with Dot Notation
Navigate nested STRUCTs using . — no casting required.
-- address is STRUCT<street STRING, city STRING, country STRING, postal STRING>
SELECT
customer_id,
address.city,
address.country
FROM `myproject.crm.customers`
WHERE address.country = 'DE'
AND address.postal LIKE '1%';
-- Reconstruct a STRUCT with modified fields
SELECT
customer_id,
STRUCT(
address.street AS street,
address.city AS city,
UPPER(address.country) AS country, -- normalize country code
address.postal AS postal
) AS address
FROM `myproject.crm.customers`;SAFE Functions
Prefix any function with SAFE. to return NULL instead of raising an error on invalid input. Invaluable when cleaning messy ingested data.
-- SAFE.DIVIDE prevents division-by-zero errors
SELECT
product_id,
total_revenue,
order_count,
SAFE_DIVIDE(total_revenue, order_count) AS avg_order_value
FROM `myproject.sales.product_summary`;
-- SAFE.CAST avoids runtime errors on malformed strings
SELECT
row_id,
SAFE_CAST(raw_amount AS NUMERIC) AS amount,
SAFE_CAST(raw_date AS DATE) AS order_date
FROM `myproject.raw.staging_orders`;
-- SAFE.ST_GEOGFROMTEXT won't crash on invalid WKT strings
SELECT
location_id,
SAFE.ST_GEOGFROMTEXT(wkt_geometry) AS geo
FROM `myproject.spatial.raw_locations`;APPROX_COUNT_DISTINCT
Exact COUNT(DISTINCT ...) requires reading every value and maintaining a hash set. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm — 1 % error, 40× faster on large datasets.
-- Exact (expensive on billions of rows)
SELECT COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';
-- Approximate (cheap, fast, good enough for dashboards)
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';
-- Compare accuracy vs cost across 90 days
SELECT
event_date,
APPROX_COUNT_DISTINCT(user_id) AS approx_dau,
COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date
ORDER BY event_date;JSON_EXTRACT_SCALAR and JSON Functions
-- JSON stored as STRING column
SELECT
event_id,
JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
CAST(JSON_EXTRACT_SCALAR(payload, '$.duration_ms') AS INT64) AS duration_ms,
JSON_EXTRACT_SCALAR(payload, '$.page.url') AS page_url
FROM `myproject.events.raw`;
-- Native JSON type (BigQuery JSON column)
SELECT
event_id,
STRING(payload.user_id) AS user_id,
INT64(payload.duration_ms) AS duration_ms,
STRING(payload.page.url) AS page_url
FROM `myproject.events.structured`;
-- Convert JSON array to BigQuery ARRAY
SELECT
session_id,
JSON_EXTRACT_ARRAY(features_json, '$') AS feature_list
FROM `myproject.ml.training_data`;Window Functions: ROWS vs RANGE Frames
Window functions perform calculations across a sliding window of rows. The frame clause controls which rows are included.
ROWS Frame (physical row positions)
-- 7-day rolling sum: exactly 7 preceding rows + current row
SELECT
event_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM `myproject.sales.daily_summary`;RANGE Frame (logical value range)
-- RANGE-based frame: all rows with the same ORDER BY value
-- Useful when multiple rows share a timestamp
SELECT
user_id,
event_timestamp,
event_type,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW
) AS last_7d_user_revenue
FROM `myproject.analytics.user_events`;Ranking and Distribution Functions
SELECT
product_id,
category,
revenue,
-- Dense ranking within category
DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS category_rank,
-- Percentile within all products
PERCENT_RANK() OVER (ORDER BY revenue) AS percentile,
-- Divide into 10 buckets (deciles)
NTILE(10) OVER (ORDER BY revenue DESC) AS revenue_decile,
-- Lead/lag for period-over-period
LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month_revenue
FROM `myproject.sales.product_monthly`;WITH ROLLUP and WITH CUBE
GROUP BY ROLLUP computes subtotals at each grouping level. GROUP BY CUBE computes all possible subtotal combinations.
-- ROLLUP: subtotals per year, then per year+quarter, then grand total
SELECT
EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(QUARTER FROM order_date) AS quarter,
region,
SUM(order_total) AS revenue
FROM `myproject.sales.orders`
GROUP BY ROLLUP(EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date), region)
ORDER BY year, quarter, region;
-- CUBE: every combination of dimensions
SELECT
region,
product_category,
channel,
SUM(revenue) AS total_revenue,
-- GROUPING() returns 1 for rows that are subtotals on that dimension
GROUPING(region) AS is_all_regions,
GROUPING(product_category) AS is_all_categories
FROM `myproject.sales.order_items`
GROUP BY CUBE(region, product_category, channel)
ORDER BY is_all_regions, is_all_categories, region, product_category;BigQuery Scripting
BigQuery supports multi-statement scripts executed as a single job. Use the console or bq query to run them.
DECLARE and SET
DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
DECLARE end_date DATE DEFAULT CURRENT_DATE();
DECLARE row_count INT64;
SET row_count = (
SELECT COUNT(*)
FROM `myproject.analytics.events`
WHERE event_date BETWEEN start_date AND end_date
);
SELECT CONCAT('Rows in range: ', CAST(row_count AS STRING)) AS message;IF / ELSE
DECLARE rows_loaded INT64;
SET rows_loaded = (SELECT COUNT(*) FROM `myproject.staging.today_load`);
IF rows_loaded = 0 THEN
RAISE USING MESSAGE = 'No rows loaded today — aborting pipeline';
ELSEIF rows_loaded < 1000 THEN
SELECT CONCAT('Warning: only ', CAST(rows_loaded AS STRING), ' rows loaded') AS warning;
ELSE
INSERT INTO `myproject.analytics.events`
SELECT * FROM `myproject.staging.today_load`;
SELECT CONCAT('Loaded ', CAST(rows_loaded AS STRING), ' rows successfully') AS status;
END IF;LOOP and WHILE
DECLARE i INT64 DEFAULT 0;
DECLARE batch_date DATE DEFAULT '2026-01-01';
WHILE batch_date <= '2026-05-07' DO
-- Backfill one day at a time to control memory usage
INSERT INTO `myproject.analytics.daily_agg`
SELECT
batch_date AS agg_date,
COUNT(*) AS events,
APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM `myproject.events.raw`
WHERE event_date = batch_date;
SET batch_date = DATE_ADD(batch_date, INTERVAL 1 DAY);
SET i = i + 1;
END WHILE;
SELECT CONCAT('Backfilled ', CAST(i AS STRING), ' days') AS result;Stored Procedures
Stored procedures are named, reusable scripts stored in a BigQuery dataset.
-- Create the procedure
CREATE OR REPLACE PROCEDURE `myproject.utils.refresh_daily_summary`(
IN target_date DATE,
OUT rows_inserted INT64
)
BEGIN
-- Delete existing data for the target date
DELETE FROM `myproject.analytics.daily_summary`
WHERE summary_date = target_date;
-- Insert fresh aggregation
INSERT INTO `myproject.analytics.daily_summary`
SELECT
target_date AS summary_date,
region,
COUNT(*) AS order_count,
SUM(order_total) AS revenue,
APPROX_COUNT_DISTINCT(user_id) AS unique_buyers
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp) = target_date
GROUP BY region;
-- Return row count to caller
SET rows_inserted = (
SELECT COUNT(*) FROM `myproject.analytics.daily_summary`
WHERE summary_date = target_date
);
END;
-- Call the procedure
CALL `myproject.utils.refresh_daily_summary`(CURRENT_DATE(), @rows_out);
SELECT @rows_out AS rows_inserted;Scheduled Queries
Schedule a query to run on a cron-like schedule directly in BigQuery — no orchestration tool needed for simple use cases.
-- This SQL is the "scheduled query body"
-- Configure schedule via console or bq CLI:
-- bq mk --transfer_config --project_id=myproject \
-- --data_source=scheduled_query \
-- --display_name="Daily revenue refresh" \
-- --schedule="every 24 hours" \
-- --params='{"query":"..."}'
DECLARE today DATE DEFAULT DATE(CURRENT_TIMESTAMP(), 'America/New_York');
DELETE FROM `myproject.analytics.daily_summary` WHERE summary_date = today;
INSERT INTO `myproject.analytics.daily_summary`
SELECT
today AS summary_date,
region,
SUM(order_total) AS revenue
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp, 'America/New_York') = today
GROUP BY region;BigQuery ML: Models Inside the Warehouse
BigQuery ML lets you train and deploy ML models using SQL — no Python or external infrastructure required.
Linear Regression
CREATE OR REPLACE MODEL `myproject.ml.revenue_forecast`
OPTIONS (
model_type = 'LINEAR_REG',
input_label_cols = ['revenue'],
data_split_method = 'RANDOM',
data_split_eval_fraction = 0.2
) AS
SELECT
EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
EXTRACT(MONTH FROM order_date) AS month,
region,
marketing_spend,
revenue
FROM `myproject.sales.daily_by_region`
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31';
-- Evaluate the model
SELECT * FROM ML.EVALUATE(MODEL `myproject.ml.revenue_forecast`);
-- Run predictions
SELECT * FROM ML.PREDICT(
MODEL `myproject.ml.revenue_forecast`,
(
SELECT
EXTRACT(DAYOFWEEK FROM d) AS day_of_week,
EXTRACT(MONTH FROM d) AS month,
region,
marketing_spend
FROM UNNEST(GENERATE_DATE_ARRAY('2026-05-01', '2026-05-31')) AS d,
UNNEST(['NORTH', 'SOUTH', 'EAST', 'WEST']) AS region
CROSS JOIN (SELECT 50000 AS marketing_spend)
)
);Logistic Classification
CREATE OR REPLACE MODEL `myproject.ml.churn_predictor`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
l2_reg = 0.01
) AS
SELECT
days_since_signup,
total_orders,
avg_order_value,
days_since_last_order,
support_tickets,
churned
FROM `myproject.ml.customer_features`
WHERE partition_date = DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);Geospatial Queries with ST_ Functions
-- Find all stores within 10 km of a warehouse
SELECT
s.store_id,
s.store_name,
ROUND(ST_DISTANCE(s.location, w.location) / 1000, 2) AS distance_km
FROM `myproject.retail.stores` s
CROSS JOIN `myproject.logistics.warehouses` w
WHERE w.warehouse_id = 'WH-BERLIN'
AND ST_DWithin(s.location, w.location, 10000)
ORDER BY distance_km;
-- Calculate delivery zone coverage (store buffers)
SELECT
store_id,
ST_AREA(ST_BUFFER(location, 5000)) / 1e6 AS coverage_km2
FROM `myproject.retail.stores`;
-- Spatial join: which sales region does each event come from?
SELECT
e.event_id,
e.event_timestamp,
r.region_name
FROM `myproject.events.geotagged` e
JOIN `myproject.geo.sales_regions` r
ON ST_CONTAINS(r.boundary, e.location);
-- Cluster nearby events into groups
SELECT
ST_CLUSTERDBSCAN(location, 500, 5) OVER () AS cluster_id,
event_id,
event_timestamp
FROM `myproject.events.geotagged`
WHERE event_date = '2026-05-07';Wildcard Tables
Query multiple tables with a shared prefix using the * wildcard. Common for date-sharded table layouts (old pattern, still seen in many pipelines).
-- Query all daily shards for May 2026
SELECT
event_type,
COUNT(*) AS events
FROM `myproject.events.raw_2026_05_*`
GROUP BY event_type;
-- Use _TABLE_SUFFIX to filter specific shards
SELECT
_TABLE_SUFFIX AS shard_date,
COUNT(*) AS events
FROM `myproject.events.raw_2026_*`
WHERE _TABLE_SUFFIX BETWEEN '05_01' AND '05_07'
GROUP BY shard_date
ORDER BY shard_date;Prefer proper
PARTITION BYtables over date-sharded tables for new projects. Partitioned tables have better metadata, cost controls, and lifecycle management.
INFORMATION_SCHEMA for Cost Monitoring
-- Top 10 most expensive queries in the last 7 days
SELECT
user_email,
SUBSTR(query, 1, 100) AS query_preview,
ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
-- On-demand pricing: $6.25 per TB
ROUND(6.25 * total_bytes_billed / POW(1024, 4), 2) AS estimated_cost_usd,
ROUND(total_slot_ms / 3600000, 2) AS slot_hours,
creation_time
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 10;
-- Daily spend trend
SELECT
DATE(creation_time) AS query_date,
COUNT(*) AS query_count,
ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed,
ROUND(6.25 * SUM(total_bytes_billed) / POW(1024, 4), 2) AS estimated_daily_cost_usd
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
GROUP BY query_date
ORDER BY query_date;
-- Tables accessed most often (potential candidates for caching)
SELECT
referenced_table.dataset_id,
referenced_table.table_id,
COUNT(*) AS query_count,
ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
UNNEST(referenced_tables) AS referenced_table
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY referenced_table.dataset_id, referenced_table.table_id
ORDER BY query_count DESC
LIMIT 20;Complete Use Case: Cohort Retention Analysis
Cohort retention answers: "Of the users who first used the product in month X, how many were still active in months X+1, X+2, ...?"
This implementation uses BigQuery arrays and window functions to compute the full retention matrix in a single query.
-- Step 1: Identify each user's cohort (first activity month)
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC(MIN(event_date), MONTH) AS cohort_month
FROM `myproject.analytics.events`
WHERE event_date >= '2025-01-01'
AND event_type = 'session_start'
GROUP BY user_id
),
-- Step 2: Get all active months per user
user_activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC(event_date, MONTH) AS activity_month
FROM `myproject.analytics.events`
WHERE event_date >= '2025-01-01'
AND event_type = 'session_start'
),
-- Step 3: Join to get cohort month and period offset
cohort_activity AS (
SELECT
uc.cohort_month,
ua.activity_month,
DATE_DIFF(ua.activity_month, uc.cohort_month, MONTH) AS period_number,
uc.user_id
FROM user_cohorts uc
JOIN user_activity ua USING (user_id)
),
-- Step 4: Count cohort size (period 0 = cohort month itself)
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size
FROM cohort_activity
WHERE period_number = 0
GROUP BY cohort_month
),
-- Step 5: Count retained users per (cohort, period)
retention_counts AS (
SELECT
cohort_month,
period_number,
COUNT(DISTINCT user_id) AS retained_users
FROM cohort_activity
GROUP BY cohort_month, period_number
)
-- Step 6: Compute retention rate and pivot into final matrix
SELECT
r.cohort_month,
s.cohort_size,
r.period_number,
r.retained_users,
ROUND(100.0 * r.retained_users / s.cohort_size, 1) AS retention_pct,
-- Array of retention rates for the cohort (for charting)
ARRAY_AGG(
ROUND(100.0 * r.retained_users / s.cohort_size, 1)
ORDER BY r.period_number
) OVER (PARTITION BY r.cohort_month) AS retention_curve
FROM retention_counts r
JOIN cohort_sizes s USING (cohort_month)
WHERE r.period_number BETWEEN 0 AND 11 -- first 12 months
ORDER BY r.cohort_month, r.period_number;Reading the Output
| cohort_month | cohort_size | period_number | retained_users | retention_pct | |---|---|---|---|---| | 2025-01-01 | 4,820 | 0 | 4,820 | 100.0 | | 2025-01-01 | 4,820 | 1 | 2,731 | 56.7 | | 2025-01-01 | 4,820 | 2 | 1,944 | 40.3 | | 2025-02-01 | 5,103 | 0 | 5,103 | 100.0 | | 2025-02-01 | 5,103 | 1 | 2,968 | 58.2 |
The retention_curve array column contains the full series [100.0, 56.7, 40.3, ...] per cohort — easy to pass to a charting library without further transformation.
Extending: Week-over-Week Cohorts
-- Change DATE_TRUNC interval from MONTH to WEEK for weekly cohorts
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS cohort_week
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY user_id
)
-- ... rest of the query substituting cohort_week for cohort_month
-- and DATE_DIFF(..., WEEK) for DATE_DIFF(..., MONTH)Summary
BigQuery's SQL dialect rewards data engineers who understand its extensions:
- Use
UNNESTfor arrays — it is not optional, it is idiomatic BigQuery. - Prefer
APPROX_COUNT_DISTINCTover exactCOUNT(DISTINCT)in dashboards and monitoring queries. - Use
SAFE.prefix functions to harden ETL queries against dirty data. - Window functions with
RANGE BETWEEN INTERVALare perfect for time-series rolling calculations. INFORMATION_SCHEMA.JOBS_BY_PROJECTis your first stop when tracking down runaway query costs.
The next lesson covers the BigQuery Python client, pandas integration, and building production ETL pipelines with Airflow and dbt.