BigQuery SQL Analytics: Advanced Patterns for Data Engineers
Go beyond basic SELECT — master BigQuery's SQL dialect including UNNEST, window functions, scripting, stored procedures, BigQuery ML, geospatial queries, and a complete cohort retention analysis.
BigQuery SQL: Not Quite Standard SQL
BigQuery uses GoogleSQL (formerly Standard SQL) — largely ANSI-compatible but with important extensions and a few behavioural differences. This lesson covers what makes BigQuery's dialect distinct and how to exploit it in production analytics.
UNNEST: Working with Arrays
In BigQuery, arrays are first-class citizens. UNNEST expands an array column into rows, enabling joins between an array and the rest of the row.
-- Basic UNNEST: one row per tag
SELECT
post_id,
title,
tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;
-- UNNEST with offset to track array position
SELECT
post_id,
tag,
tag_position
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag WITH OFFSET AS tag_position
ORDER BY post_id, tag_position;
-- Filter: only posts that have the tag 'bigquery'
SELECT post_id, title
FROM `myproject.blog.posts`
WHERE 'bigquery' IN UNNEST(tags);
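UNNEST's row expansion maps directly onto a nested loop. This toy Python sketch (hypothetical in-memory rows standing in for `myproject.blog.posts`) mirrors the three queries above: basic UNNEST, WITH OFFSET, and the IN UNNEST membership filter.

```python
# Toy rows standing in for `myproject.blog.posts` (hypothetical data).
posts = [
    {"post_id": 1, "title": "Intro", "tags": ["bigquery", "sql"]},
    {"post_id": 2, "title": "Arrays", "tags": ["sql"]},
]

# Basic UNNEST: one output row per (post, tag) pair.
unnested = [(p["post_id"], p["title"], tag) for p in posts for tag in p["tags"]]

# UNNEST ... WITH OFFSET: enumerate() plays the role of the offset column.
with_offset = [(p["post_id"], tag, pos)
               for p in posts
               for pos, tag in enumerate(p["tags"])]

# WHERE 'bigquery' IN UNNEST(tags): a membership test against the array.
matching = [p["post_id"] for p in posts if "bigquery" in p["tags"]]

print(unnested)     # three rows: post 1 expands to two, post 2 to one
print(with_offset)
print(matching)     # [1]
```

The cross-join-with-UNNEST syntax in the SQL is exactly this double loop: outer over rows, inner over each row's array.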
-- Build a histogram of tag frequency across all posts
SELECT
tag,
COUNT(*) AS post_count
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag
GROUP BY tag
ORDER BY post_count DESC
LIMIT 20;
UNNEST on a STRUCT ARRAY
-- Schema: line_items ARRAY<STRUCT<product_id STRING, qty INT64, unit_price NUMERIC>>
SELECT
order_id,
item.product_id,
item.qty,
item.unit_price,
item.qty * item.unit_price AS line_total
FROM `myproject.sales.orders`,
UNNEST(line_items) AS item;
STRUCT Access with Dot Notation
Navigate nested STRUCTs using . — no casting required.
-- address is STRUCT<street STRING, city STRING, country STRING, postal STRING>
SELECT
customer_id,
address.city,
address.country
FROM `myproject.crm.customers`
WHERE address.country = 'DE'
AND address.postal LIKE '1%';
-- Reconstruct a STRUCT with modified fields
SELECT
customer_id,
STRUCT(
address.street AS street,
address.city AS city,
UPPER(address.country) AS country, -- normalize country code
address.postal AS postal
) AS address
FROM `myproject.crm.customers`;
SAFE Functions
Prefix a scalar function with SAFE. to return NULL instead of raising an error on invalid input. A few safe variants, such as SAFE_DIVIDE and SAFE_CAST, are standalone functions rather than SAFE.-prefixed forms. Invaluable when cleaning messy ingested data.
-- SAFE_DIVIDE prevents division-by-zero errors
SELECT
product_id,
total_revenue,
order_count,
SAFE_DIVIDE(total_revenue, order_count) AS avg_order_value
FROM `myproject.sales.product_summary`;
-- SAFE_CAST avoids runtime errors on malformed strings
SELECT
row_id,
SAFE_CAST(raw_amount AS NUMERIC) AS amount,
SAFE_CAST(raw_date AS DATE) AS order_date
FROM `myproject.raw.staging_orders`;
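The NULL-on-failure semantics are easy to emulate when prototyping cleanup logic outside the warehouse. A hedged Python sketch (the `safe` helper and sample rows are hypothetical, not a BigQuery API):

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def safe(converter: Callable[[str], T], raw: str) -> Optional[T]:
    """Mimic SAFE_CAST: return None instead of raising on bad input."""
    try:
        return converter(raw)
    except (ValueError, InvalidOperation, TypeError):
        return None

# Hypothetical staging rows: (raw_amount, raw_date)
rows = [("12.50", "2026-05-07"), ("N/A", "not-a-date")]
cleaned = [(safe(Decimal, amt), safe(date.fromisoformat, d)) for amt, d in rows]
print(cleaned)  # second row becomes (None, None) rather than crashing the batch
```

Just as in the SQL version, a malformed row degrades to NULLs you can filter or flag downstream instead of failing the whole query.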
-- SAFE.ST_GEOGFROMTEXT won't crash on invalid WKT strings
SELECT
location_id,
SAFE.ST_GEOGFROMTEXT(wkt_geometry) AS geo
FROM `myproject.spatial.raw_locations`;
APPROX_COUNT_DISTINCT
Exact COUNT(DISTINCT ...) must read and remember every distinct value. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm: typically around 1 % error from a small, fixed-size sketch, making it dramatically faster and cheaper on large datasets.
-- Exact (expensive on billions of rows)
SELECT COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';
-- Approximate (cheap, fast, good enough for dashboards)
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';
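To build intuition for why the estimate is so cheap, here is a toy HyperLogLog in Python: hash each value, use the first p bits to pick a register, track the maximum leading-zero run in the remaining bits, and combine registers with a harmonic mean. This is illustrative only; BigQuery uses the refined HyperLogLog++ variant, and the constants below are the textbook HLL ones.

```python
import hashlib
import math

def _hash64(value: str) -> int:
    # Stable 64-bit hash derived from SHA-256 (any well-mixed hash works).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def approx_count_distinct(values, p: int = 12) -> float:
    """Toy HyperLogLog estimator with m = 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        h = _hash64(str(v))
        bucket = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:             # small-range correction
        estimate = m * math.log(m / zeros)
    return estimate

est = approx_count_distinct(f"user_{i}" for i in range(100_000))
print(est)  # close to 100000 (standard error ~1.6% at p=12)
```

The whole state is 4,096 small registers regardless of input size, which is exactly why the approximate query scans cheaply where the exact one must maintain a full distinct set.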
-- Compare accuracy vs cost across 90 days
SELECT
event_date,
APPROX_COUNT_DISTINCT(user_id) AS approx_dau,
COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date
ORDER BY event_date;
JSON_EXTRACT_SCALAR and JSON Functions
-- JSON stored as STRING column
SELECT
event_id,
JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
CAST(JSON_EXTRACT_SCALAR(payload, '$.duration_ms') AS INT64) AS duration_ms,
JSON_EXTRACT_SCALAR(payload, '$.page.url') AS page_url
FROM `myproject.events.raw`;
-- Native JSON type (BigQuery JSON column)
SELECT
event_id,
STRING(payload.user_id) AS user_id,
INT64(payload.duration_ms) AS duration_ms,
STRING(payload.page.url) AS page_url
FROM `myproject.events.structured`;
-- Convert JSON array to BigQuery ARRAY
SELECT
session_id,
JSON_EXTRACT_ARRAY(features_json, '$') AS feature_list
FROM `myproject.ml.training_data`;
Window Functions: ROWS vs RANGE Frames
Window functions perform calculations across a sliding window of rows. The frame clause controls which rows are included.
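The distinction in miniature, on hypothetical (day, revenue) data with a gap: a ROWS frame counts physical rows, while a RANGE frame keeps every row whose ORDER BY value falls inside the window, which matters when days are missing or duplicated.

```python
from datetime import date, timedelta

# (event_day, revenue) — note the gap: May 4 is missing (hypothetical data).
data = [(date(2026, 5, d), r) for d, r in [(1, 10), (2, 20), (3, 30), (5, 50), (6, 60)]]

def rows_frame_sum(rows, preceding):
    """ROWS BETWEEN n PRECEDING AND CURRENT ROW: positional window."""
    return [sum(r for _, r in rows[max(0, i - preceding): i + 1])
            for i in range(len(rows))]

def range_frame_sum(rows, window: timedelta):
    """RANGE over a time interval: value-based window on the ORDER BY key."""
    return [sum(r for d, r in rows if cur - window <= d <= cur)
            for cur, _ in rows]

print(rows_frame_sum(data, 2))                   # [10, 30, 60, 100, 140]
print(range_frame_sum(data, timedelta(days=2)))  # [10, 30, 60, 80, 110]
```

At May 5 the two diverge: the ROWS frame reaches back three physical rows (20 + 30 + 50 = 100), while the RANGE frame only includes days May 3–5 (30 + 50 = 80) because May 4 has no row.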
ROWS Frame (physical row positions)
-- Rolling 7-row sum: 6 preceding rows + current row
-- (equals a 7-day sum only if there is exactly one row per day)
SELECT
event_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM `myproject.sales.daily_summary`;
RANGE Frame (logical value range)
-- RANGE-based frame: all rows with the same ORDER BY value
-- Useful when multiple rows share a timestamp
SELECT
user_id,
event_timestamp,
event_type,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY UNIX_SECONDS(event_timestamp) -- RANGE frames require a numeric ORDER BY key
RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW -- 604800 seconds = 7 days
) AS last_7d_user_revenue
FROM `myproject.analytics.user_events`;
Ranking and Distribution Functions
SELECT
product_id,
category,
revenue,
-- Dense ranking within category
DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS category_rank,
-- Percentile within all products
PERCENT_RANK() OVER (ORDER BY revenue) AS percentile,
-- Divide into 10 buckets (deciles)
NTILE(10) OVER (ORDER BY revenue DESC) AS revenue_decile,
-- Lead/lag for period-over-period
LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month_revenue
FROM `myproject.sales.product_monthly`;
GROUP BY ROLLUP and GROUP BY CUBE
GROUP BY ROLLUP computes subtotals at each grouping level. GROUP BY CUBE computes all possible subtotal combinations.
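Both are shorthand for a collection of grouping sets. This small Python sketch (hypothetical helper names) enumerates the sets each clause expands to, which makes the difference concrete: ROLLUP produces the prefixes of the dimension list, CUBE produces every subset.

```python
from itertools import combinations

def rollup_sets(dims):
    """ROLLUP(a, b, c) expands to prefixes: (a,b,c), (a,b), (a,), ()."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]

def cube_sets(dims):
    """CUBE(a, b, c) expands to every subset of the dimensions."""
    return [s for r in range(len(dims), -1, -1) for s in combinations(dims, r)]

print(rollup_sets(["year", "quarter", "region"]))  # 4 grouping sets
print(len(cube_sets(["region", "category", "channel"])))  # 8 grouping sets
```

For n dimensions, ROLLUP yields n + 1 grouping sets while CUBE yields 2**n, which is why CUBE output (and cost) grows much faster as you add dimensions.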
-- ROLLUP: subtotals per year, then per year+quarter, then grand total
SELECT
EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(QUARTER FROM order_date) AS quarter,
region,
SUM(order_total) AS revenue
FROM `myproject.sales.orders`
GROUP BY ROLLUP(EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date), region)
ORDER BY year, quarter, region;
-- CUBE: every combination of dimensions
SELECT
region,
product_category,
channel,
SUM(revenue) AS total_revenue,
-- GROUPING() returns 1 for rows that are subtotals on that dimension
GROUPING(region) AS is_all_regions,
GROUPING(product_category) AS is_all_categories
FROM `myproject.sales.order_items`
GROUP BY CUBE(region, product_category, channel)
ORDER BY is_all_regions, is_all_categories, region, product_category;
BigQuery Scripting
BigQuery supports multi-statement scripts executed as a single job. Use the console or bq query to run them.
DECLARE and SET
DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
DECLARE end_date DATE DEFAULT CURRENT_DATE();
DECLARE row_count INT64;
SET row_count = (
SELECT COUNT(*)
FROM `myproject.analytics.events`
WHERE event_date BETWEEN start_date AND end_date
);
SELECT CONCAT('Rows in range: ', CAST(row_count AS STRING)) AS message;
IF / ELSE
DECLARE rows_loaded INT64;
SET rows_loaded = (SELECT COUNT(*) FROM `myproject.staging.today_load`);
IF rows_loaded = 0 THEN
RAISE USING MESSAGE = 'No rows loaded today — aborting pipeline';
ELSEIF rows_loaded < 1000 THEN
SELECT CONCAT('Warning: only ', CAST(rows_loaded AS STRING), ' rows loaded') AS warning;
ELSE
INSERT INTO `myproject.analytics.events`
SELECT * FROM `myproject.staging.today_load`;
SELECT CONCAT('Loaded ', CAST(rows_loaded AS STRING), ' rows successfully') AS status;
END IF;
LOOP and WHILE
DECLARE i INT64 DEFAULT 0;
DECLARE batch_date DATE DEFAULT '2026-01-01';
WHILE batch_date <= '2026-05-07' DO
-- Backfill one day at a time to control memory usage
INSERT INTO `myproject.analytics.daily_agg`
SELECT
batch_date AS agg_date,
COUNT(*) AS events,
APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM `myproject.events.raw`
WHERE event_date = batch_date;
SET batch_date = DATE_ADD(batch_date, INTERVAL 1 DAY);
SET i = i + 1;
END WHILE;
SELECT CONCAT('Backfilled ', CAST(i AS STRING), ' days') AS result;
Stored Procedures
Stored procedures are named, reusable scripts stored in a BigQuery dataset.
-- Create the procedure
CREATE OR REPLACE PROCEDURE `myproject.utils.refresh_daily_summary`(
IN target_date DATE,
OUT rows_inserted INT64
)
BEGIN
-- Delete existing data for the target date
DELETE FROM `myproject.analytics.daily_summary`
WHERE summary_date = target_date;
-- Insert fresh aggregation
INSERT INTO `myproject.analytics.daily_summary`
SELECT
target_date AS summary_date,
region,
COUNT(*) AS order_count,
SUM(order_total) AS revenue,
APPROX_COUNT_DISTINCT(user_id) AS unique_buyers
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp) = target_date
GROUP BY region;
-- Return row count to caller
SET rows_inserted = (
SELECT COUNT(*) FROM `myproject.analytics.daily_summary`
WHERE summary_date = target_date
);
END;
-- Call the procedure: OUT arguments bind to script variables, not @query parameters
DECLARE rows_out INT64;
CALL `myproject.utils.refresh_daily_summary`(CURRENT_DATE(), rows_out);
SELECT rows_out AS rows_inserted;
Scheduled Queries
Schedule a query to run on a cron-like schedule directly in BigQuery — no orchestration tool needed for simple use cases.
-- This SQL is the "scheduled query body"
-- Configure schedule via console or bq CLI:
-- bq mk --transfer_config --project_id=myproject \
-- --data_source=scheduled_query \
-- --display_name="Daily revenue refresh" \
-- --schedule="every 24 hours" \
-- --params='{"query":"..."}'
DECLARE today DATE DEFAULT DATE(CURRENT_TIMESTAMP(), 'America/New_York');
DELETE FROM `myproject.analytics.daily_summary` WHERE summary_date = today;
INSERT INTO `myproject.analytics.daily_summary`
SELECT
today AS summary_date,
region,
SUM(order_total) AS revenue
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp, 'America/New_York') = today
GROUP BY region;
BigQuery ML: Models Inside the Warehouse
BigQuery ML lets you train and deploy ML models using SQL — no Python or external infrastructure required.
Linear Regression
CREATE OR REPLACE MODEL `myproject.ml.revenue_forecast`
OPTIONS (
model_type = 'LINEAR_REG',
input_label_cols = ['revenue'],
data_split_method = 'RANDOM',
data_split_eval_fraction = 0.2
) AS
SELECT
EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
EXTRACT(MONTH FROM order_date) AS month,
region,
marketing_spend,
revenue
FROM `myproject.sales.daily_by_region`
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31';
-- Evaluate the model
SELECT * FROM ML.EVALUATE(MODEL `myproject.ml.revenue_forecast`);
-- Run predictions
SELECT * FROM ML.PREDICT(
MODEL `myproject.ml.revenue_forecast`,
(
SELECT
EXTRACT(DAYOFWEEK FROM d) AS day_of_week,
EXTRACT(MONTH FROM d) AS month,
region,
marketing_spend
FROM UNNEST(GENERATE_DATE_ARRAY('2026-05-01', '2026-05-31')) AS d,
UNNEST(['NORTH', 'SOUTH', 'EAST', 'WEST']) AS region
CROSS JOIN (SELECT 50000 AS marketing_spend)
)
);
Logistic Regression (Classification)
CREATE OR REPLACE MODEL `myproject.ml.churn_predictor`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
l2_reg = 0.01
) AS
SELECT
days_since_signup,
total_orders,
avg_order_value,
days_since_last_order,
support_tickets,
churned
FROM `myproject.ml.customer_features`
WHERE partition_date = DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
Geospatial Queries with ST_ Functions
-- Find all stores within 10 km of a warehouse
SELECT
s.store_id,
s.store_name,
ROUND(ST_DISTANCE(s.location, w.location) / 1000, 2) AS distance_km
FROM `myproject.retail.stores` s
CROSS JOIN `myproject.logistics.warehouses` w
WHERE w.warehouse_id = 'WH-BERLIN'
AND ST_DWITHIN(s.location, w.location, 10000)
ORDER BY distance_km;
-- Calculate delivery zone coverage (store buffers)
SELECT
store_id,
ST_AREA(ST_BUFFER(location, 5000)) / 1e6 AS coverage_km2
FROM `myproject.retail.stores`;
-- Spatial join: which sales region does each event come from?
SELECT
e.event_id,
e.event_timestamp,
r.region_name
FROM `myproject.events.geotagged` e
JOIN `myproject.geo.sales_regions` r
ON ST_CONTAINS(r.boundary, e.location);
-- Cluster nearby events into groups
SELECT
ST_CLUSTERDBSCAN(location, 500, 5) OVER () AS cluster_id,
event_id,
event_timestamp
FROM `myproject.events.geotagged`
WHERE event_date = '2026-05-07';
Wildcard Tables
Query multiple tables with a shared prefix using the * wildcard. Common for date-sharded table layouts (old pattern, still seen in many pipelines).
-- Query all daily shards for May 2026
SELECT
event_type,
COUNT(*) AS events
FROM `myproject.events.raw_2026_05_*`
GROUP BY event_type;
-- Use _TABLE_SUFFIX to filter specific shards
SELECT
_TABLE_SUFFIX AS shard_date,
COUNT(*) AS events
FROM `myproject.events.raw_2026_*`
WHERE _TABLE_SUFFIX BETWEEN '05_01' AND '05_07'
GROUP BY shard_date
ORDER BY shard_date;
Prefer proper PARTITION BY tables over date-sharded tables for new projects. Partitioned tables have better metadata, cost controls, and lifecycle management.
INFORMATION_SCHEMA for Cost Monitoring
-- Top 10 most expensive queries in the last 7 days
SELECT
user_email,
SUBSTR(query, 1, 100) AS query_preview,
ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
-- On-demand pricing: $6.25 per TiB (rate at time of writing; check current pricing)
ROUND(6.25 * total_bytes_billed / POW(1024, 4), 2) AS estimated_cost_usd,
ROUND(total_slot_ms / 3600000, 2) AS slot_hours,
creation_time
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 10;
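The dollar arithmetic in that query, isolated: bytes billed divided by 1024⁴ gives TiB, multiplied by an assumed on-demand rate of $6.25 per TiB (verify against the current pricing page). A quick Python sanity check:

```python
TIB = 1024 ** 4
ON_DEMAND_USD_PER_TIB = 6.25  # assumed rate; confirm against BigQuery pricing

def estimated_cost_usd(total_bytes_billed: int) -> float:
    """Rough on-demand cost for a query, mirroring the SQL expression."""
    return round(ON_DEMAND_USD_PER_TIB * total_bytes_billed / TIB, 2)

# A query that billed 2 TiB costs roughly:
print(estimated_cost_usd(2 * TIB))  # 12.5
```

Note this approximates cost from total_bytes_billed only; flat-rate / capacity (slot-based) reservations bill on slot time instead, so slot_hours is the column to watch there.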
-- Daily spend trend
SELECT
DATE(creation_time) AS query_date,
COUNT(*) AS query_count,
ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed,
ROUND(6.25 * SUM(total_bytes_billed) / POW(1024, 4), 2) AS estimated_daily_cost_usd
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
GROUP BY query_date
ORDER BY query_date;
-- Tables accessed most often (potential candidates for caching)
SELECT
referenced_table.dataset_id,
referenced_table.table_id,
COUNT(*) AS query_count,
ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
UNNEST(referenced_tables) AS referenced_table
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY referenced_table.dataset_id, referenced_table.table_id
ORDER BY query_count DESC
LIMIT 20;
Complete Use Case: Cohort Retention Analysis
Cohort retention answers: "Of the users who first used the product in month X, how many were still active in months X+1, X+2, ...?"
This implementation uses BigQuery arrays and window functions to compute the full retention matrix in a single query.
-- Step 1: Identify each user's cohort (first activity month)
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC(MIN(event_date), MONTH) AS cohort_month
FROM `myproject.analytics.events`
WHERE event_date >= '2025-01-01'
AND event_type = 'session_start'
GROUP BY user_id
),
-- Step 2: Get all active months per user
user_activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC(event_date, MONTH) AS activity_month
FROM `myproject.analytics.events`
WHERE event_date >= '2025-01-01'
AND event_type = 'session_start'
),
-- Step 3: Join to get cohort month and period offset
cohort_activity AS (
SELECT
uc.cohort_month,
ua.activity_month,
DATE_DIFF(ua.activity_month, uc.cohort_month, MONTH) AS period_number,
uc.user_id
FROM user_cohorts uc
JOIN user_activity ua USING (user_id)
),
-- Step 4: Count cohort size (period 0 = cohort month itself)
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size
FROM cohort_activity
WHERE period_number = 0
GROUP BY cohort_month
),
-- Step 5: Count retained users per (cohort, period)
retention_counts AS (
SELECT
cohort_month,
period_number,
COUNT(DISTINCT user_id) AS retained_users
FROM cohort_activity
GROUP BY cohort_month, period_number
)
-- Step 6: Compute retention rate and pivot into final matrix
SELECT
r.cohort_month,
s.cohort_size,
r.period_number,
r.retained_users,
ROUND(100.0 * r.retained_users / s.cohort_size, 1) AS retention_pct,
-- Array of retention rates for the cohort (for charting).
-- Note: when ARRAY_AGG is used as a window function, ordering goes in the
-- OVER clause, and the frame must span the whole partition to get the full curve.
ARRAY_AGG(
ROUND(100.0 * r.retained_users / s.cohort_size, 1)
) OVER (
PARTITION BY r.cohort_month
ORDER BY r.period_number
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS retention_curve
FROM retention_counts r
JOIN cohort_sizes s USING (cohort_month)
WHERE r.period_number BETWEEN 0 AND 11 -- first 12 months
ORDER BY r.cohort_month, r.period_number;
Reading the Output
| cohort_month | cohort_size | period_number | retained_users | retention_pct |
|---|---|---|---|---|
| 2025-01-01 | 4,820 | 0 | 4,820 | 100.0 |
| 2025-01-01 | 4,820 | 1 | 2,731 | 56.7 |
| 2025-01-01 | 4,820 | 2 | 1,944 | 40.3 |
| 2025-02-01 | 5,103 | 0 | 5,103 | 100.0 |
| 2025-02-01 | 5,103 | 1 | 2,968 | 58.2 |
The retention_curve array column contains the full series [100.0, 56.7, 40.3, ...] per cohort — easy to pass to a charting library without further transformation.
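The cohort logic itself is simple enough to sanity-check in plain Python on a handful of hypothetical (user_id, month) events before trusting the SQL at scale. Months are plain integers here for brevity:

```python
from collections import defaultdict

# Hypothetical (user_id, activity_month) events; month 0 = first month of data.
events = [
    ("u1", 0), ("u1", 1), ("u1", 2),
    ("u2", 0), ("u2", 2),
    ("u3", 1), ("u3", 2),
]

# Step 1: cohort = each user's first active month.
cohort = {}
for user, month in events:
    cohort[user] = min(month, cohort.get(user, month))

# Steps 3 + 5: distinct users per (cohort_month, period offset).
retained = defaultdict(set)
for user, month in events:
    retained[(cohort[user], month - cohort[user])].add(user)

# Steps 4 + 6: retention % relative to the period-0 cohort size.
matrix = {
    key: round(100.0 * len(users) / len(retained[(key[0], 0)]), 1)
    for key, users in retained.items()
}
print(sorted(matrix.items()))
```

For these events, cohort 0 is {u1, u2}: both are active in period 0 and period 2, but only u1 returns in period 1, so the month-0 curve is 100 % → 50 % → 100 %, illustrating that retention here is "active in that period," not "active in every period so far."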
Extending: Week-over-Week Cohorts
-- Change DATE_TRUNC interval from MONTH to WEEK for weekly cohorts
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS cohort_week
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY user_id
)
-- ... rest of the query substituting cohort_week for cohort_month
-- and DATE_DIFF(..., WEEK) for DATE_DIFF(..., MONTH)
Summary
BigQuery's SQL dialect rewards data engineers who understand its extensions:
- Use UNNEST for arrays — it is not optional, it is idiomatic BigQuery.
- Prefer APPROX_COUNT_DISTINCT over exact COUNT(DISTINCT) in dashboards and monitoring queries.
- Use SAFE.-prefixed functions (plus SAFE_DIVIDE and SAFE_CAST) to harden ETL queries against dirty data.
- Window functions with RANGE frames are perfect for time-series rolling calculations.
- INFORMATION_SCHEMA.JOBS_BY_PROJECT is your first stop when tracking down runaway query costs.
The next lesson covers the BigQuery Python client, pandas integration, and building production ETL pipelines with Airflow and dbt.