Data Engineering · Intermediate

BigQuery SQL Analytics: Advanced Patterns for Data Engineers

Go beyond basic SELECT — master BigQuery's SQL dialect including UNNEST, window functions, scripting, stored procedures, BigQuery ML, geospatial queries, and a complete cohort retention analysis.

Learnixo · May 7, 2026 · 12 min read
Tags: BigQuery · SQL · Analytics · Window Functions · BigQuery ML · Geospatial · Data Engineering

BigQuery SQL: Not Quite Standard SQL

BigQuery uses GoogleSQL (formerly Standard SQL) — largely ANSI-compatible but with important extensions and a few behavioural differences. This lesson covers what makes BigQuery's dialect distinct and how to exploit it in production analytics.

UNNEST: Working with Arrays

In BigQuery, arrays are first-class citizens. UNNEST expands an array column into rows, enabling joins between an array and the rest of the row.

SQL
-- Basic UNNEST: one row per tag
SELECT
    post_id,
    title,
    tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;

-- UNNEST with offset to track array position
SELECT
    post_id,
    tag,
    tag_position
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag WITH OFFSET AS tag_position
ORDER BY post_id, tag_position;

-- Filter: only posts that have the tag 'bigquery'
SELECT post_id, title
FROM `myproject.blog.posts`
WHERE 'bigquery' IN UNNEST(tags);

-- Build a histogram of tag frequency across all posts
SELECT
    tag,
    COUNT(*) AS post_count
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag
GROUP BY tag
ORDER BY post_count DESC
LIMIT 20;

UNNEST on a STRUCT ARRAY

SQL
-- Schema: line_items ARRAY<STRUCT<product_id STRING, qty INT64, unit_price NUMERIC>>
SELECT
    order_id,
    item.product_id,
    item.qty,
    item.unit_price,
    item.qty * item.unit_price AS line_total
FROM `myproject.sales.orders`,
UNNEST(line_items) AS item;

STRUCT Access with Dot Notation

Navigate nested STRUCTs using . — no casting required.

SQL
-- address is STRUCT<street STRING, city STRING, country STRING, postal STRING>
SELECT
    customer_id,
    address.city,
    address.country
FROM `myproject.crm.customers`
WHERE address.country = 'DE'
  AND address.postal LIKE '1%';

-- Reconstruct a STRUCT with modified fields
SELECT
    customer_id,
    STRUCT(
        address.street AS street,
        address.city AS city,
        UPPER(address.country) AS country,   -- normalize country code
        address.postal AS postal
    ) AS address
FROM `myproject.crm.customers`;

SAFE Functions

Prefix a scalar function with SAFE. to return NULL instead of raising an error on invalid input; BigQuery also ships dedicated safe variants such as SAFE_DIVIDE and SAFE_CAST. Invaluable when cleaning messy ingested data.

SQL
-- SAFE_DIVIDE prevents division-by-zero errors
SELECT
    product_id,
    total_revenue,
    order_count,
    SAFE_DIVIDE(total_revenue, order_count) AS avg_order_value
FROM `myproject.sales.product_summary`;

-- SAFE_CAST avoids runtime errors on malformed strings
SELECT
    row_id,
    SAFE_CAST(raw_amount AS NUMERIC) AS amount,
    SAFE_CAST(raw_date AS DATE) AS order_date
FROM `myproject.raw.staging_orders`;

-- SAFE.ST_GEOGFROMTEXT won't crash on invalid WKT strings
SELECT
    location_id,
    SAFE.ST_GEOGFROMTEXT(wkt_geometry) AS geo
FROM `myproject.spatial.raw_locations`;

APPROX_COUNT_DISTINCT

Exact COUNT(DISTINCT ...) requires reading every value and maintaining a hash set. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm, trading a small error (typically around 1%) for far less memory and a fraction of the cost on large datasets.

SQL
-- Exact (expensive on billions of rows)
SELECT COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';

-- Approximate (cheap, fast, good enough for dashboards)
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';

-- Compare accuracy vs cost across 90 days
SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS approx_dau,
    COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date
ORDER BY event_date;

JSON_EXTRACT_SCALAR and JSON Functions

SQL
-- JSON stored as STRING column
SELECT
    event_id,
    JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
    CAST(JSON_EXTRACT_SCALAR(payload, '$.duration_ms') AS INT64) AS duration_ms,
    JSON_EXTRACT_SCALAR(payload, '$.page.url') AS page_url
FROM `myproject.events.raw`;

-- Native JSON type (BigQuery JSON column)
SELECT
    event_id,
    STRING(payload.user_id) AS user_id,
    INT64(payload.duration_ms) AS duration_ms,
    STRING(payload.page.url) AS page_url
FROM `myproject.events.structured`;

-- Convert JSON array to BigQuery ARRAY
SELECT
    session_id,
    JSON_EXTRACT_ARRAY(features_json, '$') AS feature_list
FROM `myproject.ml.training_data`;

Window Functions: ROWS vs RANGE Frames

Window functions perform calculations across a sliding window of rows. The frame clause controls which rows are included.

ROWS Frame (physical row positions)

SQL
-- 7-day rolling sum: 6 preceding rows + the current row (7 rows total)
SELECT
    event_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_revenue
FROM `myproject.sales.daily_summary`;

RANGE Frame (logical value range)

SQL
-- RANGE-based frame: includes every row whose ORDER BY value falls in the range,
-- so rows sharing a timestamp are grouped together.
-- BigQuery RANGE frames require a numeric ORDER BY key, so order by epoch
-- seconds and express the 7-day window as 604800 seconds.
SELECT
    user_id,
    event_timestamp,
    event_type,
    SUM(revenue) OVER (
        PARTITION BY user_id
        ORDER BY UNIX_SECONDS(event_timestamp)
        RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
    ) AS last_7d_user_revenue
FROM `myproject.analytics.user_events`;

Ranking and Distribution Functions

SQL
SELECT
    product_id,
    category,
    revenue,
    -- Dense ranking within category
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS category_rank,
    -- Percentile within all products
    PERCENT_RANK() OVER (ORDER BY revenue) AS percentile,
    -- Divide into 10 buckets (deciles)
    NTILE(10) OVER (ORDER BY revenue DESC) AS revenue_decile,
    -- Lead/lag for period-over-period
    LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
    LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month_revenue
FROM `myproject.sales.product_monthly`;

GROUP BY ROLLUP and CUBE

GROUP BY ROLLUP computes subtotals at each grouping level. GROUP BY CUBE computes all possible subtotal combinations.

SQL
-- ROLLUP: subtotals per year, then per year+quarter, then grand total
SELECT
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(QUARTER FROM order_date) AS quarter,
    region,
    SUM(order_total) AS revenue
FROM `myproject.sales.orders`
GROUP BY ROLLUP(EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date), region)
ORDER BY year, quarter, region;

-- CUBE: every combination of dimensions
SELECT
    region,
    product_category,
    channel,
    SUM(revenue) AS total_revenue,
    -- GROUPING() returns 1 for rows that are subtotals on that dimension
    GROUPING(region) AS is_all_regions,
    GROUPING(product_category) AS is_all_categories
FROM `myproject.sales.order_items`
GROUP BY CUBE(region, product_category, channel)
ORDER BY is_all_regions, is_all_categories, region, product_category;

BigQuery Scripting

BigQuery supports multi-statement scripts executed as a single job. Use the console or bq query to run them.

DECLARE and SET

SQL
DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
DECLARE end_date DATE DEFAULT CURRENT_DATE();
DECLARE row_count INT64;

SET row_count = (
    SELECT COUNT(*)
    FROM `myproject.analytics.events`
    WHERE event_date BETWEEN start_date AND end_date
);

SELECT CONCAT('Rows in range: ', CAST(row_count AS STRING)) AS message;

IF / ELSE

SQL
DECLARE rows_loaded INT64;

SET rows_loaded = (SELECT COUNT(*) FROM `myproject.staging.today_load`);

IF rows_loaded = 0 THEN
    RAISE USING MESSAGE = 'No rows loaded today — aborting pipeline';
ELSEIF rows_loaded < 1000 THEN
    SELECT CONCAT('Warning: only ', CAST(rows_loaded AS STRING), ' rows loaded') AS warning;
ELSE
    INSERT INTO `myproject.analytics.events`
    SELECT * FROM `myproject.staging.today_load`;
    SELECT CONCAT('Loaded ', CAST(rows_loaded AS STRING), ' rows successfully') AS status;
END IF;

LOOP and WHILE

SQL
DECLARE i INT64 DEFAULT 0;
DECLARE batch_date DATE DEFAULT '2026-01-01';

WHILE batch_date <= '2026-05-07' DO
    -- Backfill one day at a time to control memory usage
    INSERT INTO `myproject.analytics.daily_agg`
    SELECT
        batch_date AS agg_date,
        COUNT(*) AS events,
        APPROX_COUNT_DISTINCT(user_id) AS unique_users
    FROM `myproject.events.raw`
    WHERE event_date = batch_date;

    SET batch_date = DATE_ADD(batch_date, INTERVAL 1 DAY);
    SET i = i + 1;
END WHILE;

SELECT CONCAT('Backfilled ', CAST(i AS STRING), ' days') AS result;

Stored Procedures

Stored procedures are named, reusable scripts stored in a BigQuery dataset.

SQL
-- Create the procedure
CREATE OR REPLACE PROCEDURE `myproject.utils.refresh_daily_summary`(
    IN target_date DATE,
    OUT rows_inserted INT64
)
BEGIN
    -- Delete existing data for the target date
    DELETE FROM `myproject.analytics.daily_summary`
    WHERE summary_date = target_date;

    -- Insert fresh aggregation
    INSERT INTO `myproject.analytics.daily_summary`
    SELECT
        target_date AS summary_date,
        region,
        COUNT(*) AS order_count,
        SUM(order_total) AS revenue,
        APPROX_COUNT_DISTINCT(user_id) AS unique_buyers
    FROM `myproject.sales.orders`
    WHERE DATE(order_timestamp) = target_date
    GROUP BY region;

    -- Return row count to caller
    SET rows_inserted = (
        SELECT COUNT(*) FROM `myproject.analytics.daily_summary`
        WHERE summary_date = target_date
    );
END;

-- Call the procedure; the OUT argument must be a script variable
DECLARE rows_out INT64;
CALL `myproject.utils.refresh_daily_summary`(CURRENT_DATE(), rows_out);
SELECT rows_out AS rows_inserted;

Scheduled Queries

Schedule a query to run on a cron-like schedule directly in BigQuery — no orchestration tool needed for simple use cases.

SQL
-- This SQL is the "scheduled query body"
-- Configure schedule via console or bq CLI:
-- bq mk --transfer_config --project_id=myproject \
--   --data_source=scheduled_query \
--   --display_name="Daily revenue refresh" \
--   --schedule="every 24 hours" \
--   --params='{"query":"..."}'

DECLARE today DATE DEFAULT DATE(CURRENT_TIMESTAMP(), 'America/New_York');

DELETE FROM `myproject.analytics.daily_summary` WHERE summary_date = today;

INSERT INTO `myproject.analytics.daily_summary`
SELECT
    today AS summary_date,
    region,
    SUM(order_total) AS revenue
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp, 'America/New_York') = today
GROUP BY region;

BigQuery ML: Models Inside the Warehouse

BigQuery ML lets you train and deploy ML models using SQL — no Python or external infrastructure required.

Linear Regression

SQL
CREATE OR REPLACE MODEL `myproject.ml.revenue_forecast`
OPTIONS (
    model_type = 'LINEAR_REG',
    input_label_cols = ['revenue'],
    data_split_method = 'RANDOM',
    data_split_eval_fraction = 0.2
) AS
SELECT
    EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
    EXTRACT(MONTH FROM order_date) AS month,
    region,
    marketing_spend,
    revenue
FROM `myproject.sales.daily_by_region`
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31';

-- Evaluate the model
SELECT * FROM ML.EVALUATE(MODEL `myproject.ml.revenue_forecast`);

-- Run predictions
SELECT * FROM ML.PREDICT(
    MODEL `myproject.ml.revenue_forecast`,
    (
        SELECT
            EXTRACT(DAYOFWEEK FROM d) AS day_of_week,
            EXTRACT(MONTH FROM d) AS month,
            region,
            marketing_spend
        FROM UNNEST(GENERATE_DATE_ARRAY('2026-05-01', '2026-05-31')) AS d,
        UNNEST(['NORTH', 'SOUTH', 'EAST', 'WEST']) AS region
        CROSS JOIN (SELECT 50000 AS marketing_spend)
    )
);

Logistic Regression (Classification)

SQL
CREATE OR REPLACE MODEL `myproject.ml.churn_predictor`
OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['churned'],
    l2_reg = 0.01
) AS
SELECT
    days_since_signup,
    total_orders,
    avg_order_value,
    days_since_last_order,
    support_tickets,
    churned
FROM `myproject.ml.customer_features`
WHERE partition_date = DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

Geospatial Queries with ST_ Functions

SQL
-- Find all stores within 10 km of a warehouse
SELECT
    s.store_id,
    s.store_name,
    ROUND(ST_DISTANCE(s.location, w.location) / 1000, 2) AS distance_km
FROM `myproject.retail.stores` s
CROSS JOIN `myproject.logistics.warehouses` w
WHERE w.warehouse_id = 'WH-BERLIN'
  AND ST_DWITHIN(s.location, w.location, 10000)
ORDER BY distance_km;

-- Calculate delivery zone coverage (store buffers)
SELECT
    store_id,
    ST_AREA(ST_BUFFER(location, 5000)) / 1e6 AS coverage_km2
FROM `myproject.retail.stores`;

-- Spatial join: which sales region does each event come from?
SELECT
    e.event_id,
    e.event_timestamp,
    r.region_name
FROM `myproject.events.geotagged` e
JOIN `myproject.geo.sales_regions` r
  ON ST_CONTAINS(r.boundary, e.location);

-- Cluster nearby events into groups
SELECT
    ST_CLUSTERDBSCAN(location, 500, 5) OVER () AS cluster_id,
    event_id,
    event_timestamp
FROM `myproject.events.geotagged`
WHERE event_date = '2026-05-07';

Wildcard Tables

Query multiple tables with a shared prefix using the * wildcard. Common for date-sharded table layouts (old pattern, still seen in many pipelines).

SQL
-- Query all daily shards for May 2026
SELECT
    event_type,
    COUNT(*) AS events
FROM `myproject.events.raw_2026_05_*`
GROUP BY event_type;

-- Use _TABLE_SUFFIX to filter specific shards
SELECT
    _TABLE_SUFFIX AS shard_date,
    COUNT(*) AS events
FROM `myproject.events.raw_2026_*`
WHERE _TABLE_SUFFIX BETWEEN '05_01' AND '05_07'
GROUP BY shard_date
ORDER BY shard_date;

For new projects, prefer partitioned tables (PARTITION BY) over date-sharded tables: they have better metadata, cost controls, and lifecycle management.
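As a rough sketch of the partitioned alternative (the table name, columns, and options here are illustrative, not a required schema):

SQL
-- Partitioned equivalent of the date-sharded layout above (illustrative schema)
CREATE TABLE `myproject.events.raw_partitioned` (
    event_id STRING,
    event_type STRING,
    user_id STRING,
    event_date DATE
)
PARTITION BY event_date
OPTIONS (
    require_partition_filter = TRUE,   -- force queries to prune partitions
    partition_expiration_days = 400    -- auto-expire old partitions
);

-- The wildcard + _TABLE_SUFFIX dance becomes a plain date filter
SELECT
    event_type,
    COUNT(*) AS events
FROM `myproject.events.raw_partitioned`
WHERE event_date BETWEEN '2026-05-01' AND '2026-05-07'
GROUP BY event_type;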

INFORMATION_SCHEMA for Cost Monitoring

SQL
-- Top 10 most expensive queries in the last 7 days
SELECT
    user_email,
    SUBSTR(query, 1, 100) AS query_preview,
    ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
    -- On-demand pricing: $6.25 per TiB
    ROUND(6.25 * total_bytes_billed / POW(1024, 4), 2) AS estimated_cost_usd,
    ROUND(total_slot_ms / 3600000, 2) AS slot_hours,
    creation_time
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 10;

-- Daily spend trend
SELECT
    DATE(creation_time) AS query_date,
    COUNT(*) AS query_count,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed,
    ROUND(6.25 * SUM(total_bytes_billed) / POW(1024, 4), 2) AS estimated_daily_cost_usd
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY query_date
ORDER BY query_date;

-- Tables accessed most often (potential candidates for caching)
SELECT
    referenced_table.dataset_id,
    referenced_table.table_id,
    COUNT(*) AS query_count,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
UNNEST(referenced_tables) AS referenced_table
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY referenced_table.dataset_id, referenced_table.table_id
ORDER BY query_count DESC
LIMIT 20;

Complete Use Case: Cohort Retention Analysis

Cohort retention answers: "Of the users who first used the product in month X, how many were still active in months X+1, X+2, ...?"

This implementation uses BigQuery arrays and window functions to compute the full retention matrix in a single query.

SQL
-- Step 1: Identify each user's cohort (first activity month)
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC(MIN(event_date), MONTH) AS cohort_month
    FROM `myproject.analytics.events`
    WHERE event_date >= '2025-01-01'
      AND event_type = 'session_start'
    GROUP BY user_id
),

-- Step 2: Get all active months per user
user_activity AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC(event_date, MONTH) AS activity_month
    FROM `myproject.analytics.events`
    WHERE event_date >= '2025-01-01'
      AND event_type = 'session_start'
),

-- Step 3: Join to get cohort month and period offset
cohort_activity AS (
    SELECT
        uc.cohort_month,
        ua.activity_month,
        DATE_DIFF(ua.activity_month, uc.cohort_month, MONTH) AS period_number,
        uc.user_id
    FROM user_cohorts uc
    JOIN user_activity ua USING (user_id)
),

-- Step 4: Count cohort size (period 0 = cohort month itself)
cohort_sizes AS (
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM cohort_activity
    WHERE period_number = 0
    GROUP BY cohort_month
),

-- Step 5: Count retained users per (cohort, period)
retention_counts AS (
    SELECT
        cohort_month,
        period_number,
        COUNT(DISTINCT user_id) AS retained_users
    FROM cohort_activity
    GROUP BY cohort_month, period_number
)

-- Step 6: Compute retention rate and pivot into final matrix
SELECT
    r.cohort_month,
    s.cohort_size,
    r.period_number,
    r.retained_users,
    ROUND(100.0 * r.retained_users / s.cohort_size, 1) AS retention_pct,
    -- Array of retention rates for the cohort (for charting).
    -- BigQuery disallows ORDER BY inside ARRAY_AGG when it is used as a window
    -- function, so order via the window spec with an unbounded frame instead.
    ARRAY_AGG(
        ROUND(100.0 * r.retained_users / s.cohort_size, 1)
    ) OVER (
        PARTITION BY r.cohort_month
        ORDER BY r.period_number
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS retention_curve
FROM retention_counts r
JOIN cohort_sizes s USING (cohort_month)
WHERE r.period_number BETWEEN 0 AND 11  -- first 12 months
ORDER BY r.cohort_month, r.period_number;

Reading the Output

| cohort_month | cohort_size | period_number | retained_users | retention_pct |
|---|---|---|---|---|
| 2025-01-01 | 4,820 | 0 | 4,820 | 100.0 |
| 2025-01-01 | 4,820 | 1 | 2,731 | 56.7 |
| 2025-01-01 | 4,820 | 2 | 1,944 | 40.3 |
| 2025-02-01 | 5,103 | 0 | 5,103 | 100.0 |
| 2025-02-01 | 5,103 | 1 | 2,968 | 58.2 |

The retention_curve array column contains the full series [100.0, 56.7, 40.3, ...] per cohort — easy to pass to a charting library without further transformation.
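If a classic retention-triangle layout is easier to scan, the same output can be pivoted with conditional aggregation. A minimal sketch, assuming the query above has been saved as a table or view named `myproject.analytics.retention_matrix` (a name introduced here purely for illustration):

SQL
-- Pivot the long-format matrix into one row per cohort (first four periods shown)
SELECT
    cohort_month,
    ANY_VALUE(cohort_size) AS cohort_size,
    MAX(IF(period_number = 0, retention_pct, NULL)) AS month_0_pct,
    MAX(IF(period_number = 1, retention_pct, NULL)) AS month_1_pct,
    MAX(IF(period_number = 2, retention_pct, NULL)) AS month_2_pct,
    MAX(IF(period_number = 3, retention_pct, NULL)) AS month_3_pct
FROM `myproject.analytics.retention_matrix`
GROUP BY cohort_month
ORDER BY cohort_month;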

Extending: Week-over-Week Cohorts

SQL
-- Change DATE_TRUNC interval from MONTH to WEEK for weekly cohorts
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS cohort_week
    FROM `myproject.analytics.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY user_id
)
-- ... rest of the query substituting cohort_week for cohort_month
-- and DATE_DIFF(..., WEEK) for DATE_DIFF(..., MONTH)

Summary

BigQuery's SQL dialect rewards data engineers who understand its extensions:

  • Use UNNEST for arrays — it is not optional, it is idiomatic BigQuery.
  • Prefer APPROX_COUNT_DISTINCT over exact COUNT(DISTINCT) in dashboards and monitoring queries.
  • Use SAFE. prefix functions to harden ETL queries against dirty data.
  • Window functions with RANGE frames over a numeric ORDER BY key (such as epoch seconds) are perfect for time-series rolling calculations.
  • INFORMATION_SCHEMA.JOBS_BY_PROJECT is your first stop when tracking down runaway query costs.

The next lesson covers the BigQuery Python client, pandas integration, and building production ETL pipelines with Airflow and dbt.

Enjoyed this article?

Explore the Data Engineering learning path for more.
