Learnixo

BigQuery for Data Engineers · Lesson 2 of 3

BigQuery SQL: UNNEST, Analytics, Scripting & ML

BigQuery SQL: Not Quite Standard SQL

BigQuery uses GoogleSQL (formerly Standard SQL) — largely ANSI-compatible but with important extensions and a few behavioural differences. This lesson covers what makes BigQuery's dialect distinct and how to exploit it in production analytics.

UNNEST: Working with Arrays

In BigQuery, arrays are first-class citizens. UNNEST expands an array column into rows, enabling joins between an array and the rest of the row.

SQL
-- Basic UNNEST: one row per tag
SELECT
    post_id,
    title,
    tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;

-- UNNEST with offset to track array position
SELECT
    post_id,
    tag,
    tag_position
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag WITH OFFSET AS tag_position
ORDER BY post_id, tag_position;

-- Filter: only posts that have the tag 'bigquery'
SELECT post_id, title
FROM `myproject.blog.posts`
WHERE 'bigquery' IN UNNEST(tags);

-- Build a histogram of tag frequency across all posts
SELECT
    tag,
    COUNT(*) AS post_count
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag
GROUP BY tag
ORDER BY post_count DESC
LIMIT 20;

UNNEST on a STRUCT ARRAY

SQL
-- Schema: line_items ARRAY<STRUCT<product_id STRING, qty INT64, unit_price NUMERIC>>
SELECT
    order_id,
    item.product_id,
    item.qty,
    item.unit_price,
    item.qty * item.unit_price AS line_total
FROM `myproject.sales.orders`,
UNNEST(line_items) AS item;

STRUCT Access with Dot Notation

Navigate nested STRUCTs using . — no casting required.

SQL
-- address is STRUCT<street STRING, city STRING, country STRING, postal STRING>
SELECT
    customer_id,
    address.city,
    address.country
FROM `myproject.crm.customers`
WHERE address.country = 'DE'
  AND address.postal LIKE '1%';

-- Reconstruct a STRUCT with modified fields
SELECT
    customer_id,
    STRUCT(
        address.street AS street,
        address.city AS city,
        UPPER(address.country) AS country,   -- normalize country code
        address.postal AS postal
    ) AS address
FROM `myproject.crm.customers`;

SAFE Functions

Prefix any function with SAFE. to return NULL instead of raising an error on invalid input. Invaluable when cleaning messy ingested data.

SQL
-- SAFE.DIVIDE prevents division-by-zero errors
SELECT
    product_id,
    total_revenue,
    order_count,
    SAFE_DIVIDE(total_revenue, order_count) AS avg_order_value
FROM `myproject.sales.product_summary`;

-- SAFE.CAST avoids runtime errors on malformed strings
SELECT
    row_id,
    SAFE_CAST(raw_amount AS NUMERIC) AS amount,
    SAFE_CAST(raw_date AS DATE) AS order_date
FROM `myproject.raw.staging_orders`;

-- SAFE.ST_GEOGFROMTEXT won't crash on invalid WKT strings
SELECT
    location_id,
    SAFE.ST_GEOGFROMTEXT(wkt_geometry) AS geo
FROM `myproject.spatial.raw_locations`;

APPROX_COUNT_DISTINCT

Exact COUNT(DISTINCT ...) requires reading every value and maintaining a hash set. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm — 1 % error, 40× faster on large datasets.

SQL
-- Exact (expensive on billions of rows)
SELECT COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';

-- Approximate (cheap, fast, good enough for dashboards)
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-07';

-- Compare accuracy vs cost across 90 days
SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS approx_dau,
    COUNT(DISTINCT user_id) AS exact_dau
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date
ORDER BY event_date;

JSON_EXTRACT_SCALAR and JSON Functions

SQL
-- JSON stored as STRING column
SELECT
    event_id,
    JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
    CAST(JSON_EXTRACT_SCALAR(payload, '$.duration_ms') AS INT64) AS duration_ms,
    JSON_EXTRACT_SCALAR(payload, '$.page.url') AS page_url
FROM `myproject.events.raw`;

-- Native JSON type (BigQuery JSON column)
SELECT
    event_id,
    STRING(payload.user_id) AS user_id,
    INT64(payload.duration_ms) AS duration_ms,
    STRING(payload.page.url) AS page_url
FROM `myproject.events.structured`;

-- Convert JSON array to BigQuery ARRAY
SELECT
    session_id,
    JSON_EXTRACT_ARRAY(features_json, '$') AS feature_list
FROM `myproject.ml.training_data`;

Window Functions: ROWS vs RANGE Frames

Window functions perform calculations across a sliding window of rows. The frame clause controls which rows are included.

ROWS Frame (physical row positions)

SQL
-- 7-day rolling sum: exactly 7 preceding rows + current row
SELECT
    event_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_revenue
FROM `myproject.sales.daily_summary`;

RANGE Frame (logical value range)

SQL
-- RANGE-based frame: all rows with the same ORDER BY value
-- Useful when multiple rows share a timestamp
SELECT
    user_id,
    event_timestamp,
    event_type,
    SUM(revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_timestamp
        RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW
    ) AS last_7d_user_revenue
FROM `myproject.analytics.user_events`;

Ranking and Distribution Functions

SQL
SELECT
    product_id,
    category,
    revenue,
    -- Dense ranking within category
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS category_rank,
    -- Percentile within all products
    PERCENT_RANK() OVER (ORDER BY revenue) AS percentile,
    -- Divide into 10 buckets (deciles)
    NTILE(10) OVER (ORDER BY revenue DESC) AS revenue_decile,
    -- Lead/lag for period-over-period
    LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
    LEAD(revenue, 1) OVER (PARTITION BY product_id ORDER BY month) AS next_month_revenue
FROM `myproject.sales.product_monthly`;

WITH ROLLUP and WITH CUBE

GROUP BY ROLLUP computes subtotals at each grouping level. GROUP BY CUBE computes all possible subtotal combinations.

SQL
-- ROLLUP: subtotals per year, then per year+quarter, then grand total
SELECT
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(QUARTER FROM order_date) AS quarter,
    region,
    SUM(order_total) AS revenue
FROM `myproject.sales.orders`
GROUP BY ROLLUP(EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date), region)
ORDER BY year, quarter, region;

-- CUBE: every combination of dimensions
SELECT
    region,
    product_category,
    channel,
    SUM(revenue) AS total_revenue,
    -- GROUPING() returns 1 for rows that are subtotals on that dimension
    GROUPING(region) AS is_all_regions,
    GROUPING(product_category) AS is_all_categories
FROM `myproject.sales.order_items`
GROUP BY CUBE(region, product_category, channel)
ORDER BY is_all_regions, is_all_categories, region, product_category;

BigQuery Scripting

BigQuery supports multi-statement scripts executed as a single job. Use the console or bq query to run them.

DECLARE and SET

SQL
DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
DECLARE end_date DATE DEFAULT CURRENT_DATE();
DECLARE row_count INT64;

SET row_count = (
    SELECT COUNT(*)
    FROM `myproject.analytics.events`
    WHERE event_date BETWEEN start_date AND end_date
);

SELECT CONCAT('Rows in range: ', CAST(row_count AS STRING)) AS message;

IF / ELSE

SQL
DECLARE rows_loaded INT64;

SET rows_loaded = (SELECT COUNT(*) FROM `myproject.staging.today_load`);

IF rows_loaded = 0 THEN
    RAISE USING MESSAGE = 'No rows loaded today — aborting pipeline';
ELSEIF rows_loaded < 1000 THEN
    SELECT CONCAT('Warning: only ', CAST(rows_loaded AS STRING), ' rows loaded') AS warning;
ELSE
    INSERT INTO `myproject.analytics.events`
    SELECT * FROM `myproject.staging.today_load`;
    SELECT CONCAT('Loaded ', CAST(rows_loaded AS STRING), ' rows successfully') AS status;
END IF;

LOOP and WHILE

SQL
DECLARE i INT64 DEFAULT 0;
DECLARE batch_date DATE DEFAULT '2026-01-01';

WHILE batch_date <= '2026-05-07' DO
    -- Backfill one day at a time to control memory usage
    INSERT INTO `myproject.analytics.daily_agg`
    SELECT
        batch_date AS agg_date,
        COUNT(*) AS events,
        APPROX_COUNT_DISTINCT(user_id) AS unique_users
    FROM `myproject.events.raw`
    WHERE event_date = batch_date;

    SET batch_date = DATE_ADD(batch_date, INTERVAL 1 DAY);
    SET i = i + 1;
END WHILE;

SELECT CONCAT('Backfilled ', CAST(i AS STRING), ' days') AS result;

Stored Procedures

Stored procedures are named, reusable scripts stored in a BigQuery dataset.

SQL
-- Create the procedure
CREATE OR REPLACE PROCEDURE `myproject.utils.refresh_daily_summary`(
    IN target_date DATE,
    OUT rows_inserted INT64
)
BEGIN
    -- Delete existing data for the target date
    DELETE FROM `myproject.analytics.daily_summary`
    WHERE summary_date = target_date;

    -- Insert fresh aggregation
    INSERT INTO `myproject.analytics.daily_summary`
    SELECT
        target_date AS summary_date,
        region,
        COUNT(*) AS order_count,
        SUM(order_total) AS revenue,
        APPROX_COUNT_DISTINCT(user_id) AS unique_buyers
    FROM `myproject.sales.orders`
    WHERE DATE(order_timestamp) = target_date
    GROUP BY region;

    -- Return row count to caller
    SET rows_inserted = (
        SELECT COUNT(*) FROM `myproject.analytics.daily_summary`
        WHERE summary_date = target_date
    );
END;

-- Call the procedure
CALL `myproject.utils.refresh_daily_summary`(CURRENT_DATE(), @rows_out);
SELECT @rows_out AS rows_inserted;

Scheduled Queries

Schedule a query to run on a cron-like schedule directly in BigQuery — no orchestration tool needed for simple use cases.

SQL
-- This SQL is the "scheduled query body"
-- Configure schedule via console or bq CLI:
-- bq mk --transfer_config --project_id=myproject \
--   --data_source=scheduled_query \
--   --display_name="Daily revenue refresh" \
--   --schedule="every 24 hours" \
--   --params='{"query":"..."}'

DECLARE today DATE DEFAULT DATE(CURRENT_TIMESTAMP(), 'America/New_York');

DELETE FROM `myproject.analytics.daily_summary` WHERE summary_date = today;

INSERT INTO `myproject.analytics.daily_summary`
SELECT
    today AS summary_date,
    region,
    SUM(order_total) AS revenue
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp, 'America/New_York') = today
GROUP BY region;

BigQuery ML: Models Inside the Warehouse

BigQuery ML lets you train and deploy ML models using SQL — no Python or external infrastructure required.

Linear Regression

SQL
CREATE OR REPLACE MODEL `myproject.ml.revenue_forecast`
OPTIONS (
    model_type = 'LINEAR_REG',
    input_label_cols = ['revenue'],
    data_split_method = 'RANDOM',
    data_split_eval_fraction = 0.2
) AS
SELECT
    EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
    EXTRACT(MONTH FROM order_date) AS month,
    region,
    marketing_spend,
    revenue
FROM `myproject.sales.daily_by_region`
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31';

-- Evaluate the model
SELECT * FROM ML.EVALUATE(MODEL `myproject.ml.revenue_forecast`);

-- Run predictions
SELECT * FROM ML.PREDICT(
    MODEL `myproject.ml.revenue_forecast`,
    (
        SELECT
            EXTRACT(DAYOFWEEK FROM d) AS day_of_week,
            EXTRACT(MONTH FROM d) AS month,
            region,
            marketing_spend
        FROM UNNEST(GENERATE_DATE_ARRAY('2026-05-01', '2026-05-31')) AS d,
        UNNEST(['NORTH', 'SOUTH', 'EAST', 'WEST']) AS region
        CROSS JOIN (SELECT 50000 AS marketing_spend)
    )
);

Logistic Classification

SQL
CREATE OR REPLACE MODEL `myproject.ml.churn_predictor`
OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['churned'],
    l2_reg = 0.01
) AS
SELECT
    days_since_signup,
    total_orders,
    avg_order_value,
    days_since_last_order,
    support_tickets,
    churned
FROM `myproject.ml.customer_features`
WHERE partition_date = DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

Geospatial Queries with ST_ Functions

SQL
-- Find all stores within 10 km of a warehouse
SELECT
    s.store_id,
    s.store_name,
    ROUND(ST_DISTANCE(s.location, w.location) / 1000, 2) AS distance_km
FROM `myproject.retail.stores` s
CROSS JOIN `myproject.logistics.warehouses` w
WHERE w.warehouse_id = 'WH-BERLIN'
  AND ST_DWithin(s.location, w.location, 10000)
ORDER BY distance_km;

-- Calculate delivery zone coverage (store buffers)
SELECT
    store_id,
    ST_AREA(ST_BUFFER(location, 5000)) / 1e6 AS coverage_km2
FROM `myproject.retail.stores`;

-- Spatial join: which sales region does each event come from?
SELECT
    e.event_id,
    e.event_timestamp,
    r.region_name
FROM `myproject.events.geotagged` e
JOIN `myproject.geo.sales_regions` r
  ON ST_CONTAINS(r.boundary, e.location);

-- Cluster nearby events into groups
SELECT
    ST_CLUSTERDBSCAN(location, 500, 5) OVER () AS cluster_id,
    event_id,
    event_timestamp
FROM `myproject.events.geotagged`
WHERE event_date = '2026-05-07';

Wildcard Tables

Query multiple tables with a shared prefix using the * wildcard. Common for date-sharded table layouts (old pattern, still seen in many pipelines).

SQL
-- Query all daily shards for May 2026
SELECT
    event_type,
    COUNT(*) AS events
FROM `myproject.events.raw_2026_05_*`
GROUP BY event_type;

-- Use _TABLE_SUFFIX to filter specific shards
SELECT
    _TABLE_SUFFIX AS shard_date,
    COUNT(*) AS events
FROM `myproject.events.raw_2026_*`
WHERE _TABLE_SUFFIX BETWEEN '05_01' AND '05_07'
GROUP BY shard_date
ORDER BY shard_date;

Prefer proper PARTITION BY tables over date-sharded tables for new projects. Partitioned tables have better metadata, cost controls, and lifecycle management.

INFORMATION_SCHEMA for Cost Monitoring

SQL
-- Top 10 most expensive queries in the last 7 days
SELECT
    user_email,
    SUBSTR(query, 1, 100) AS query_preview,
    ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
    -- On-demand pricing: $6.25 per TB
    ROUND(6.25 * total_bytes_billed / POW(1024, 4), 2) AS estimated_cost_usd,
    ROUND(total_slot_ms / 3600000, 2) AS slot_hours,
    creation_time
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 10;

-- Daily spend trend
SELECT
    DATE(creation_time) AS query_date,
    COUNT(*) AS query_count,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed,
    ROUND(6.25 * SUM(total_bytes_billed) / POW(1024, 4), 2) AS estimated_daily_cost_usd
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY query_date
ORDER BY query_date;

-- Tables accessed most often (potential candidates for caching)
SELECT
    referenced_table.dataset_id,
    referenced_table.table_id,
    COUNT(*) AS query_count,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
UNNEST(referenced_tables) AS referenced_table
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY referenced_table.dataset_id, referenced_table.table_id
ORDER BY query_count DESC
LIMIT 20;

Complete Use Case: Cohort Retention Analysis

Cohort retention answers: "Of the users who first used the product in month X, how many were still active in months X+1, X+2, ...?"

This implementation uses BigQuery arrays and window functions to compute the full retention matrix in a single query.

SQL
-- Step 1: Identify each user's cohort (first activity month)
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC(MIN(event_date), MONTH) AS cohort_month
    FROM `myproject.analytics.events`
    WHERE event_date >= '2025-01-01'
      AND event_type = 'session_start'
    GROUP BY user_id
),

-- Step 2: Get all active months per user
user_activity AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC(event_date, MONTH) AS activity_month
    FROM `myproject.analytics.events`
    WHERE event_date >= '2025-01-01'
      AND event_type = 'session_start'
),

-- Step 3: Join to get cohort month and period offset
cohort_activity AS (
    SELECT
        uc.cohort_month,
        ua.activity_month,
        DATE_DIFF(ua.activity_month, uc.cohort_month, MONTH) AS period_number,
        uc.user_id
    FROM user_cohorts uc
    JOIN user_activity ua USING (user_id)
),

-- Step 4: Count cohort size (period 0 = cohort month itself)
cohort_sizes AS (
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM cohort_activity
    WHERE period_number = 0
    GROUP BY cohort_month
),

-- Step 5: Count retained users per (cohort, period)
retention_counts AS (
    SELECT
        cohort_month,
        period_number,
        COUNT(DISTINCT user_id) AS retained_users
    FROM cohort_activity
    GROUP BY cohort_month, period_number
)

-- Step 6: Compute retention rate and pivot into final matrix
SELECT
    r.cohort_month,
    s.cohort_size,
    r.period_number,
    r.retained_users,
    ROUND(100.0 * r.retained_users / s.cohort_size, 1) AS retention_pct,
    -- Array of retention rates for the cohort (for charting)
    ARRAY_AGG(
        ROUND(100.0 * r.retained_users / s.cohort_size, 1)
        ORDER BY r.period_number
    ) OVER (PARTITION BY r.cohort_month) AS retention_curve
FROM retention_counts r
JOIN cohort_sizes s USING (cohort_month)
WHERE r.period_number BETWEEN 0 AND 11  -- first 12 months
ORDER BY r.cohort_month, r.period_number;

Reading the Output

| cohort_month | cohort_size | period_number | retained_users | retention_pct | |---|---|---|---|---| | 2025-01-01 | 4,820 | 0 | 4,820 | 100.0 | | 2025-01-01 | 4,820 | 1 | 2,731 | 56.7 | | 2025-01-01 | 4,820 | 2 | 1,944 | 40.3 | | 2025-02-01 | 5,103 | 0 | 5,103 | 100.0 | | 2025-02-01 | 5,103 | 1 | 2,968 | 58.2 |

The retention_curve array column contains the full series [100.0, 56.7, 40.3, ...] per cohort — easy to pass to a charting library without further transformation.

Extending: Week-over-Week Cohorts

SQL
-- Change DATE_TRUNC interval from MONTH to WEEK for weekly cohorts
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS cohort_week
    FROM `myproject.analytics.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY user_id
)
-- ... rest of the query substituting cohort_week for cohort_month
-- and DATE_DIFF(..., WEEK) for DATE_DIFF(..., MONTH)

Summary

BigQuery's SQL dialect rewards data engineers who understand its extensions:

  • Use UNNEST for arrays — it is not optional, it is idiomatic BigQuery.
  • Prefer APPROX_COUNT_DISTINCT over exact COUNT(DISTINCT) in dashboards and monitoring queries.
  • Use SAFE. prefix functions to harden ETL queries against dirty data.
  • Window functions with RANGE BETWEEN INTERVAL are perfect for time-series rolling calculations.
  • INFORMATION_SCHEMA.JOBS_BY_PROJECT is your first stop when tracking down runaway query costs.

The next lesson covers the BigQuery Python client, pandas integration, and building production ETL pipelines with Airflow and dbt.