Data Engineering · Beginner

BigQuery Fundamentals for Data Engineers

Master Google BigQuery from the ground up — architecture, storage, partitioning, clustering, loading strategies, and the full SQL feature set data engineers use in production.

Learnixo · May 7, 2026 · 12 min read
BigQuery · GCP · SQL · Data Engineering · Columnar Storage · Partitioning · Data Warehouse

What Makes BigQuery Different

BigQuery is Google's fully managed, serverless data warehouse. You do not provision clusters, tune memory settings, or manage storage — you write SQL and Google handles the rest. That serverless model changes the economics of analytics: storage and compute are billed separately, idle time costs nothing, and a single query can scan petabytes in seconds.

Three engineering decisions underpin every BigQuery performance and cost conversation:

  1. Columnar storage via Capacitor — data is stored in Google's proprietary columnar format, so a query reading 3 of 50 columns touches roughly 6% of the data a row store would read (see the example after this list).
  2. Dremel execution engine — queries fan out across thousands of worker nodes automatically; you observe the parallelism only as fast results.
  3. Separate compute and storage — compute scales to zero when idle; you pay for storage (compressed bytes) and compute (bytes scanned by queries).
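
Decisions 1 and 3 show up directly in the bill: bytes scanned track the columns a query references, not the table's total size. A minimal illustration, using a hypothetical events table of the same shape as examples later in this post:

SQL
-- Scans and bills only the user_id and event_date columns,
-- regardless of how many other columns the table has
SELECT user_id, event_date
FROM `myproject.mydataset.events`
WHERE event_date = '2026-01-01';

-- SELECT * reads every column and is billed for all of them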

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     BigQuery Service                     │
│                                                          │
│  ┌──────────┐   ┌─────────────┐   ┌──────────────────┐  │
│  │  Dremel  │   │   Colossus  │   │  Jupiter Network │  │
│  │ (compute)│   │  (storage)  │   │  (1+ Pb/s xfer)  │  │
│  └────┬─────┘   └──────┬──────┘   └────────┬─────────┘  │
│       │                │                    │            │
│       └────────────────┴────────────────────┘            │
└─────────────────────────────────────────────────────────┘
         ▲                             ▲
         │ SQL queries                 │ Streaming / batch load
    Client tools                  Source data

Colossus is Google's distributed file system — it stores the Capacitor-formatted column files. Because compute (Dremel) and storage (Colossus) communicate over the ultra-fast Jupiter network, there is no I/O bottleneck between a "compute node" and "its" disk. This decoupling is a large part of why BigQuery keeps scaling where traditional MPP systems, whose compute nodes own local disks, hit redistribution and rebalancing limits.

Resource Hierarchy: Projects, Datasets, Tables

Every BigQuery resource lives inside a three-level hierarchy:

GCP Project  (billing, IAM, quotas)
└── Dataset  (namespace, access controls, location, default TTL)
    ├── Table          (native or external data)
    ├── View           (saved SQL query, virtual table)
    └── Materialized View (cached query result, auto-refreshed)

  • A project holds the billing account and is the unit for quota enforcement.
  • A dataset is a namespace. Tables inside a dataset inherit its region (US, EU, asia-northeast1, etc.). You cannot join tables across regions in a single query.
  • A table is the actual data container.

Fully qualified table references follow the pattern: project_id.dataset_id.table_id.

Interacting with BigQuery

BigQuery Console

The web UI at console.cloud.google.com/bigquery supports ad-hoc queries, schema browsing, and job history. Useful for exploration; avoid it for automated pipelines.

bq CLI

The bq command-line tool ships with the Google Cloud SDK.

Bash
# Authenticate
gcloud auth application-default login

# Run a query (prints results to stdout)
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`myproject.mydataset.orders\`"

# Show table schema
bq show --schema --format=prettyjson myproject:mydataset.orders

# Load CSV from Cloud Storage
bq load \
  --source_format=CSV \
  --autodetect \
  myproject:mydataset.orders \
  gs://my-bucket/orders/*.csv

# Export table to Cloud Storage as Parquet
bq extract \
  --destination_format=PARQUET \
  myproject:mydataset.orders \
  gs://my-bucket/export/orders_*.parquet

# Create a dataset in a specific region
bq mk --location=EU myproject:analytics

# Estimate bytes scanned before running (dry run)
bq query --use_legacy_sql=false --dry_run \
  "SELECT * FROM \`myproject.mydataset.events\` WHERE event_date = '2026-01-01'"

Python Client

Python
from google.cloud import bigquery

# Uses Application Default Credentials (ADC), the recommended auth in Cloud environments
client = bigquery.Client(project="myproject")

# Run a query
query = """
    SELECT
        user_id,
        COUNT(*) AS session_count
    FROM `myproject.analytics.sessions`
    WHERE session_date BETWEEN '2026-01-01' AND '2026-01-31'
    GROUP BY user_id
    ORDER BY session_count DESC
    LIMIT 100
"""

results = client.query(query).to_dataframe()
print(results.head())

Data Types

Primitive Types

| Type | Notes |
|------|-------|
| INT64 / INTEGER | 64-bit signed integer |
| FLOAT64 / FLOAT | Double-precision float |
| NUMERIC | 38 digits of precision, 9 decimal scale — use for money |
| BIGNUMERIC | 76 digits of precision, 38 decimal scale |
| STRING | UTF-8, up to 16 MB |
| BYTES | Raw bytes |
| BOOL | true / false |
| DATE | Calendar date, no time |
| TIME | Time of day, no date |
| DATETIME | Date + time, no timezone |
| TIMESTAMP | Absolute point in time (UTC internally) |
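
The "use for money" advice is easy to demonstrate: FLOAT64 arithmetic accumulates binary rounding error, while NUMERIC stays exact:

SQL
SELECT
    0.1 + 0.2 AS float_sum,                     -- 0.30000000000000004 (FLOAT64)
    NUMERIC '0.1' + NUMERIC '0.2' AS exact_sum; -- 0.3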

STRUCT (Record Type)

A STRUCT is an ordered collection of typed fields — like a row embedded inside a column:

SQL
-- Table schema contains a STRUCT column named "address"
SELECT
    customer_id,
    address.street,
    address.city,
    address.postal_code
FROM `myproject.crm.customers`
WHERE address.country = 'DE';

You can also construct STRUCTs inline:

SQL
SELECT
    order_id,
    STRUCT(
        ship_to_name AS name,
        ship_to_street AS street,
        ship_to_city AS city
    ) AS shipping_address
FROM `myproject.sales.orders`;

ARRAY Type

ARRAYs hold zero or more values of the same type. Every row can have a different number of elements.

SQL
-- Select the first tag from an array column
SELECT
    post_id,
    tags[OFFSET(0)] AS primary_tag,
    ARRAY_LENGTH(tags) AS tag_count
FROM `myproject.blog.posts`;

-- Flatten arrays: one output row per element
SELECT
    post_id,
    tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;

-- Build an array with ARRAY_AGG
SELECT
    user_id,
    ARRAY_AGG(DISTINCT product_id ORDER BY product_id) AS purchased_products
FROM `myproject.sales.orders`
GROUP BY user_id;

JSON Type

BigQuery's native JSON type stores semi-structured data efficiently without casting to STRING:

SQL
-- Insert a JSON value
INSERT INTO `myproject.events.raw` (event_id, payload)
VALUES (1, JSON '{"action":"click","element":"btn-signup","ms":342}');

-- Extract fields from JSON
SELECT
    event_id,
    STRING(payload.action) AS action,
    INT64(payload.ms) AS latency_ms
FROM `myproject.events.raw`;

-- Use JSON_VALUE for legacy JSON stored as STRING
SELECT
    JSON_VALUE(payload_str, '$.action') AS action,
    CAST(JSON_VALUE(payload_str, '$.ms') AS INT64) AS latency_ms
FROM `myproject.events.legacy`;

GEOGRAPHY Type

Stores geographic points, lines, and polygons using the WGS84 coordinate system:

SQL
SELECT
    store_name,
    ST_DISTANCE(
        location,
        ST_GEOGPOINT(-73.9857, 40.7484)  -- Times Square
    ) / 1000 AS distance_km
FROM `myproject.retail.stores`
WHERE ST_DWITHIN(location, ST_GEOGPOINT(-73.9857, 40.7484), 5000)  -- within 5 km
ORDER BY distance_km;

Table Types

Native Tables

Standard BigQuery tables — data is stored in Capacitor format in Colossus. All performance optimizations (partitioning, clustering) apply here.

External Tables

Data stays in Cloud Storage, Google Drive, Bigtable, or Cloud Spanner. BigQuery reads it on-demand. Useful for landing raw data before transformation.

SQL
-- Create an external table over Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `myproject.raw.clickstream`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/clickstream/year=2026/month=*/*.parquet']
);

External tables do not benefit from partition pruning or clustering. Always materialize frequently queried external data into native tables.
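
A common pattern, sketched below with illustrative column names: land raw files behind the external table, then materialize into a partitioned, clustered native table with CREATE TABLE AS SELECT:

SQL
-- Hypothetical columns; adjust to your clickstream schema
CREATE OR REPLACE TABLE `myproject.analytics.clickstream`
PARTITION BY event_date
CLUSTER BY user_id
AS
SELECT event_date, user_id, page_path
FROM `myproject.raw.clickstream`;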

Views

A saved SQL statement treated as a virtual table. No data is stored; the query runs every time the view is referenced.

SQL
CREATE OR REPLACE VIEW `myproject.analytics.active_users` AS
SELECT
    user_id,
    MAX(session_date) AS last_active,
    COUNT(*) AS total_sessions
FROM `myproject.analytics.sessions`
WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY user_id;

Materialized Views

Like a view, but BigQuery caches the result and refreshes it automatically: typically within 5 minutes of a change to the base table, but no more often than refresh_interval_minutes allows (default 30).

SQL
CREATE MATERIALIZED VIEW `myproject.analytics.daily_revenue`
PARTITION BY revenue_date
CLUSTER BY region
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
    DATE(order_timestamp) AS revenue_date,
    region,
    SUM(order_total) AS total_revenue,
    COUNT(*) AS order_count
FROM `myproject.sales.orders`
GROUP BY revenue_date, region;

Queries that filter on revenue_date or region can use the materialized view automatically via smart tuning — even if they query the base orders table directly.
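
For example, this query against the base table aggregates in a way the materialized view can answer, so BigQuery may transparently rewrite it to read daily_revenue instead:

SQL
SELECT
    region,
    SUM(order_total) AS total_revenue
FROM `myproject.sales.orders`
WHERE DATE(order_timestamp) BETWEEN '2026-05-01' AND '2026-05-07'
GROUP BY region;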

Partitioning

Partitioning splits a table into segments so queries can skip irrelevant data. BigQuery supports three partitioning strategies: time-unit columns (DATE, TIMESTAMP, DATETIME), ingestion time, and integer ranges.

Partition by DATE / DATETIME / TIMESTAMP

SQL
CREATE TABLE `myproject.analytics.events`
(
    event_id    STRING NOT NULL,
    event_date  DATE NOT NULL,
    user_id     STRING,
    event_type  STRING,
    properties  JSON
)
PARTITION BY event_date
OPTIONS (
    partition_expiration_days = 365,
    require_partition_filter = TRUE   -- force callers to filter on event_date
);

require_partition_filter = TRUE is a production safety net — it prevents accidental full-table scans that could cost thousands of dollars.
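
With the option set, a filterless query fails fast instead of scanning the whole table:

SQL
-- Rejected: errors with "Cannot query over table ... without a filter
-- over column(s) 'event_date'" rather than scanning all partitions
SELECT COUNT(*) FROM `myproject.analytics.events`;

-- Accepted: the predicate prunes the scan to one partition
SELECT COUNT(*)
FROM `myproject.analytics.events`
WHERE event_date = '2026-05-01';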

Partition by TIMESTAMP Column

SQL
CREATE TABLE `myproject.logs.app_logs`
(
    log_id      STRING,
    severity    STRING,
    message     STRING,
    logged_at   TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(logged_at, DAY);

BigQuery derives a daily partition from each row's logged_at value and tracks partition boundaries automatically. Note this is still column-based partitioning; true ingestion-time partitioning keys partitions on arrival time instead, as shown below.
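
If you want true ingestion-time partitioning, declare the table with the _PARTITIONDATE pseudocolumn; a minimal sketch (the table name is illustrative):

SQL
-- Rows land in the partition for the day they arrive
CREATE TABLE `myproject.logs.raw_payloads`
(
    payload JSON
)
PARTITION BY _PARTITIONDATE;

-- Query by arrival day via the pseudocolumn
SELECT COUNT(*)
FROM `myproject.logs.raw_payloads`
WHERE _PARTITIONDATE = '2026-05-07';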

Integer Range Partitioning

For non-time dimensions like customer region codes or numeric IDs:

SQL
CREATE TABLE `myproject.sales.transactions`
(
    transaction_id INT64,
    region_code    INT64,
    amount         NUMERIC
)
PARTITION BY RANGE_BUCKET(region_code, GENERATE_ARRAY(0, 100, 10));
-- Creates partitions: [0,10), [10,20), ..., [90,100), plus overflow

Clustering

Clustering sorts the data within each partition (or within the whole table if unpartitioned) by up to 4 columns. BigQuery uses block metadata to skip entire file blocks when a query filters on clustered columns.

SQL
CREATE TABLE `myproject.analytics.pageviews`
(
    view_id     STRING,
    view_date   DATE,
    user_id     STRING,
    country     STRING,
    device_type STRING,
    page_path   STRING,
    duration_ms INT64
)
PARTITION BY view_date
CLUSTER BY country, device_type, user_id;
-- Column order matters: filter on country first for best pruning
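
Here is what that buys at query time: a filter on the leading cluster column lets BigQuery skip whole blocks inside the selected partition.

SQL
-- Partition pruning on view_date, then block skipping on country
SELECT COUNT(*) AS views
FROM `myproject.analytics.pageviews`
WHERE view_date = '2026-05-01'
  AND country = 'DE';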

Rules of thumb:

  • Put the columns your queries filter on most often first in the cluster order; pruning is strongest on the leading column.
  • Clustering is most effective on columns frequently used in WHERE, JOIN, or GROUP BY.
  • Re-clustering happens automatically in the background as new data arrives, at no extra cost; there is nothing to trigger manually.
SQL
-- Inspect table size and row count (region-scoped TABLE_STORAGE view)
SELECT
    table_name,
    total_rows,
    total_logical_bytes / POW(1024, 3) AS size_gb,
    storage_last_modified_time
FROM `myproject.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_schema = 'analytics'
  AND table_name = 'pageviews';

Loading Data

bq load (batch, Cloud Storage → BigQuery)

Bash
# Load Newline-Delimited JSON
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --schema=schema.json \
  myproject:analytics.events \
  gs://my-bucket/events/2026/05/07/*.ndjson

# Load Parquet with auto-detect schema
bq load \
  --source_format=PARQUET \
  --autodetect \
  --replace \
  myproject:analytics.events \
  gs://my-bucket/events/2026/05/07/*.parquet

Streaming Inserts

Streaming inserts write rows immediately (available for queries within seconds), but cost more per byte than batch loads:

Python
from google.cloud import bigquery

client = bigquery.Client()
table_ref = "myproject.analytics.events"

rows = [
    {"event_id": "e001", "event_date": "2026-05-07", "event_type": "signup"},
    {"event_id": "e002", "event_date": "2026-05-07", "event_type": "login"},
]

errors = client.insert_rows_json(table_ref, rows)
if errors:
    raise RuntimeError(f"Streaming insert errors: {errors}")

In production, use streaming inserts only when sub-minute latency is required. Otherwise prefer batch loads — they are cheaper and do not count against streaming quotas.

Cloud Storage Transfer

For large, recurring loads, use Cloud Storage as the staging layer. Drop files from your source system into GCS, then trigger a BigQuery load job via Pub/Sub notification or Cloud Scheduler.
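
The triggered job itself can be plain SQL: BigQuery's LOAD DATA statement expresses the same load as bq load, which makes it easy to run from a scheduled query (bucket path reused from the examples above):

SQL
LOAD DATA INTO `myproject.analytics.events`
FROM FILES (
    format = 'NEWLINE_DELIMITED_JSON',
    uris = ['gs://my-bucket/events/2026/05/07/*.ndjson']
);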

Exporting Data

Bash
# Export to Parquet (recommended for downstream processing)
bq extract \
  --destination_format=PARQUET \
  --compression=SNAPPY \
  myproject:analytics.pageviews \
  gs://my-bucket/exports/pageviews_*.parquet

# Export a query result (not the full table)
bq query \
  --use_legacy_sql=false \
  --destination_table=myproject:temp.export_result \
  --replace \
  "SELECT * FROM \`myproject.analytics.pageviews\` WHERE view_date = '2026-05-01'"

bq extract myproject:temp.export_result gs://my-bucket/exports/pageviews_20260501_*.csv

INFORMATION_SCHEMA: Table Metadata Queries

BigQuery exposes rich metadata through INFORMATION_SCHEMA views — essential for platform-level observability.

SQL
-- All tables in a dataset with size and row count (region-scoped view)
SELECT
    table_name,
    creation_time,
    storage_last_modified_time,
    total_rows,
    ROUND(total_logical_bytes / POW(1024, 3), 2) AS size_gb
FROM `myproject.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_schema = 'analytics'
ORDER BY total_logical_bytes DESC;

-- Column-level metadata, including partitioning and clustering columns
SELECT
    table_name,
    column_name,
    ordinal_position,
    data_type,
    is_nullable,
    is_partitioning_column,
    clustering_ordinal_position
FROM `myproject.analytics.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events'
ORDER BY ordinal_position;

-- Job history: queries in the last 24 hours, sorted by bytes billed
SELECT
    job_id,
    user_email,
    query,
    ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
    ROUND(total_slot_ms / 1000, 1) AS slot_seconds,
    TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec,
    state
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 50;

BigQuery vs Snowflake vs Redshift

| Dimension | BigQuery | Snowflake | Amazon Redshift |
|-----------|----------|-----------|-----------------|
| Compute model | Serverless (auto-scales) | Virtual warehouses (manual sizing) | Provisioned clusters + serverless option |
| Pricing | Per-query bytes scanned + storage | Per warehouse-hour + storage | Per node-hour + storage |
| Idle cost | $0 | $0 only once the warehouse suspends (auto-suspend configurable) | Cluster runs 24/7 unless paused |
| Storage format | Capacitor (proprietary columnar) | PAX (hybrid row/column) | Redshift-specific columnar |
| SQL dialect | GoogleSQL (Standard SQL + extensions) | ANSI SQL + extensions | PostgreSQL-based |
| Semi-structured | Native JSON, STRUCT, ARRAY | VARIANT type | SUPER type |
| Geospatial | Native ST_ functions | GEOGRAPHY / GEOMETRY types | Native GEOMETRY + spatial functions |
| ML integration | BigQuery ML (native CREATE MODEL) | Snowpark ML | Amazon Redshift ML |
| External tables | GCS, Drive, Bigtable, Spanner | S3, GCS, Azure Blob | S3 (Spectrum) |
| Max query size | 1 MB query text | 1 MB query text | 16 MB query text |
| Best for | GCP ecosystem, ad-hoc at scale | Cross-cloud, governed enterprise | AWS ecosystem, existing Redshift users |

Quick Reference: Common DDL Patterns

SQL
-- Partitioned and clustered table (production template)
CREATE TABLE IF NOT EXISTS `myproject.sales.order_items`
(
    order_id          STRING NOT NULL,
    item_id           STRING NOT NULL,
    created_date      DATE NOT NULL,
    product_id        STRING,
    category          STRING,
    unit_price        NUMERIC(10, 2),
    quantity          INT64,
    warehouse_region  STRING,
    metadata          JSON
)
PARTITION BY created_date
CLUSTER BY warehouse_region, category
OPTIONS (
    partition_expiration_days = 730,
    require_partition_filter = TRUE,
    description = "Order line items — partitioned by created_date, clustered by region and category"
);

-- Add a column (ALTER TABLE is supported in BigQuery)
ALTER TABLE `myproject.sales.order_items`
ADD COLUMN IF NOT EXISTS discount_pct FLOAT64;

-- Create a view with a rolling date window (CURRENT_DATE keeps it fresh)
CREATE OR REPLACE VIEW `myproject.sales.last_30_days_orders` AS
SELECT *
FROM `myproject.sales.order_items`
WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

Summary

BigQuery's serverless, columnar architecture means performance is primarily governed by data scanned, not compute provisioned. Master these three fundamentals and you will cut costs and improve query speed dramatically:

  • Partition on the time column that appears most often in WHERE clauses. Set require_partition_filter on large tables.
  • Cluster on the 2–4 columns most commonly used in filters and joins, ordered by how often queries filter on them.
  • Use INFORMATION_SCHEMA to monitor table sizes, job costs, and schema drift in production pipelines.

The next lesson covers BigQuery's SQL dialect in depth — window functions, scripting, BigQuery ML, and geospatial queries.

Enjoyed this article?

Explore the Data Engineering learning path for more.
