BigQuery Fundamentals for Data Engineers
Master Google BigQuery from the ground up — architecture, storage, partitioning, clustering, loading strategies, and the full SQL feature set data engineers use in production.
What Makes BigQuery Different
BigQuery is Google's fully managed, serverless data warehouse. You do not provision clusters, tune memory settings, or manage storage — you write SQL and Google handles the rest. That serverless model changes the economics of analytics: storage and compute are billed separately, idle time costs nothing, and a single query can scan petabytes in seconds.
Three engineering decisions underpin every BigQuery performance and cost conversation:
- Columnar storage via Capacitor — data is stored in Google's proprietary columnar format, so a query reading 3 of 50 columns touches roughly 6% of the data a row store would read.
- Dremel execution engine — queries fan out across thousands of worker nodes automatically; you observe the parallelism only as fast results.
- Separate compute and storage — compute scales to zero when idle; you pay for storage (compressed bytes) and compute (bytes scanned by queries).
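To make that billing model concrete, here is a back-of-envelope sketch in Python. The $6.25-per-TiB figure is an assumption (the published on-demand list price at the time of writing); verify current pricing before relying on it.

```python
# Back-of-envelope model of BigQuery on-demand billing: you pay for bytes
# scanned, and columnar storage means you scan only the columns you read.
TIB = 1024 ** 4
PRICE_PER_TIB_USD = 6.25  # assumed on-demand list price; verify current pricing

def query_cost_usd(bytes_scanned: float, price_per_tib: float = PRICE_PER_TIB_USD) -> float:
    """On-demand cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TIB * price_per_tib

# A 10 TiB table with 50 similar-sized columns: reading 3 columns scans ~3/50.
full_scan = 10 * TIB
three_of_fifty = full_scan * 3 / 50

print(f"full table scan: ${query_cost_usd(full_scan):.2f}")      # $62.50
print(f"3 of 50 columns: ${query_cost_usd(three_of_fifty):.2f}")  # $3.75
```

The same arithmetic explains why `SELECT *` on a wide table is the single most expensive habit in BigQuery.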
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ BigQuery Service │
│ │
│ ┌──────────┐ ┌─────────────┐ ┌──────────────────┐ │
│ │ Dremel │ │ Colossus │ │ Jupiter Network │ │
│ │ (compute)│ │ (storage) │ │ (10 Pb/s xfer) │ │
│ └────┬─────┘ └──────┬──────┘ └────────┬─────────┘ │
│ │ │ │ │
│ └────────────────┴────────────────────┘ │
└─────────────────────────────────────────────────────────┘
▲ ▲
│ SQL queries │ Streaming / batch load
  Client tools              Source data

Colossus is Google's distributed file system; it stores the Capacitor-formatted column files. Because compute (Dremel) and storage (Colossus) communicate over the high-bandwidth Jupiter network, there is no I/O bottleneck between a "compute node" and "its" disk. This decoupling is why BigQuery scales more smoothly than traditional MPP systems that tie storage to specific nodes.
Resource Hierarchy: Projects, Datasets, Tables
Every BigQuery resource lives inside a three-level hierarchy:
GCP Project (billing, IAM, quotas)
└── Dataset (namespace, access controls, location, default TTL)
├── Table (native or external data)
├── View (saved SQL query, virtual table)
└── Materialized View (cached query result, auto-refreshed)

- A project holds the billing account and is the unit of quota enforcement.
- A dataset is a namespace. Tables inside a dataset inherit its region (US, EU, asia-northeast1, etc.). You cannot join tables across regions in a single query.
- A table is the actual data container.
Fully qualified table references follow the pattern: project_id.dataset_id.table_id.
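A small illustrative helper (not part of any Google library) for splitting such references; it assumes a plain project ID without a legacy `domain.com:project` prefix:

```python
# Illustrative only: split a fully qualified reference into its three parts.
# Assumes a plain project ID (no legacy "domain.com:project" prefix).
def parse_table_ref(ref: str) -> tuple[str, str, str]:
    """Split 'project.dataset.table', tolerating surrounding backticks."""
    parts = ref.strip("`").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got {ref!r}")
    return (parts[0], parts[1], parts[2])

print(parse_table_ref("`myproject.analytics.sessions`"))
# ('myproject', 'analytics', 'sessions')
```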
Interacting with BigQuery
BigQuery Console
The web UI at console.cloud.google.com/bigquery supports ad-hoc queries, schema browsing, and job history. Useful for exploration; avoid it for automated pipelines.
bq CLI
The bq command-line tool ships with the Google Cloud SDK.
# Authenticate: bq shares gcloud's credentials; client libraries use ADC
gcloud auth login
gcloud auth application-default login
# Run a query (prints results to stdout)
bq query --use_legacy_sql=false \
"SELECT COUNT(*) FROM \`myproject.mydataset.orders\`"
# Show table schema
bq show --schema --format=prettyjson myproject:mydataset.orders
# Load CSV from Cloud Storage
bq load \
--source_format=CSV \
--autodetect \
myproject:mydataset.orders \
gs://my-bucket/orders/*.csv
# Export table to Cloud Storage as Parquet
bq extract \
--destination_format=PARQUET \
myproject:mydataset.orders \
gs://my-bucket/export/orders_*.parquet
# Create a dataset in a specific region
bq mk --location=EU myproject:analytics
# Estimate bytes scanned before running (dry run)
bq query --use_legacy_sql=false --dry_run \
  "SELECT * FROM \`myproject.mydataset.events\` WHERE event_date = '2026-01-01'"

Python Client
from google.cloud import bigquery
# Application Default Credentials (ADC) — recommended for Cloud environments
client = bigquery.Client(project="myproject")
# Run a query
query = """
SELECT
user_id,
COUNT(*) AS session_count
FROM `myproject.analytics.sessions`
WHERE session_date BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY user_id
ORDER BY session_count DESC
LIMIT 100
"""
results = client.query(query).to_dataframe()
print(results.head())

Data Types
Primitive Types
| Type | Notes |
|------|-------|
| INT64 / INTEGER | 64-bit signed integer |
| FLOAT64 / FLOAT | Double-precision float |
| NUMERIC | 38 digits of precision, 9 decimal scale — use for money |
| BIGNUMERIC | 76 digits of precision, 38 decimal scale |
| STRING | Variable-length UTF-8 text |
| BYTES | Raw bytes |
| BOOL | true / false |
| DATE | Calendar date, no time |
| TIME | Time of day, no date |
| DATETIME | Date + time, no timezone |
| TIMESTAMP | Absolute point in time (stored as UTC) |
STRUCT (Record Type)
A STRUCT is an ordered collection of typed fields — like a row embedded inside a column:
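As an analogy only (Python has no STRUCT type), the dotted field access in the SQL below behaves like attribute access on a typed record:

```python
from typing import NamedTuple

# Analogy: a STRUCT column is like a typed record nested inside each row.
class Address(NamedTuple):
    street: str
    city: str
    postal_code: str
    country: str

row = {
    "customer_id": 42,
    "address": Address("Unter den Linden 1", "Berlin", "10117", "DE"),
}

# SQL's address.city corresponds to attribute access on the nested record:
print(row["address"].city)  # Berlin
```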
-- Table schema contains a STRUCT column named "address"
SELECT
customer_id,
address.street,
address.city,
address.postal_code
FROM `myproject.crm.customers`
WHERE address.country = 'DE';

You can also construct STRUCTs inline:
SELECT
order_id,
STRUCT(
ship_to_name AS name,
ship_to_street AS street,
ship_to_city AS city
) AS shipping_address
FROM `myproject.sales.orders`;

ARRAY Type
ARRAYs hold zero or more values of the same type. Every row can have a different number of elements.
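The UNNEST example below has the same semantics as this plain-Python flattening sketch: one output row per element, and rows with empty arrays drop out of the join entirely.

```python
# Mirror of UNNEST's cross-join flattening: one (post_id, tag) row per element.
posts = [
    {"post_id": 1, "tags": ["sql", "bigquery"]},
    {"post_id": 2, "tags": ["gcp"]},
    {"post_id": 3, "tags": []},  # emits no rows, like a plain UNNEST join
]

flattened = [(p["post_id"], tag) for p in posts for tag in p["tags"]]
print(flattened)  # [(1, 'sql'), (1, 'bigquery'), (2, 'gcp')]
```

(In SQL, a `LEFT JOIN UNNEST(tags)` would keep post 3 with a NULL tag.)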
-- Select the first tag from an array column
SELECT
post_id,
tags[OFFSET(0)] AS primary_tag,
ARRAY_LENGTH(tags) AS tag_count
FROM `myproject.blog.posts`;
-- Flatten arrays: one output row per element
SELECT
post_id,
tag
FROM `myproject.blog.posts`,
UNNEST(tags) AS tag;
-- Build an array with ARRAY_AGG
SELECT
user_id,
ARRAY_AGG(DISTINCT product_id ORDER BY product_id) AS purchased_products
FROM `myproject.sales.orders`
GROUP BY user_id;

JSON Type
BigQuery's native JSON type stores semi-structured data efficiently without casting to STRING:
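The field extraction the SQL below performs is the same operation this stdlib sketch does in Python:

```python
import json

# Same extraction as STRING(payload.action) / INT64(payload.ms) in the SQL below.
payload = json.loads('{"action": "click", "element": "btn-signup", "ms": 342}')

action = str(payload["action"])
latency_ms = int(payload["ms"])

print(action, latency_ms)  # click 342
```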
-- Insert a JSON value
INSERT INTO `myproject.events.raw` (event_id, payload)
VALUES (1, JSON '{"action":"click","element":"btn-signup","ms":342}');
-- Extract fields from JSON
SELECT
event_id,
STRING(payload.action) AS action,
INT64(payload.ms) AS latency_ms
FROM `myproject.events.raw`;
-- Use JSON_VALUE for legacy JSON stored as STRING
SELECT
JSON_VALUE(payload_str, '$.action') AS action,
CAST(JSON_VALUE(payload_str, '$.ms') AS INT64) AS latency_ms
FROM `myproject.events.legacy`;

GEOGRAPHY Type
Stores geographic points, lines, and polygons using the WGS84 coordinate system:
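For intuition, the ST_DISTANCE call below computes roughly what this spherical haversine sketch does. BigQuery actually measures geodesic distance on the WGS84 spheroid, so its results differ slightly from this spherical approximation; the first coordinate pair matches the query's reference point.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation only)

def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Distance from the query's reference point to a point ~1 km due north.
d = haversine_m(-73.9857, 40.7484, -73.9857, 40.7580)
print(f"{d / 1000:.2f} km")
```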
SELECT
store_name,
ST_DISTANCE(
location,
    ST_GEOGPOINT(-73.9857, 40.7484) -- Empire State Building
) / 1000 AS distance_km
FROM `myproject.retail.stores`
WHERE ST_DWITHIN(location, ST_GEOGPOINT(-73.9857, 40.7484), 5000) -- within 5 km
ORDER BY distance_km;

Table Types
Native Tables
Standard BigQuery tables — data is stored in Capacitor format in Colossus. All performance optimizations (partitioning, clustering) apply here.
External Tables
Data stays in Cloud Storage, Google Drive, Bigtable, or Cloud Spanner. BigQuery reads it on-demand. Useful for landing raw data before transformation.
-- Create an external table over Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `myproject.raw.clickstream`
OPTIONS (
format = 'PARQUET',
uris = ['gs://my-bucket/clickstream/year=2026/month=*/*.parquet']
);

External tables do not benefit from partition pruning or clustering. Materialize frequently queried external data into native tables.
Views
A saved SQL statement treated as a virtual table. No data is stored; the query runs every time the view is referenced.
CREATE OR REPLACE VIEW `myproject.analytics.active_users` AS
SELECT
user_id,
MAX(session_date) AS last_active,
COUNT(*) AS total_sessions
FROM `myproject.analytics.sessions`
WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY user_id;

Materialized Views
Like a view, but BigQuery stores the query result and keeps it fresh automatically: after base-table changes it refreshes incrementally, no more often than the configured refresh_interval_minutes.
CREATE MATERIALIZED VIEW `myproject.analytics.daily_revenue`
PARTITION BY revenue_date
CLUSTER BY region
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
DATE(order_timestamp) AS revenue_date,
region,
SUM(order_total) AS total_revenue,
COUNT(*) AS order_count
FROM `myproject.sales.orders`
GROUP BY revenue_date, region;

Queries that filter on revenue_date or region can use the materialized view automatically via smart tuning — even if they query the base orders table directly.
Partitioning
Partitioning splits a table into segments so queries can skip irrelevant data. BigQuery supports three partitioning strategies: time-unit column (DATE, DATETIME, or TIMESTAMP), ingestion time, and integer range.
Partition by DATE / DATETIME / TIMESTAMP
CREATE TABLE `myproject.analytics.events`
(
event_id STRING NOT NULL,
event_date DATE NOT NULL,
user_id STRING,
event_type STRING,
properties JSON
)
PARTITION BY event_date
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = TRUE -- force callers to filter on event_date
);

require_partition_filter = TRUE is a production safety net — it prevents accidental full-table scans that could cost thousands of dollars.
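The arithmetic behind that safety net, as a sketch (the $6.25-per-TiB on-demand rate is an assumption; verify current pricing):

```python
# With daily partitions, a query filtered to one day scans ~1/365 of a year's
# data. Illustrative sizes; the price per TiB is an assumed list price.
TIB = 1024 ** 4
PRICE_PER_TIB_USD = 6.25

table_bytes = 2 * TIB   # a year of events
days_retained = 365

full_scan_cost = table_bytes / TIB * PRICE_PER_TIB_USD
one_day_cost = full_scan_cost / days_retained

print(f"unfiltered scan: ${full_scan_cost:.2f}")   # $12.50
print(f"single-day scan: ${one_day_cost:.4f}")     # $0.0342
```

Multiply by thousands of queries per day and the filter requirement pays for itself immediately.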
Partition by TIMESTAMP Column
CREATE TABLE `myproject.logs.app_logs`
(
log_id STRING,
severity STRING,
message STRING,
logged_at TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(logged_at, DAY);

BigQuery automatically creates a partition for each day and tracks partition boundaries. (Ingestion-time partitioning is a separate option: PARTITION BY _PARTITIONDATE partitions on arrival time rather than a data column.)
Integer Range Partitioning
For non-time dimensions like customer region codes or numeric IDs:
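RANGE_BUCKET assigns each row to a bucket by counting the boundaries at or below its value; the DDL below relies on that rule, which this Python sketch mirrors with the bisect module:

```python
import bisect

# Mirror of RANGE_BUCKET: the bucket index is the count of boundaries <= point.
boundaries = list(range(0, 101, 10))  # like GENERATE_ARRAY(0, 100, 10)

def range_bucket(point: int, bounds: list[int]) -> int:
    return bisect.bisect_right(bounds, point)

print(range_bucket(25, boundaries))   # 3  -> bucket covering [20, 30)
print(range_bucket(-5, boundaries))   # 0  -> below the first boundary
print(range_bucket(100, boundaries))  # 11 -> at or beyond the last boundary
```

In the actual table, out-of-range rows land in a special __UNPARTITIONED__ partition rather than an error.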
CREATE TABLE `myproject.sales.transactions`
(
transaction_id INT64,
region_code INT64,
amount NUMERIC
)
PARTITION BY RANGE_BUCKET(region_code, GENERATE_ARRAY(0, 100, 10));
-- Creates partitions: [0,10), [10,20), ..., [90,100); out-of-range rows
-- land in the __UNPARTITIONED__ partition

Clustering
Clustering sorts the data within each partition (or within the whole table if unpartitioned) by up to 4 columns. BigQuery uses block metadata to skip entire file blocks when a query filters on clustered columns.
CREATE TABLE `myproject.analytics.pageviews`
(
view_id STRING,
view_date DATE,
user_id STRING,
country STRING,
device_type STRING,
page_path STRING,
duration_ms INT64
)
PARTITION BY view_date
CLUSTER BY country, device_type, user_id;
-- Column order matters: filter on country first for best pruning

Rules of thumb:
- Order cluster columns by how often queries filter on them, most frequently filtered first.
- Clustering is most effective on columns frequently used in WHERE, JOIN, or GROUP BY.
- Re-clustering happens automatically in the background, at no extra cost, as new data arrives.
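A toy model of why this works, under the simplifying assumption that each storage block records the min and max of a clustered column: because the data is sorted, matching rows concentrate in a few blocks, and the block metadata lets the rest be skipped without reading them.

```python
# Toy block-pruning model: each block stores (min, max) of the clustered
# column; a point filter only reads blocks whose range could contain it.
blocks = [("AR", "CA"), ("CA", "DE"), ("DE", "FR"), ("FR", "JP"), ("JP", "US")]

def blocks_to_read(want: str) -> list[int]:
    """Indices of blocks whose [min, max] range could contain `want`."""
    return [i for i, (lo, hi) in enumerate(blocks) if lo <= want <= hi]

print(blocks_to_read("DE"))  # [1, 2]: 3 of 5 blocks skipped
```

Unsorted data would spread 'DE' rows across every block, and no block could be skipped.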
-- Inspect table size and freshness via the dataset's __TABLES__ metadata view
SELECT
  table_id AS table_name,
  row_count,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_gb,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `myproject.analytics.__TABLES__`
WHERE table_id = 'pageviews';

Loading Data
bq load (batch, Cloud Storage → BigQuery)
# Load Newline-Delimited JSON
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--schema=schema.json \
myproject:analytics.events \
gs://my-bucket/events/2026/05/07/*.ndjson
# Load Parquet with auto-detect schema
bq load \
--source_format=PARQUET \
--autodetect \
--replace \
myproject:analytics.events \
  gs://my-bucket/events/2026/05/07/*.parquet

Streaming Inserts
Streaming inserts write rows immediately (available for queries within seconds), but cost more per byte than batch loads:
from google.cloud import bigquery
client = bigquery.Client()
table_ref = "myproject.analytics.events"
rows = [
{"event_id": "e001", "event_date": "2026-05-07", "event_type": "signup"},
{"event_id": "e002", "event_date": "2026-05-07", "event_type": "login"},
]
errors = client.insert_rows_json(table_ref, rows)
if errors:
    raise RuntimeError(f"Streaming insert errors: {errors}")

In production, use streaming inserts only when sub-minute latency is required. Otherwise prefer batch loads — they are cheaper and do not count against streaming quotas.
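Streaming requests should also stay small; Google's guidance for the legacy insertAll API suggests batches of roughly 500 rows per request. A generic, library-free chunking helper:

```python
# Split a large row list into batches. ~500 rows is the commonly recommended
# batch size for legacy streaming inserts; hard per-request limits are larger.
def chunked(rows: list[dict], size: int = 500):
    """Yield successive batches of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

pending_rows = [{"event_id": f"e{i:04d}", "event_type": "login"} for i in range(1200)]
batch_sizes = [len(batch) for batch in chunked(pending_rows)]
print(batch_sizes)  # [500, 500, 200]
```

Each batch would then go through `client.insert_rows_json` as in the example above, with per-batch error handling.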
Cloud Storage Transfer
For large, recurring loads, use Cloud Storage as the staging layer. Drop files from your source system into GCS, then trigger a BigQuery load job via Pub/Sub notification or Cloud Scheduler.
Exporting Data
# Export to Parquet (recommended for downstream processing)
bq extract \
--destination_format=PARQUET \
--compression=SNAPPY \
myproject:analytics.pageviews \
gs://my-bucket/exports/pageviews_*.parquet
# Export a query result (not the full table)
bq query \
--use_legacy_sql=false \
--destination_table=myproject:temp.export_result \
--replace \
"SELECT * FROM \`myproject.analytics.pageviews\` WHERE view_date = '2026-05-01'"
bq extract myproject:temp.export_result gs://my-bucket/exports/pageviews_20260501_*.csv

INFORMATION_SCHEMA: Table Metadata Queries
BigQuery exposes rich metadata through INFORMATION_SCHEMA views — essential for platform-level observability.
-- All tables in a dataset with size and row count (TABLE_STORAGE view)
SELECT
  table_name,
  total_rows,
  total_partitions,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS logical_gb,
  ROUND(total_physical_bytes / POW(1024, 3), 2) AS physical_gb
FROM `myproject.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_schema = 'analytics'
ORDER BY total_logical_bytes DESC;
-- Column-level metadata
SELECT
table_name,
column_name,
ordinal_position,
data_type,
is_nullable
FROM `myproject.analytics.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events'
ORDER BY ordinal_position;
-- Job history: queries in the last 24 hours, sorted by bytes billed
SELECT
job_id,
user_email,
query,
ROUND(total_bytes_billed / POW(1024, 4), 4) AS tb_billed,
ROUND(total_slot_ms / 1000, 1) AS slot_seconds,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec,
state
FROM `myproject.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 50;

BigQuery vs Snowflake vs Redshift
| Dimension | BigQuery | Snowflake | Amazon Redshift |
|-----------|----------|-----------|-----------------|
| Compute model | Serverless (auto-scales) | Virtual warehouses (manual sizing) | Provisioned clusters + Serverless option |
| Pricing | Per-query bytes scanned + storage | Per warehouse-hour + storage | Per node-hour + storage |
| Idle cost | $0 | Warehouse billed until suspended (auto-suspend available) | Provisioned cluster bills until paused; Serverless scales to zero |
| Storage format | Capacitor (proprietary columnar) | PAX (hybrid row/column) | Redshift-specific columnar |
| SQL dialect | Standard SQL + extensions | ANSI SQL + extensions | PostgreSQL-based |
| Semi-structured | Native JSON, STRUCT, ARRAY | VARIANT type | SUPER type |
| Geospatial | Native ST_ functions | Native GEOGRAPHY/GEOMETRY | Native spatial (GEOMETRY) |
| ML integration | BigQuery ML (native CREATE MODEL) | Snowpark ML | Amazon Redshift ML |
| External tables | GCS, Drive, Bigtable, Spanner | S3, GCS, Azure Blob | S3 (Spectrum) |
| Max query size | No practical limit | No practical limit | 16 MB query text |
| Best for | GCP ecosystem, ad-hoc at scale | Cross-cloud, governed enterprise | AWS ecosystem, existing Redshift users |
Quick Reference: Common DDL Patterns
-- Partitioned and clustered table (production template)
CREATE TABLE IF NOT EXISTS `myproject.sales.order_items`
(
order_id STRING NOT NULL,
item_id STRING NOT NULL,
created_date DATE NOT NULL,
product_id STRING,
category STRING,
unit_price NUMERIC(10, 2),
quantity INT64,
warehouse_region STRING,
metadata JSON
)
PARTITION BY created_date
CLUSTER BY warehouse_region, category
OPTIONS (
partition_expiration_days = 730,
require_partition_filter = TRUE,
description = "Order line items — partitioned by created_date, clustered by region and category"
);
-- Add a column (ALTER TABLE is supported in BigQuery)
ALTER TABLE `myproject.sales.order_items`
ADD COLUMN IF NOT EXISTS discount_pct FLOAT64;
-- Create a view referencing a parameterized date (use current_date for rolling windows)
CREATE OR REPLACE VIEW `myproject.sales.last_30_days_orders` AS
SELECT *
FROM `myproject.sales.order_items`
WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

Summary
BigQuery's serverless, columnar architecture means performance is primarily governed by data scanned, not compute provisioned. Master these three fundamentals and you will cut costs and improve query speed dramatically:
- Partition on the time column that appears most often in WHERE clauses, and set require_partition_filter on large tables.
- Cluster on the 2–4 columns most commonly used in filters and joins, ordered by how often queries filter on them.
- Use INFORMATION_SCHEMA to monitor table sizes, job costs, and schema drift in production pipelines.
The next lesson covers BigQuery's SQL dialect in depth — window functions, scripting, BigQuery ML, and geospatial queries.