Snowflake — Data Warehousing, Snowpark & Data Sharing
Production Snowflake guide — virtual warehouses, storage architecture, SQL analytics, Snowpark Python, dynamic tables, data sharing, Marketplace, and cost management. With Python and SQL examples throughout.
Snowflake is the cloud data platform built from scratch for the cloud era — separating storage and compute so you can scale each independently. It runs on AWS, Azure, and GCP, and its killer feature for enterprises is seamless, secure data sharing: share live data across accounts without copying it.
Architecture: Separation of Storage and Compute
┌──────────────────────────────────────────────────────────────┐
│                    Snowflake Architecture                    │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Query Layer (Cloud Services)              │  │
│  │  Optimizer, metadata, auth, transactions, security     │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │           Compute Layer (Virtual Warehouses)           │  │
│  │  XS / S / M / L / XL — independent clusters            │  │
│  │  Auto-suspend + auto-resume — pay only when active     │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            Storage Layer (Micro-partitions)            │  │
│  │  Proprietary columnar format — S3/Azure Blob/GCS       │  │
│  │  Automatic clustering, pruning, metadata indexing      │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
Key Architectural Concepts
Micro-partitions: Snowflake stores data in micro-partitions, each holding 50–500 MB of uncompressed data (stored compressed, in a proprietary columnar format). Each partition has column-level metadata (min, max, count, distinct values), so Snowflake prunes partitions without reading data, like a file-level index for every table.
Virtual Warehouses: Independently sized compute clusters. Multiple warehouses can query the same data simultaneously. Each has its own credit consumption, scaling policy, and timeout.
Time Travel + Fail-safe: Every DML operation creates new micro-partition versions. Time Travel lets you query data as of any point within the retention window (1 day by default, up to 90 days on Enterprise edition). Fail-safe adds 7 more days of non-queryable recovery via Snowflake support.
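Pruning shows up directly in SQL. A minimal sketch, assuming the orders table defined later in this guide with created_at as the pruning column:
-- Filtering on a well-clustered column lets Snowflake skip entire micro-partitions
SELECT COUNT(*), SUM(amount)
FROM orders
WHERE created_at >= '2026-04-01';
-- Inspect clustering quality for a candidate pruning column
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(created_at)');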
Virtual Warehouses — Compute Sizing
-- Create a warehouse with auto-suspend
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300 -- suspend after 5 min idle
AUTO_RESUME = TRUE -- resume automatically on query
MAX_CLUSTER_COUNT = 3 -- multi-cluster: scale out under concurrency
MIN_CLUSTER_COUNT = 1
SCALING_POLICY = 'STANDARD';
-- Switch warehouse in session
USE WAREHOUSE analytics_wh;
-- Monitor warehouse credit usage (metering is hourly per warehouse;
-- per-query stats live in query_history, not in this view)
SELECT warehouse_name,
SUM(credits_used) AS total_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;
Warehouse Sizing Guide
| Size | Credits/hour | Use case |
|------|--------------|----------|
| XS | 1 | Development, light queries |
| S | 2 | Small ETL, dashboards |
| M | 4 | Standard analytics workloads |
| L | 8 | Heavy transforms, large joins |
| XL | 16 | Very large joins, ML feature prep |
| 2XL–6XL | 32–512 | Bulk loading, data science |
Rule of thumb: doubling the size roughly halves the runtime at the same credit cost, for workloads that parallelize well. Choose based on query latency requirements, not cost alone.
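Resizing takes effect for new queries while running ones finish on the old size, so a common pattern is to scale up for a heavy job and back down afterwards. A minimal sketch, reusing the analytics_wh warehouse from above:
-- Scale up before a heavy transform
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the heavy job ...
-- Scale back down when it finishes
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';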
Database Objects: Hierarchy
-- Three-level namespace: database.schema.object
CREATE DATABASE analytics_prod;
CREATE SCHEMA analytics_prod.sales;
-- Table
CREATE OR REPLACE TABLE analytics_prod.sales.orders (
order_id VARCHAR(36) NOT NULL,
customer_id VARCHAR(36) NOT NULL,
amount NUMBER(12, 2) NOT NULL,
status VARCHAR(20) DEFAULT 'pending',
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
CONSTRAINT pk_orders PRIMARY KEY (order_id)
);
-- Clustering key: improves pruning on large tables queried by a specific column
ALTER TABLE analytics_prod.sales.orders
CLUSTER BY (DATE_TRUNC('month', created_at));
Semi-Structured Data: VARIANT
Snowflake's VARIANT column type stores JSON, XML, or Avro natively, with no schema defined up front:
-- VARIANT column for flexible JSON
CREATE TABLE events (
event_id VARCHAR(36),
event_type VARCHAR(100),
payload VARIANT, -- stores raw JSON
received_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Insert JSON
INSERT INTO events (event_id, event_type, payload)
SELECT UUID_STRING(),
'order_placed',
PARSE_JSON('{
"order_id": "ORD-001",
"customer": {"id": "C-123", "name": "Alice"},
"items": [{"sku": "SKU-1", "qty": 2, "price": 29.99}]
}');
-- Query JSON with : and [] notation
SELECT
payload:order_id::VARCHAR AS order_id,
payload:customer.name::VARCHAR AS customer_name,
payload:items[0].price::FLOAT AS first_item_price,
ARRAY_SIZE(payload:items) AS item_count
FROM events
WHERE event_type = 'order_placed';
-- FLATTEN: expand array elements into rows
SELECT
e.event_id,
item.value:sku::VARCHAR AS sku,
item.value:qty::INT AS quantity,
item.value:price::FLOAT AS unit_price
FROM events e,
LATERAL FLATTEN(INPUT => e.payload:items) item;
SQL Analytics — Window Functions and Advanced Queries
Snowflake supports the standard SQL window and analytic functions:
-- Running total and rank within group
SELECT
customer_id,
order_id,
amount,
created_at,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY created_at
ROWS UNBOUNDED PRECEDING
) AS lifetime_spend,
RANK() OVER (
PARTITION BY DATE_TRUNC('month', created_at)
ORDER BY amount DESC
) AS rank_this_month,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY created_at
) AS prev_order_amount
FROM orders
WHERE status = 'completed';
-- QUALIFY: filter on window function result (Snowflake extension)
SELECT customer_id, order_id, amount, created_at
FROM orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY created_at DESC
) = 1; -- latest order per customer — no subquery needed
MERGE for Upserts
MERGE INTO analytics_prod.silver.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
CURRENT_TIMESTAMP(), source.updated_at);
Time Travel
-- Query data at a specific point in the past
SELECT * FROM orders AT (TIMESTAMP => '2026-04-10 09:00:00'::TIMESTAMP_NTZ);
-- Query data before a specific DML statement
SELECT * FROM orders BEFORE (STATEMENT => '01b3d5a8-0000-1234-abcd-1234567890ab');
-- Restore a table to a previous state
CREATE OR REPLACE TABLE orders AS
SELECT * FROM orders AT (TIMESTAMP => '2026-04-09 00:00:00'::TIMESTAMP_NTZ);
-- Clone table to a historical point (zero-copy — no storage until writes diverge)
CREATE TABLE orders_backup CLONE orders
AT (TIMESTAMP => '2026-04-15 18:00:00'::TIMESTAMP_NTZ);
-- Inspect a table's storage, including time-travel and fail-safe bytes
-- (TABLE_STORAGE_METRICS is an Information Schema view, not a table function)
SELECT *
FROM analytics_prod.INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
WHERE table_name = 'ORDERS'
AND table_schema = 'SALES';
Zero-Copy Cloning
Cloning creates an instant, independent copy of any database object (table, schema, database) pointing to the same storage. You only pay for new writes after the clone diverges — useful for dev/test environments, point-in-time snapshots, and safe data experimentation.
-- Clone entire database for testing
CREATE DATABASE analytics_staging CLONE analytics_prod;
-- Clone at a point in time
CREATE TABLE orders_march CLONE orders
AT (TIMESTAMP => '2026-03-31 23:59:59'::TIMESTAMP_NTZ);
Dynamic Tables — Declarative Pipelines
Dynamic Tables are Snowflake's answer to materialized views with automatic, incremental refresh:
-- Define a dynamic table — Snowflake manages refresh automatically
CREATE OR REPLACE DYNAMIC TABLE silver.customer_monthly_stats
TARGET_LAG = '1 hour' -- maximum allowed data staleness
WAREHOUSE = etl_wh
AS
SELECT
customer_id,
DATE_TRUNC('month', created_at) AS month,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
AVG(amount) AS avg_order_value
FROM bronze.orders
WHERE status = 'completed'
GROUP BY 1, 2;
-- Check refresh history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
NAME => 'silver.customer_monthly_stats'
));
Dynamic tables replace manual TASK-based refresh pipelines. Chain them to build a declarative Bronze → Silver → Gold Medallion Architecture.
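A minimal sketch of chaining, building a hypothetical gold-layer rollup on top of the silver dynamic table above:
CREATE OR REPLACE DYNAMIC TABLE gold.monthly_revenue
TARGET_LAG = '1 hour' -- the terminal table sets the concrete freshness goal
WAREHOUSE = etl_wh
AS
SELECT month, SUM(revenue) AS total_revenue
FROM silver.customer_monthly_stats
GROUP BY month;
Intermediate tables in a chain can instead declare TARGET_LAG = 'DOWNSTREAM', so they refresh only as often as their consumers require.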
Snowpark — Python in Snowflake
Snowpark brings Python, Java, and Scala directly to Snowflake — your code runs in Snowflake's compute, close to the data:
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_, avg, count, lit, when
from snowflake.snowpark.types import IntegerType, FloatType
connection_params = {
"account": "my-org.snowflakecomputing.com",
"user": "my_service_account",
"authenticator": "externalbrowser", # SSO in interactive sessions
# for service accounts: "private_key_file": "/path/to/key.p8"
"role": "ANALYST_ROLE",
"warehouse": "analytics_wh",
"database": "analytics_prod",
"schema": "sales",
}
session = Session.builder.configs(connection_params).create()
# DataFrame API — lazy evaluation, pushes SQL to Snowflake
orders_df = session.table("orders")
result = (
orders_df
.filter(col("status") == "completed")
.group_by("customer_id")
.agg(
count("*").alias("order_count"),
sum_("amount").alias("total_spend"),
avg("amount").alias("avg_order")
)
.filter(col("total_spend") > 1000)
.sort(col("total_spend").desc())
)
result.show()
# Write result back to Snowflake
result.write.mode("overwrite").save_as_table("gold.high_value_customers")Snowpark UDFs — Custom Python Functions in Snowflake
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, FloatType
import snowflake.snowpark.functions as F
# Register a scalar UDF — runs per-row inside Snowflake
@udf(name="categorise_amount", replace=True,
return_type=StringType(), input_types=[FloatType()])
def categorise_amount(amount: float) -> str:
if amount < 50:
return "micro"
elif amount < 500:
return "standard"
elif amount < 5000:
return "large"
return "enterprise"
# Use the UDF in a DataFrame
orders_df.with_column(
"category",
F.call_udf("categorise_amount", col("amount"))
).show()
Snowpark Stored Procedures — ETL in Python
def transform_orders(session: Session, target_table: str) -> str:
raw = session.table("bronze.raw_orders")
clean = (
raw
.filter(col("order_id").is_not_null())
.drop_duplicates(["order_id"])
.with_column("amount", col("amount").cast(FloatType()))
.with_column("loaded_at", F.current_timestamp())
)
clean.write.mode("overwrite").save_as_table(target_table)
return f"Loaded {clean.count()} rows into {target_table}"
# Register as a stored procedure
session.sproc.register(
func=transform_orders,
name="sp_transform_orders",
replace=True,
is_permanent=True,
stage_location="@my_stage",
packages=["snowflake-snowpark-python"]
)
# Call it
session.call("sp_transform_orders", "silver.orders")Data Sharing — Live Data Without Copying
Snowflake's Data Sharing lets you share live data with other Snowflake accounts without copying, moving, or managing replicas; direct shares work within the same cloud region, and cross-region or cross-cloud sharing is fulfilled via replication. The consumer queries your data directly in your storage, and you retain full control.
-- Provider side: create a share
CREATE SHARE product_data_share;
GRANT USAGE ON DATABASE analytics_prod TO SHARE product_data_share;
GRANT USAGE ON SCHEMA analytics_prod.gold TO SHARE product_data_share;
GRANT SELECT ON TABLE analytics_prod.gold.product_catalog TO SHARE product_data_share;
GRANT SELECT ON VIEW analytics_prod.gold.public_pricing TO SHARE product_data_share; -- must be a SECURE view to be shareable
-- Add consumer account
ALTER SHARE product_data_share ADD ACCOUNTS = org_name.consumer_account;
-- Consumer side: create a database from the share
CREATE DATABASE partner_products FROM SHARE provider_org.product_data_share;
-- Query it as a normal database
SELECT * FROM partner_products.gold.product_catalog;
Snowflake Marketplace
The Snowflake Marketplace offers thousands of live datasets from data providers. Subscribe and query live data (financial markets, weather, demographics, location data) directly in your account, with no ETL and no copying:
-- After subscribing to a marketplace dataset, it appears as a database
SELECT *
FROM CYBERSYN__COMPANY_RELATIONSHIP_GRAPH.PUBLIC.COMPANY_INDEX
WHERE name ILIKE '%microsoft%'
LIMIT 20;
Streams and Tasks — CDC and Orchestration
Streams track DML changes (insert/update/delete) on tables for CDC patterns:
-- Create a stream on a table
CREATE STREAM orders_stream ON TABLE bronze.raw_orders;
-- The stream captures inserts, updates, deletes with metadata columns:
-- METADATA$ACTION: 'INSERT' or 'DELETE'
-- METADATA$ISUPDATE: TRUE for updates (appear as DELETE + INSERT pair)
-- METADATA$ROW_ID: stable row identifier across DML
-- Consume the stream in a task
SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT';
Tasks run SQL or stored procedures on a schedule:
-- Create a scheduled task
CREATE TASK refresh_silver_orders
WAREHOUSE = etl_wh
SCHEDULE = 'USING CRON 0 * * * * UTC' -- every hour
AS
CALL sp_transform_orders('silver.orders');
-- Task DAG: child task runs after parent
CREATE TASK refresh_gold_stats
WAREHOUSE = etl_wh
AFTER refresh_silver_orders
AS
CREATE OR REPLACE TABLE gold.monthly_stats AS
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM silver.orders
GROUP BY 1;
-- Activate tasks
ALTER TASK refresh_gold_stats RESUME;
ALTER TASK refresh_silver_orders RESUME;
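Streams and tasks pair naturally: a task can poll a stream and skip runs when there is nothing to process. A minimal sketch, assuming a hypothetical silver.order_events target table:
-- Run every 5 minutes, but only when the stream has new rows
CREATE TASK process_order_stream
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
INSERT INTO silver.order_events -- hypothetical target table
SELECT order_id, amount, METADATA$ACTION AS action
FROM orders_stream; -- consuming the stream in DML advances its offset
ALTER TASK process_order_stream RESUME;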
Row Access Policies and Column Masking
-- Row access policy: filter rows based on current role
CREATE ROW ACCESS POLICY tenant_isolation
AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
CURRENT_ROLE() IN ('ADMIN_ROLE')
OR tenant_id = CURRENT_USER(); -- demo: assumes tenant_id values are Snowflake usernames
ALTER TABLE silver.orders ADD ROW ACCESS POLICY tenant_isolation ON (tenant_id);
-- Column masking policy: mask PII for non-privileged roles
CREATE MASKING POLICY email_mask
AS (val STRING) RETURNS STRING ->
CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
ELSE REGEXP_REPLACE(val, '.+@', '*****@')
END;
ALTER TABLE silver.customers MODIFY COLUMN email
SET MASKING POLICY email_mask;
Cost Management
-- Credit usage by warehouse (last 30 days)
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 3.00 AS estimated_usd -- approximate Standard rate
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;
-- Heaviest queries by cloud-services credits (warehouse compute is billed per warehouse-second, not per query)
SELECT query_text,
total_elapsed_time / 1000 AS elapsed_sec,
credits_used_cloud_services,
bytes_scanned / 1e9 AS gb_scanned,
partitions_scanned,
partitions_total,
(partitions_scanned / NULLIF(partitions_total, 0)) * 100 AS pct_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY credits_used_cloud_services DESC NULLS LAST
LIMIT 20;
Cost Reduction Patterns
- Auto-suspend: Set warehouses to suspend after 60–300 seconds idle. A suspended warehouse costs nothing.
- Clustering keys: Reduce partitions scanned on large tables filtered by date or tenant.
- QUALIFY instead of subqueries: Avoids redundant table scans for row-number filtering.
- Result Cache: Identical queries within 24 hours are served from the result cache at zero compute cost, provided the underlying data hasn't changed.
- Search Optimization Service: For point-lookup queries on high-cardinality columns; adds maintenance cost but dramatically reduces query cost on qualifying workloads (see the sketch below).
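A minimal sketch of two of these levers, reusing the warehouse and table names from earlier sections:
-- Tighten auto-suspend so idle time stops burning credits sooner
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60;
-- Enable point-lookup acceleration on a high-cardinality column
ALTER TABLE silver.orders ADD SEARCH OPTIMIZATION ON EQUALITY(order_id);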
Snowflake vs Databricks vs BigQuery
| | Snowflake | Databricks | BigQuery |
|--|-----------|------------|----------|
| Core strength | SQL analytics, sharing | Spark + ML | SQL analytics, serverless |
| Data sharing | Native, zero-copy | Delta Sharing | Analytics Hub |
| Streaming | Snowpipe (S3/Azure trigger) | Structured Streaming | Dataflow / Pub/Sub |
| ML/AI | Snowpark ML, Cortex | Native MLflow, Feature Store | Vertex AI |
| Pricing model | Credits (compute) + storage | DBUs + storage | Slots + bytes processed |
| Language | SQL, Snowpark (Python/Scala/Java) | Python/SQL/Scala/R | SQL, BigQuery ML |
| Best for | SQL-first teams, secure sharing | Data engineering + ML | GCP-native analytics |
Related: Databricks Guide — Delta Lake and PySpark
Related: MLflow Experiment Tracking — ML tracking and registry
Related: Azure Cloud Integration — Event Hubs and Azure Data Factory