Snowflake — Data Warehousing & Snowpark · Lesson 1 of 1
Snowflake: Data Warehousing, Snowpark & Data Sharing
Snowflake is the cloud data platform built from scratch for the cloud era — separating storage and compute so you can scale each independently. It runs on AWS, Azure, and GCP, and its killer feature for enterprises is seamless, secure data sharing: share live data across accounts without copying it.
Architecture: Separation of Storage and Compute
┌──────────────────────────────────────────────────────────────┐
│ Snowflake Architecture │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Query Layer (Cloud Services) │ │
│ │ Optimizer, metadata, auth, transactions, security │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Compute Layer (Virtual Warehouses) │ │
│ │ XS / S / M / L / XL — independent clusters │ │
│ │ Auto-suspend + auto-resume — pay only when active │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Storage Layer (Micro-partitions) │ │
│ │ Columnar, compressed Parquet — S3/Azure Blob/GCS │ │
│ │ Automatic clustering, pruning, metadata indexing │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘Key Architectural Concepts
Micro-partitions: Snowflake stores data in 50–500 MB compressed micro-partitions. Each partition has column-level metadata (min, max, count, distinct values) — Snowflake prunes partitions without reading data, like a file-level index for every table.
Virtual Warehouses: Independently-sized compute clusters. Multiple warehouses can query the same data simultaneously. Each has its own credit consumption, scaling policy, and timeout.
Time Travel + Fail-Safe: Every DML operation creates a new micro-partition version. Time travel lets you query data at any point in the past (up to 90 days on Enterprise). Fail-safe adds 7 days of non-queryable recovery behind Snowflake support.
Virtual Warehouses — Compute Sizing
-- Create a warehouse with auto-suspend
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300 -- suspend after 5 min idle
AUTO_RESUME = TRUE -- resume automatically on query
MAX_CLUSTER_COUNT = 3 -- multi-cluster: scale out under concurrency
MIN_CLUSTER_COUNT = 1
SCALING_POLICY = 'STANDARD';
-- Switch warehouse in session
USE WAREHOUSE analytics_wh;
-- Monitor warehouse credit usage
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
COUNT(*) AS query_count,
AVG(execution_time)/1000 AS avg_sec
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;Warehouse Sizing Guide
| Size | vCores | Use case | |------|--------|----------| | XS | 1 | Development, light queries | | S | 2 | Small ETL, dashboards | | M | 4 | Standard analytics workloads | | L | 8 | Heavy transforms, large joins | | XL | 16 | Very large joins, ML feature prep | | 2XL–6XL | 32–512 | Bulk loading, data science |
Rule of thumb: double the size = roughly half the runtime = same credit cost. Choose based on query latency requirements, not cost alone.
Database Objects: Hierarchy
-- Three-level namespace: database.schema.object
CREATE DATABASE analytics_prod;
CREATE SCHEMA analytics_prod.sales;
-- Table
CREATE OR REPLACE TABLE analytics_prod.sales.orders (
order_id VARCHAR(36) NOT NULL,
customer_id VARCHAR(36) NOT NULL,
amount NUMBER(12, 2) NOT NULL,
status VARCHAR(20) DEFAULT 'pending',
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
CONSTRAINT pk_orders PRIMARY KEY (order_id)
);
-- Clustering key: improves pruning on large tables queried by a specific column
ALTER TABLE analytics_prod.sales.orders
CLUSTER BY (DATE_TRUNC('month', created_at));Semi-Structured Data: VARIANT
Snowflake's VARIANT column stores JSON, XML, or Avro natively without schema definition:
-- VARIANT column for flexible JSON
CREATE TABLE events (
event_id VARCHAR(36),
event_type VARCHAR(100),
payload VARIANT, -- stores raw JSON
received_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Insert JSON
INSERT INTO events (event_id, event_type, payload)
SELECT UUID_STRING(),
'order_placed',
PARSE_JSON('{
"order_id": "ORD-001",
"customer": {"id": "C-123", "name": "Alice"},
"items": [{"sku": "SKU-1", "qty": 2, "price": 29.99}]
}');
-- Query JSON with : and [] notation
SELECT
payload:order_id::VARCHAR AS order_id,
payload:customer.name::VARCHAR AS customer_name,
payload:items[0].price::FLOAT AS first_item_price,
ARRAY_SIZE(payload:items) AS item_count
FROM events
WHERE event_type = 'order_placed';
-- FLATTEN: expand array elements into rows
SELECT
e.event_id,
item.value:sku::VARCHAR AS sku,
item.value:qty::INT AS quantity,
item.value:price::FLOAT AS unit_price
FROM events e,
LATERAL FLATTEN(INPUT => e.payload:items) item;SQL Analytics — Window Functions and Advanced Queries
Snowflake supports all standard SQL analytics functions:
-- Running total and rank within group
SELECT
customer_id,
order_id,
amount,
created_at,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY created_at
ROWS UNBOUNDED PRECEDING
) AS lifetime_spend,
RANK() OVER (
PARTITION BY DATE_TRUNC('month', created_at)
ORDER BY amount DESC
) AS rank_this_month,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY created_at
) AS prev_order_amount
FROM orders
WHERE status = 'completed';
-- QUALIFY: filter on window function result (Snowflake extension)
SELECT customer_id, order_id, amount, created_at
FROM orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY created_at DESC
) = 1; -- latest order per customer — no subquery neededMERGE for Upserts
MERGE INTO analytics_prod.silver.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
CURRENT_TIMESTAMP(), source.updated_at);Time Travel
-- Query data at a specific point in the past
SELECT * FROM orders AT (TIMESTAMP => '2026-04-10 09:00:00'::TIMESTAMP_NTZ);
-- Query data before a specific DML statement
SELECT * FROM orders BEFORE (STATEMENT => '01b3d5a8-0000-1234-abcd-1234567890ab');
-- Restore a table to a previous state
CREATE OR REPLACE TABLE orders AS
SELECT * FROM orders AT (TIMESTAMP => '2026-04-09 00:00:00'::TIMESTAMP_NTZ);
-- Clone table to a historical point (zero-copy — no storage until writes diverge)
CREATE TABLE orders_backup CLONE orders
AT (TIMESTAMP => '2026-04-15 18:00:00'::TIMESTAMP_NTZ);
-- View table history and associated storage
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS(
TABLE_NAME => 'ORDERS',
TABLE_SCHEMA => 'SALES'
));Zero-Copy Cloning
Cloning creates an instant, independent copy of any database object (table, schema, database) pointing to the same storage. You only pay for new writes after the clone diverges — useful for dev/test environments, point-in-time snapshots, and safe data experimentation.
-- Clone entire database for testing
CREATE DATABASE analytics_staging CLONE analytics_prod;
-- Clone at a point in time
CREATE TABLE orders_march CLONE orders
AT (TIMESTAMP => '2026-03-31 23:59:59'::TIMESTAMP_NTZ);Dynamic Tables — Declarative Pipelines
Dynamic Tables are Snowflake's answer to materialized views with automatic, incremental refresh:
-- Define a dynamic table — Snowflake manages refresh automatically
CREATE OR REPLACE DYNAMIC TABLE silver.customer_monthly_stats
TARGET_LAG = '1 hour' -- max data freshness
WAREHOUSE = etl_wh
AS
SELECT
customer_id,
DATE_TRUNC('month', created_at) AS month,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
AVG(amount) AS avg_order_value
FROM bronze.orders
WHERE status = 'completed'
GROUP BY 1, 2;
-- Check refresh history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
NAME => 'silver.customer_monthly_stats'
));Dynamic tables replace manual TASK-based refresh pipelines. Chain them to build a declarative Bronze → Silver → Gold Medallion Architecture.
Snowpark — Python in Snowflake
Snowpark brings Python, Java, and Scala directly to Snowflake — your code runs in Snowflake's compute, close to the data:
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_, avg, count, lit, when
from snowflake.snowpark.types import IntegerType, FloatType
connection_params = {
"account": "my-org.snowflakecomputing.com",
"user": "my_service_account",
"authenticator": "externalbrowser", # SSO in interactive sessions
# for service accounts: "private_key_file": "/path/to/key.p8"
"role": "ANALYST_ROLE",
"warehouse": "analytics_wh",
"database": "analytics_prod",
"schema": "sales",
}
session = Session.builder.configs(connection_params).create()
# DataFrame API — lazy evaluation, pushes SQL to Snowflake
orders_df = session.table("orders")
result = (
orders_df
.filter(col("status") == "completed")
.group_by("customer_id")
.agg(
count("*").alias("order_count"),
sum_("amount").alias("total_spend"),
avg("amount").alias("avg_order")
)
.filter(col("total_spend") > 1000)
.sort(col("total_spend").desc())
)
result.show()
# Write result back to Snowflake
result.write.mode("overwrite").save_as_table("gold.high_value_customers")Snowpark UDFs — Custom Python Functions in Snowflake
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, FloatType
import snowflake.snowpark.functions as F
# Register a scalar UDF — runs per-row inside Snowflake
@udf(name="categorise_amount", replace=True,
return_type=StringType(), input_types=[FloatType()])
def categorise_amount(amount: float) -> str:
if amount < 50:
return "micro"
elif amount < 500:
return "standard"
elif amount < 5000:
return "large"
return "enterprise"
# Use the UDF in a DataFrame
orders_df.with_column(
"category",
F.call_udf("categorise_amount", col("amount"))
).show()Snowpark Stored Procedures — ETL in Python
def transform_orders(session: Session, target_table: str) -> str:
raw = session.table("bronze.raw_orders")
clean = (
raw
.filter(col("order_id").is_not_null())
.drop_duplicates(["order_id"])
.with_column("amount", col("amount").cast(FloatType()))
.with_column("loaded_at", F.current_timestamp())
)
clean.write.mode("overwrite").save_as_table(target_table)
return f"Loaded {clean.count()} rows into {target_table}"
# Register as a stored procedure
session.sproc.register(
func=transform_orders,
name="sp_transform_orders",
replace=True,
is_permanent=True,
stage_location="@my_stage",
packages=["snowflake-snowpark-python"]
)
# Call it
session.call("sp_transform_orders", "silver.orders")Data Sharing — Live Data Without Copying
Snowflake's Data Sharing lets you share live data with other Snowflake accounts (same cloud region or cross-region) without copying, moving, or managing replicas. The consumer queries your data directly in your storage — you maintain full control.
-- Provider side: create a share
CREATE SHARE product_data_share;
GRANT USAGE ON DATABASE analytics_prod TO SHARE product_data_share;
GRANT USAGE ON SCHEMA analytics_prod.gold TO SHARE product_data_share;
GRANT SELECT ON TABLE analytics_prod.gold.product_catalog TO SHARE product_data_share;
GRANT SELECT ON VIEW analytics_prod.gold.public_pricing TO SHARE product_data_share;
-- Add consumer account
ALTER SHARE product_data_share ADD ACCOUNTS = org_name.consumer_account;
-- Consumer side: create a database from the share
CREATE DATABASE partner_products FROM SHARE provider_org.product_data_share;
-- Query it as a normal database
SELECT * FROM partner_products.gold.product_catalog;Snowflake Marketplace
The Snowflake Data Marketplace has 2,500+ live datasets from data providers. Subscribe and query live data (financial markets, weather, demographics, location data) directly in your account — no ETL, no copying:
-- After subscribing to a marketplace dataset, it appears as a database
SELECT *
FROM CYBERSYN__COMPANY_RELATIONSHIP_GRAPH.PUBLIC.COMPANY_INDEX
WHERE name ILIKE '%microsoft%'
LIMIT 20;Streams and Tasks — CDC and Orchestration
Streams track DML changes (insert/update/delete) on tables for CDC patterns:
-- Create a stream on a table
CREATE STREAM orders_stream ON TABLE bronze.raw_orders;
-- The stream captures inserts, updates, deletes with metadata columns:
-- METADATA$ACTION: 'INSERT' or 'DELETE'
-- METADATA$ISUPDATE: TRUE for updates (appear as DELETE + INSERT pair)
-- METADATA$ROW_ID: stable row identifier across DML
-- Consume the stream in a task
SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT';Tasks schedule SQL or Stored Procedure execution:
-- Create a scheduled task
CREATE TASK refresh_silver_orders
WAREHOUSE = etl_wh
SCHEDULE = 'USING CRON 0 * * * * UTC' -- every hour
AS
CALL sp_transform_orders('silver.orders');
-- Task DAG: child task runs after parent
CREATE TASK refresh_gold_stats
WAREHOUSE = etl_wh
AFTER refresh_silver_orders
AS
CREATE OR REPLACE TABLE gold.monthly_stats AS
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM silver.orders
GROUP BY 1;
-- Activate tasks
ALTER TASK refresh_gold_stats RESUME;
ALTER TASK refresh_silver_orders RESUME;Row Access Policies and Column Masking
-- Row access policy: filter rows based on current role
CREATE ROW ACCESS POLICY tenant_isolation
AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
CURRENT_ROLE() IN ('ADMIN_ROLE')
OR tenant_id = CURRENT_USER();
ALTER TABLE silver.orders ADD ROW ACCESS POLICY tenant_isolation ON (tenant_id);
-- Column masking policy: mask PII for non-privileged roles
CREATE MASKING POLICY email_mask
AS (val STRING) RETURNS STRING ->
CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
ELSE REGEXP_REPLACE(val, '.+@', '*****@')
END;
ALTER TABLE silver.customers MODIFY COLUMN email
SET MASKING POLICY email_mask;Cost Management
-- Credit usage by warehouse (last 30 days)
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 3.00 AS estimated_usd -- approximate Standard rate
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;
-- Most expensive queries
SELECT query_text,
total_elapsed_time / 1000 AS elapsed_sec,
credits_used_cloud_services,
bytes_scanned / 1e9 AS gb_scanned,
partitions_scanned,
partitions_total,
(partitions_scanned / NULLIF(partitions_total, 0)) * 100 AS pct_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY credits_used_cloud_services DESC NULLS LAST
LIMIT 20;Cost Reduction Patterns
- Auto-suspend: Set warehouses to suspend after 60–300 seconds idle. A suspended warehouse costs nothing.
- Clustering keys: Reduce partitions scanned on large tables filtered by date or tenant.
- QUALIFY instead of subqueries: Avoids redundant table scans for row-number filtering.
- Result Cache: Identical queries within 24 hours serve from the result cache — zero compute cost.
- Search Optimization Service: For point-lookup queries on high-cardinality columns — adds maintenance cost but dramatically reduces query cost on qualifying workloads.
Snowflake vs Databricks vs BigQuery
| | Snowflake | Databricks | BigQuery | |--|-----------|------------|---------| | Core strength | SQL analytics, sharing | Spark + ML | SQL analytics, serverless | | Data sharing | Native, zero-copy | Delta Sharing | Analytics Hub | | Streaming | Snowpipe (S3/Azure trigger) | Structured Streaming | Dataflow / Pub/Sub | | ML/AI | Snowpark ML, Cortex | Native MLflow, Feature Store | Vertex AI | | Pricing model | Credits (compute) + storage | DBUs + storage | Slots + bytes processed | | Language | SQL, Snowpark (Python/Scala/Java) | Python/SQL/Scala/R | SQL, BigQuery ML | | Best for | SQL-first teams, secure sharing | Data engineering + ML | GCP-native analytics |
Related: Databricks Guide — Delta Lake and PySpark
Related: MLflow Experiment Tracking — ML tracking and registry
Related: Azure Cloud Integration — Event Hubs and Azure Data Factory