Learnixo

Snowflake — Data Warehousing & Snowpark · Lesson 1 of 1

Snowflake: Data Warehousing, Snowpark & Data Sharing

Snowflake is the cloud data platform built from scratch for the cloud era — separating storage and compute so you can scale each independently. It runs on AWS, Azure, and GCP, and its killer feature for enterprises is seamless, secure data sharing: share live data across accounts without copying it.


Architecture: Separation of Storage and Compute

┌──────────────────────────────────────────────────────────────┐
│                     Snowflake Architecture                    │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Query Layer (Cloud Services)               │  │
│  │  Optimizer, metadata, auth, transactions, security      │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Compute Layer (Virtual Warehouses)         │  │
│  │  XS / S / M / L / XL — independent clusters           │  │
│  │  Auto-suspend + auto-resume — pay only when active     │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Storage Layer (Micro-partitions)           │  │
│  │  Columnar, compressed Parquet — S3/Azure Blob/GCS      │  │
│  │  Automatic clustering, pruning, metadata indexing      │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Key Architectural Concepts

Micro-partitions: Snowflake stores data in 50–500 MB compressed micro-partitions. Each partition has column-level metadata (min, max, count, distinct values) — Snowflake prunes partitions without reading data, like a file-level index for every table.

Virtual Warehouses: Independently-sized compute clusters. Multiple warehouses can query the same data simultaneously. Each has its own credit consumption, scaling policy, and timeout.

Time Travel + Fail-Safe: Every DML operation creates a new micro-partition version. Time travel lets you query data at any point in the past (up to 90 days on Enterprise). Fail-safe adds 7 days of non-queryable recovery behind Snowflake support.


Virtual Warehouses — Compute Sizing

SQL
-- Create a warehouse with auto-suspend
CREATE WAREHOUSE analytics_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 300        -- suspend after 5 min idle
    AUTO_RESUME = TRUE        -- resume automatically on query
    MAX_CLUSTER_COUNT = 3     -- multi-cluster: scale out under concurrency
    MIN_CLUSTER_COUNT = 1
    SCALING_POLICY = 'STANDARD';

-- Switch warehouse in session
USE WAREHOUSE analytics_wh;

-- Monitor warehouse credit usage
SELECT warehouse_name,
       SUM(credits_used) AS total_credits,
       COUNT(*)           AS query_count,
       AVG(execution_time)/1000 AS avg_sec
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;

Warehouse Sizing Guide

| Size | vCores | Use case | |------|--------|----------| | XS | 1 | Development, light queries | | S | 2 | Small ETL, dashboards | | M | 4 | Standard analytics workloads | | L | 8 | Heavy transforms, large joins | | XL | 16 | Very large joins, ML feature prep | | 2XL–6XL | 32–512 | Bulk loading, data science |

Rule of thumb: double the size = roughly half the runtime = same credit cost. Choose based on query latency requirements, not cost alone.


Database Objects: Hierarchy

SQL
-- Three-level namespace: database.schema.object
CREATE DATABASE analytics_prod;
CREATE SCHEMA analytics_prod.sales;

-- Table
CREATE OR REPLACE TABLE analytics_prod.sales.orders (
    order_id    VARCHAR(36)    NOT NULL,
    customer_id VARCHAR(36)    NOT NULL,
    amount      NUMBER(12, 2)  NOT NULL,
    status      VARCHAR(20)    DEFAULT 'pending',
    created_at  TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP(),
    CONSTRAINT pk_orders PRIMARY KEY (order_id)
);

-- Clustering key: improves pruning on large tables queried by a specific column
ALTER TABLE analytics_prod.sales.orders
    CLUSTER BY (DATE_TRUNC('month', created_at));

Semi-Structured Data: VARIANT

Snowflake's VARIANT column stores JSON, XML, or Avro natively without schema definition:

SQL
-- VARIANT column for flexible JSON
CREATE TABLE events (
    event_id   VARCHAR(36),
    event_type VARCHAR(100),
    payload    VARIANT,        -- stores raw JSON
    received_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Insert JSON
INSERT INTO events (event_id, event_type, payload)
SELECT UUID_STRING(),
       'order_placed',
       PARSE_JSON('{
           "order_id": "ORD-001",
           "customer": {"id": "C-123", "name": "Alice"},
           "items": [{"sku": "SKU-1", "qty": 2, "price": 29.99}]
       }');

-- Query JSON with : and [] notation
SELECT
    payload:order_id::VARCHAR              AS order_id,
    payload:customer.name::VARCHAR         AS customer_name,
    payload:items[0].price::FLOAT          AS first_item_price,
    ARRAY_SIZE(payload:items)              AS item_count
FROM events
WHERE event_type = 'order_placed';

-- FLATTEN: expand array elements into rows
SELECT
    e.event_id,
    item.value:sku::VARCHAR    AS sku,
    item.value:qty::INT        AS quantity,
    item.value:price::FLOAT    AS unit_price
FROM events e,
LATERAL FLATTEN(INPUT => e.payload:items) item;

SQL Analytics — Window Functions and Advanced Queries

Snowflake supports all standard SQL analytics functions:

SQL
-- Running total and rank within group
SELECT
    customer_id,
    order_id,
    amount,
    created_at,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS UNBOUNDED PRECEDING
    ) AS lifetime_spend,
    RANK() OVER (
        PARTITION BY DATE_TRUNC('month', created_at)
        ORDER BY amount DESC
    ) AS rank_this_month,
    LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
    ) AS prev_order_amount
FROM orders
WHERE status = 'completed';

-- QUALIFY: filter on window function result (Snowflake extension)
SELECT customer_id, order_id, amount, created_at
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY created_at DESC
) = 1;   -- latest order per customer  no subquery needed

MERGE for Upserts

SQL
MERGE INTO analytics_prod.silver.customers AS target
USING staging.customer_updates AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
    UPDATE SET
        name       = source.name,
        email      = source.email,
        updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, created_at, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            CURRENT_TIMESTAMP(), source.updated_at);

Time Travel

SQL
-- Query data at a specific point in the past
SELECT * FROM orders AT (TIMESTAMP => '2026-04-10 09:00:00'::TIMESTAMP_NTZ);

-- Query data before a specific DML statement
SELECT * FROM orders BEFORE (STATEMENT => '01b3d5a8-0000-1234-abcd-1234567890ab');

-- Restore a table to a previous state
CREATE OR REPLACE TABLE orders AS
    SELECT * FROM orders AT (TIMESTAMP => '2026-04-09 00:00:00'::TIMESTAMP_NTZ);

-- Clone table to a historical point (zero-copy  no storage until writes diverge)
CREATE TABLE orders_backup CLONE orders
    AT (TIMESTAMP => '2026-04-15 18:00:00'::TIMESTAMP_NTZ);

-- View table history and associated storage
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS(
    TABLE_NAME => 'ORDERS',
    TABLE_SCHEMA => 'SALES'
));

Zero-Copy Cloning

Cloning creates an instant, independent copy of any database object (table, schema, database) pointing to the same storage. You only pay for new writes after the clone diverges — useful for dev/test environments, point-in-time snapshots, and safe data experimentation.

SQL
-- Clone entire database for testing
CREATE DATABASE analytics_staging CLONE analytics_prod;

-- Clone at a point in time
CREATE TABLE orders_march CLONE orders
    AT (TIMESTAMP => '2026-03-31 23:59:59'::TIMESTAMP_NTZ);

Dynamic Tables — Declarative Pipelines

Dynamic Tables are Snowflake's answer to materialized views with automatic, incremental refresh:

SQL
-- Define a dynamic table  Snowflake manages refresh automatically
CREATE OR REPLACE DYNAMIC TABLE silver.customer_monthly_stats
    TARGET_LAG = '1 hour'        -- max data freshness
    WAREHOUSE = etl_wh
AS
SELECT
    customer_id,
    DATE_TRUNC('month', created_at)    AS month,
    COUNT(*)                            AS order_count,
    SUM(amount)                         AS revenue,
    AVG(amount)                         AS avg_order_value
FROM bronze.orders
WHERE status = 'completed'
GROUP BY 1, 2;

-- Check refresh history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
    NAME => 'silver.customer_monthly_stats'
));

Dynamic tables replace manual TASK-based refresh pipelines. Chain them to build a declarative Bronze → Silver → Gold Medallion Architecture.


Snowpark — Python in Snowflake

Snowpark brings Python, Java, and Scala directly to Snowflake — your code runs in Snowflake's compute, close to the data:

Python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_, avg, count, lit, when
from snowflake.snowpark.types import IntegerType, FloatType

connection_params = {
    "account":   "my-org.snowflakecomputing.com",
    "user":      "my_service_account",
    "authenticator": "externalbrowser",   # SSO in interactive sessions
    # for service accounts: "private_key_file": "/path/to/key.p8"
    "role":      "ANALYST_ROLE",
    "warehouse": "analytics_wh",
    "database":  "analytics_prod",
    "schema":    "sales",
}

session = Session.builder.configs(connection_params).create()

# DataFrame API  lazy evaluation, pushes SQL to Snowflake
orders_df = session.table("orders")

result = (
    orders_df
    .filter(col("status") == "completed")
    .group_by("customer_id")
    .agg(
        count("*").alias("order_count"),
        sum_("amount").alias("total_spend"),
        avg("amount").alias("avg_order")
    )
    .filter(col("total_spend") > 1000)
    .sort(col("total_spend").desc())
)

result.show()

# Write result back to Snowflake
result.write.mode("overwrite").save_as_table("gold.high_value_customers")

Snowpark UDFs — Custom Python Functions in Snowflake

Python
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, FloatType
import snowflake.snowpark.functions as F

# Register a scalar UDF  runs per-row inside Snowflake
@udf(name="categorise_amount", replace=True,
     return_type=StringType(), input_types=[FloatType()])
def categorise_amount(amount: float) -> str:
    if amount < 50:
        return "micro"
    elif amount < 500:
        return "standard"
    elif amount < 5000:
        return "large"
    return "enterprise"

# Use the UDF in a DataFrame
orders_df.with_column(
    "category",
    F.call_udf("categorise_amount", col("amount"))
).show()

Snowpark Stored Procedures — ETL in Python

Python
def transform_orders(session: Session, target_table: str) -> str:
    raw = session.table("bronze.raw_orders")

    clean = (
        raw
        .filter(col("order_id").is_not_null())
        .drop_duplicates(["order_id"])
        .with_column("amount", col("amount").cast(FloatType()))
        .with_column("loaded_at", F.current_timestamp())
    )

    clean.write.mode("overwrite").save_as_table(target_table)
    return f"Loaded {clean.count()} rows into {target_table}"

# Register as a stored procedure
session.sproc.register(
    func=transform_orders,
    name="sp_transform_orders",
    replace=True,
    is_permanent=True,
    stage_location="@my_stage",
    packages=["snowflake-snowpark-python"]
)

# Call it
session.call("sp_transform_orders", "silver.orders")

Data Sharing — Live Data Without Copying

Snowflake's Data Sharing lets you share live data with other Snowflake accounts (same cloud region or cross-region) without copying, moving, or managing replicas. The consumer queries your data directly in your storage — you maintain full control.

SQL
-- Provider side: create a share
CREATE SHARE product_data_share;

GRANT USAGE ON DATABASE analytics_prod TO SHARE product_data_share;
GRANT USAGE ON SCHEMA analytics_prod.gold TO SHARE product_data_share;
GRANT SELECT ON TABLE analytics_prod.gold.product_catalog TO SHARE product_data_share;
GRANT SELECT ON VIEW analytics_prod.gold.public_pricing TO SHARE product_data_share;

-- Add consumer account
ALTER SHARE product_data_share ADD ACCOUNTS = org_name.consumer_account;

-- Consumer side: create a database from the share
CREATE DATABASE partner_products FROM SHARE provider_org.product_data_share;

-- Query it as a normal database
SELECT * FROM partner_products.gold.product_catalog;

Snowflake Marketplace

The Snowflake Data Marketplace has 2,500+ live datasets from data providers. Subscribe and query live data (financial markets, weather, demographics, location data) directly in your account — no ETL, no copying:

SQL
-- After subscribing to a marketplace dataset, it appears as a database
SELECT *
FROM CYBERSYN__COMPANY_RELATIONSHIP_GRAPH.PUBLIC.COMPANY_INDEX
WHERE name ILIKE '%microsoft%'
LIMIT 20;

Streams and Tasks — CDC and Orchestration

Streams track DML changes (insert/update/delete) on tables for CDC patterns:

SQL
-- Create a stream on a table
CREATE STREAM orders_stream ON TABLE bronze.raw_orders;

-- The stream captures inserts, updates, deletes with metadata columns:
-- METADATA$ACTION: 'INSERT' or 'DELETE'
-- METADATA$ISUPDATE: TRUE for updates (appear as DELETE + INSERT pair)
-- METADATA$ROW_ID: stable row identifier across DML

-- Consume the stream in a task
SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT';

Tasks schedule SQL or Stored Procedure execution:

SQL
-- Create a scheduled task
CREATE TASK refresh_silver_orders
    WAREHOUSE = etl_wh
    SCHEDULE = 'USING CRON 0 * * * * UTC'   -- every hour
AS
    CALL sp_transform_orders('silver.orders');

-- Task DAG: child task runs after parent
CREATE TASK refresh_gold_stats
    WAREHOUSE = etl_wh
    AFTER refresh_silver_orders
AS
    CREATE OR REPLACE TABLE gold.monthly_stats AS
    SELECT DATE_TRUNC('month', created_at) AS month,
           SUM(amount) AS revenue
    FROM silver.orders
    GROUP BY 1;

-- Activate tasks
ALTER TASK refresh_gold_stats RESUME;
ALTER TASK refresh_silver_orders RESUME;

Row Access Policies and Column Masking

SQL
-- Row access policy: filter rows based on current role
CREATE ROW ACCESS POLICY tenant_isolation
    AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
    CURRENT_ROLE() IN ('ADMIN_ROLE')
    OR tenant_id = CURRENT_USER();

ALTER TABLE silver.orders ADD ROW ACCESS POLICY tenant_isolation ON (tenant_id);

-- Column masking policy: mask PII for non-privileged roles
CREATE MASKING POLICY email_mask
    AS (val STRING) RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
         ELSE REGEXP_REPLACE(val, '.+@', '*****@')
    END;

ALTER TABLE silver.customers MODIFY COLUMN email
    SET MASKING POLICY email_mask;

Cost Management

SQL
-- Credit usage by warehouse (last 30 days)
SELECT warehouse_name,
       SUM(credits_used)          AS total_credits,
       SUM(credits_used) * 3.00   AS estimated_usd  -- approximate Standard rate
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY total_credits DESC;

-- Most expensive queries
SELECT query_text,
       total_elapsed_time / 1000 AS elapsed_sec,
       credits_used_cloud_services,
       bytes_scanned / 1e9        AS gb_scanned,
       partitions_scanned,
       partitions_total,
       (partitions_scanned / NULLIF(partitions_total, 0)) * 100 AS pct_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY credits_used_cloud_services DESC NULLS LAST
LIMIT 20;

Cost Reduction Patterns

  • Auto-suspend: Set warehouses to suspend after 60–300 seconds idle. A suspended warehouse costs nothing.
  • Clustering keys: Reduce partitions scanned on large tables filtered by date or tenant.
  • QUALIFY instead of subqueries: Avoids redundant table scans for row-number filtering.
  • Result Cache: Identical queries within 24 hours serve from the result cache — zero compute cost.
  • Search Optimization Service: For point-lookup queries on high-cardinality columns — adds maintenance cost but dramatically reduces query cost on qualifying workloads.

Snowflake vs Databricks vs BigQuery

| | Snowflake | Databricks | BigQuery | |--|-----------|------------|---------| | Core strength | SQL analytics, sharing | Spark + ML | SQL analytics, serverless | | Data sharing | Native, zero-copy | Delta Sharing | Analytics Hub | | Streaming | Snowpipe (S3/Azure trigger) | Structured Streaming | Dataflow / Pub/Sub | | ML/AI | Snowpark ML, Cortex | Native MLflow, Feature Store | Vertex AI | | Pricing model | Credits (compute) + storage | DBUs + storage | Slots + bytes processed | | Language | SQL, Snowpark (Python/Scala/Java) | Python/SQL/Scala/R | SQL, BigQuery ML | | Best for | SQL-first teams, secure sharing | Data engineering + ML | GCP-native analytics |


Related: Databricks Guide — Delta Lake and PySpark
Related: MLflow Experiment Tracking — ML tracking and registry
Related: Azure Cloud Integration — Event Hubs and Azure Data Factory