Cassandra & DynamoDB: Wide-Column Databases for Massive Scale
When billions of writes per day aren't optional — master Cassandra data modeling, CQL, AWS DynamoDB single-table design, and cloud-managed wide-column services on Azure, AWS, and GCP.
When You Need a Wide-Column Database
Wide-column stores are purpose-built for write-heavy, time-ordered, or massive-scale workloads where SQL and document databases hit their limits:
- IoT telemetry — millions of sensor readings per second
- Time-series data — stock prices, metrics, audit logs
- Event sourcing — immutable append-only event ledgers
- Message history — WhatsApp, Slack-style chat archives
- Recommendation tracking — user interaction events at petabyte scale
The trade-off: you design your schema around your queries, not your domain. Flexible querying across dimensions is sacrificed for extreme throughput.
Apache Cassandra
Core Concepts
Cassandra is a distributed, peer-to-peer database. There is no master node — every node is equal. Data is partitioned across nodes using consistent hashing.
Ring topology: data is spread across N nodes
Replication factor 3: each row lives on 3 different nodes
Consistency level QUORUM: 2 of 3 nodes must agree → no single point of failure
The Golden Rule: Design Tables for Queries
In SQL you normalise first, then query. In Cassandra, you start with the queries and design a table for each access pattern.
-- The query we need:
-- "Get all sensor readings for device X in the last 24 hours, newest first"
-- The table design:
CREATE TABLE IF NOT EXISTS iot.sensor_readings (
    device_id   UUID,
    recorded_at TIMESTAMP,
    temperature FLOAT,
    humidity    FLOAT,
    pressure    FLOAT,
    battery_pct SMALLINT,
    PRIMARY KEY (device_id, recorded_at) -- device_id = partition key, recorded_at = clustering key
) WITH CLUSTERING ORDER BY (recorded_at DESC)
  AND default_time_to_live = 2592000; -- auto-delete after 30 days
Partition key (device_id) — determines which node stores the row. Keep partitions below ~100MB.
Clustering key (recorded_at) — determines sort order within a partition.
-- Efficient — uses partition key + clustering key range
SELECT * FROM iot.sensor_readings
WHERE device_id = 550e8400-e29b-41d4-a716-446655440000
  AND recorded_at > '2026-04-16 00:00:00'
ORDER BY recorded_at DESC
LIMIT 1000;
Modeling for Multiple Access Patterns
Unlike SQL (one table, many queries), Cassandra needs one table per query pattern — denormalise aggressively.
-- Access pattern 1: Get order by ID
CREATE TABLE orders.by_id (
    order_id UUID PRIMARY KEY,
    customer_id UUID,
    status TEXT,
    total_cents INT,
    created_at TIMESTAMP
);

-- Access pattern 2: Get all orders for a customer (newest first)
CREATE TABLE orders.by_customer (
    customer_id UUID,
    created_at TIMESTAMP,
    order_id UUID,
    status TEXT,
    total_cents INT,
    PRIMARY KEY (customer_id, created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC);

-- Access pattern 3: Get pending orders by region
CREATE TABLE orders.by_status_region (
    region TEXT,
    status TEXT,
    created_at TIMESTAMP,
    order_id UUID,
    customer_id UUID,
    PRIMARY KEY ((region, status), created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC);
Wide Partitions (Buckets)
If a single partition can grow without bound (e.g., all events for a popular user), bucket by time:
-- Bad: single partition can grow to GB
-- PRIMARY KEY (user_id, event_time)
-- Good: partition by user + month bucket
CREATE TABLE activity.user_events (
    user_id UUID,
    month TEXT, -- '2026-04', '2026-05' etc.
    event_id TIMEUUID,
    type TEXT,
    payload TEXT,
    PRIMARY KEY ((user_id, month), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);
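One consequence of bucketing: a query spanning several months must fan out to one partition per bucket. A JavaScript sketch (helper names hypothetical) of building bucket keys and enumerating the partitions a time-range query touches:

```javascript
// Hypothetical helper: build the (user_id, month) bucket key matching
// the 'month' column in activity.user_events above.
function bucketKey(userId, date) {
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return { userId, month: `${date.getUTCFullYear()}-${m}` };
}

// A range query spanning several months reads one partition per bucket.
function bucketsInRange(userId, from, to) {
  const buckets = [];
  // Start at the first day of the month containing `from`.
  const cursor = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), 1));
  while (cursor <= to) {
    buckets.push(bucketKey(userId, cursor));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return buckets;
}

const span = bucketsInRange("u_99", new Date("2026-02-10"), new Date("2026-04-16"));
// → buckets for 2026-02, 2026-03, 2026-04
```

The application issues one SELECT per bucket (usually in parallel) and merges the results, so keep bucket granularity coarse enough that common queries touch only a few partitions.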
-- Application computes the month bucket and binds it alongside user_id:
var month = DateTime.UtcNow.ToString("yyyy-MM"); // bound as the 'month' column
Lightweight Transactions (LWT)
Cassandra supports compare-and-set (conditional writes) via Paxos — use sparingly, as an LWT round is roughly 10× slower than a normal write.
-- Conditional insert — only if not exists
INSERT INTO users (user_id, email, username)
VALUES (uuid(), 'sarah@example.com', 'sarah_k')
IF NOT EXISTS;
-- Conditional update
UPDATE user_balances
SET balance = 950
WHERE user_id = :id
IF balance = 1000; -- optimistic concurrency check
Cloud Cassandra Services
Azure Managed Instance for Apache Cassandra
az managed-cassandra cluster create \
  --cluster-name myapp-cassandra \
  --resource-group myRG \
  --location eastus \
  --delegated-management-subnet-id $SUBNET_ID \
  --initial-cassandra-admin-password $CASS_PASS \
  --cassandra-version 4.0
AWS Keyspaces (Amazon Managed Cassandra)
aws keyspaces create-keyspace --keyspace-name myapp
aws keyspaces create-table \
  --keyspace-name myapp \
  --table-name sensor_readings \
  --schema-definition '...'
# Serverless — no cluster to manage, pay per read/write unit
GCP Bigtable (HBase-compatible, not CQL)
gcloud bigtable instances create myapp-bt \
  --cluster=myapp-bt-c1 \
  --cluster-zone=us-central1-a \
  --cluster-num-nodes=3 \
  --display-name="MyApp Bigtable"
Amazon DynamoDB
DynamoDB is AWS's flagship fully serverless key-value and document database. No cluster to manage — capacity scales automatically.
Core Concepts
Table → one DynamoDB table (often one per app — single-table design)
Item → a record (like a row/document)
Partition key → required, determines shard
Sort key → optional, enables range queries within a partition
Single-Table Design
The DynamoDB way: all entity types in one table, discriminated by key patterns.
// Key schema
// PK = partition key, SK = sort key

// User entity
{
  PK: "USER#u_99",
  SK: "PROFILE",
  name: "Sarah K.",
  email: "sarah@example.com",
  plan: "pro",
  createdAt: "2026-01-15T10:00:00Z"
}

// User's orders (sorted by date)
{
  PK: "USER#u_99",
  SK: "ORDER#2026-04-16T09:00:00Z#ORD-001",
  orderId: "ORD-001",
  status: "delivered",
  totalCents: 10997
}

// Order detail (standalone lookup)
{
  PK: "ORDER#ORD-001",
  SK: "DETAIL",
  userId: "u_99",
  items: [...]
}

// Query all orders for a user — efficient!
// KeyConditionExpression: PK = "USER#u_99" AND begins_with(SK, "ORDER#")
DynamoDB Operations
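In practice the key patterns from the single-table layout above are centralized in small helper functions rather than scattered as string literals. A sketch (helper names hypothetical, key formats taken from the examples above):

```javascript
// Hypothetical key builders for the single-table layout above.
const userKey = (userId) => ({ PK: `USER#${userId}`, SK: "PROFILE" });

const orderItemKey = (userId, createdAt, orderId) => ({
  PK: `USER#${userId}`,
  SK: `ORDER#${createdAt}#${orderId}`, // ISO timestamps sort chronologically
});

const orderDetailKey = (orderId) => ({ PK: `ORDER#${orderId}`, SK: "DETAIL" });

// Prefix for "all orders of a user": begins_with(SK, ORDER_PREFIX)
const ORDER_PREFIX = "ORDER#";

const k = orderItemKey("u_99", "2026-04-16T09:00:00Z", "ORD-001");
// k = { PK: "USER#u_99", SK: "ORDER#2026-04-16T09:00:00Z#ORD-001" }
```

Centralizing the patterns keeps every reader and writer agreeing on the exact key format, which is the whole contract in single-table design.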
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb")
const { DynamoDBDocumentClient, PutCommand, QueryCommand } = require("@aws-sdk/lib-dynamodb")
const client = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "us-east-1" }))

// Put item — the document client marshals plain JS values
await client.send(new PutCommand({
  TableName: "MyApp",
  Item: {
    PK: "USER#u_99",
    SK: "PROFILE",
    name: "Sarah K.",
    plan: "pro"
  },
  ConditionExpression: "attribute_not_exists(PK)" // fail if the item already exists
}))
// Query (efficient — key-based access, no scan)
const result = await client.send(new QueryCommand({
  TableName: "MyApp",
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
  ExpressionAttributeValues: {
    ":pk": "USER#u_99",
    ":prefix": "ORDER#"
  },
  ScanIndexForward: false, // newest first
  Limit: 20
}))
Global Secondary Indexes (GSI)
// Add a GSI to query orders by status across all users
// GSI: PK2 = status, SK2 = createdAt

// Item with GSI attributes
{
  PK: "USER#u_99",
  SK: "ORDER#...",
  PK2: "STATUS#shipped",      // GSI partition key
  SK2: "2026-04-16T09:00:00Z" // GSI sort key
}

// Query: all shipped orders today
{
  TableName: "MyApp",
  IndexName: "StatusDateIndex",
  KeyConditionExpression: "PK2 = :status AND SK2 BETWEEN :start AND :end",
  ExpressionAttributeValues: {
    ":status": { S: "STATUS#shipped" },
    ":start": { S: "2026-04-16T00:00:00Z" },
    ":end": { S: "2026-04-16T23:59:59Z" }
  }
}
DynamoDB Streams + Lambda
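Stream records deliver items in DynamoDB's AttributeValue wire format (the same `{ S: ... }` maps as the low-level API), which is why handlers read fields as `record.dynamodb.NewImage.PK.S`. The SDK ships a real converter, `unmarshall` from `@aws-sdk/util-dynamodb`; the hand-rolled sketch below (common types only) shows what it does:

```javascript
// Minimal sketch of converting a DynamoDB AttributeValue map (as seen in
// record.dynamodb.NewImage) into a plain object. In production, prefer
// unmarshall() from @aws-sdk/util-dynamodb.
function fromAttributeValue(av) {
  if ("S" in av) return av.S;
  if ("N" in av) return Number(av.N); // numbers arrive as strings
  if ("BOOL" in av) return av.BOOL;
  if ("NULL" in av) return null;
  if ("L" in av) return av.L.map(fromAttributeValue);
  if ("M" in av) return unmarshallImage(av.M);
  throw new Error(`unsupported type: ${Object.keys(av)}`);
}

function unmarshallImage(image) {
  return Object.fromEntries(
    Object.entries(image).map(([k, v]) => [k, fromAttributeValue(v)])
  );
}

const item = unmarshallImage({
  PK: { S: "ORDER#ORD-001" },
  totalCents: { N: "10997" },
  items: { L: [{ M: { sku: { S: "SKU-1" }, qty: { N: "2" } } }] },
});
// item → { PK: "ORDER#ORD-001", totalCents: 10997, items: [{ sku: "SKU-1", qty: 2 }] }
```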
// Automatically trigger Lambda on every change
exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === "INSERT") {
      const newItem = record.dynamodb.NewImage
      if (newItem.PK.S.startsWith("ORDER#")) {
        await notifyFulfillmentService(newItem)
      }
    }
  }
}
Comparison: Cassandra vs DynamoDB vs Bigtable
| Feature | Cassandra | DynamoDB | Bigtable |
|---|---|---|---|
| Model | Wide-column (CQL) | Key-value + document | Wide-column (HBase) |
| Ops model | Self-managed or managed | Fully serverless | Fully managed |
| Query language | CQL (SQL-like) | API + expressions | API |
| ACID | LWT only (slow) | Transactions (limited) | No |
| Best for | Time series, IoT, chat | Serverless apps, AWS-native | Google ecosystem, analytics |
| Cloud options | Azure Managed, AWS Keyspaces | AWS only | GCP only |
| Learning curve | High (data modeling) | High (single-table design) | High |
| Cost model | Per node/hour | Per RCU/WCU or on-demand | Per node + storage |
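The "Transactions (limited)" entry for DynamoDB refers to `TransactWriteItems`, which groups up to 100 write actions into an all-or-nothing unit (exposed as `TransactWriteCommand` in `@aws-sdk/lib-dynamodb`). A hedged sketch of the request shape — table and attribute names reuse the earlier examples, and `orderCount` is a hypothetical counter attribute:

```javascript
// Sketch: an atomic "create order + bump the user's order counter".
// Both actions succeed or both fail; there is no partial state.
const params = {
  TransactItems: [
    {
      Put: {
        TableName: "MyApp",
        Item: { PK: "ORDER#ORD-002", SK: "DETAIL", status: "pending" },
        ConditionExpression: "attribute_not_exists(PK)", // no double-create
      },
    },
    {
      Update: {
        TableName: "MyApp",
        Key: { PK: "USER#u_99", SK: "PROFILE" },
        UpdateExpression: "SET orderCount = orderCount + :one",
        ExpressionAttributeValues: { ":one": 1 },
      },
    },
  ],
};
// await client.send(new TransactWriteCommand(params)) — all-or-nothing
```

Transactions cost double the write units of plain writes and cannot span regions, so they are a targeted tool, not a default.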
Key Takeaways
- Cassandra and DynamoDB are query-first databases — design every table around a specific query.
- Wide partitions are dangerous — always bucket time-series data to prevent hot partitions.
- DynamoDB single-table design takes time to master, but co-locating related items under one partition key largely eliminates N+1 query patterns.
- AWS Keyspaces is the easiest Cassandra entry point — no cluster to manage.
- Bigtable dominates in the Google Cloud analytics world — used internally by Gmail, Google Maps, and YouTube.
- Choose wide-column databases only when you have proven write/scale requirements — the operational complexity and limited query flexibility are a real cost.
Enjoyed this article?
Explore the Data Engineering learning path for more.