Cassandra & DynamoDB: Wide-Column Databases for Massive Scale
When billions of writes per day aren't optional — master Cassandra data modeling, CQL, AWS DynamoDB single-table design, and cloud-managed wide-column services on Azure, AWS, and GCP.
When You Need a Wide-Column Database
Wide-column stores are purpose-built for write-heavy, time-ordered, or massive-scale workloads where SQL and document databases hit their limits:
- IoT telemetry — millions of sensor readings per second
- Time-series data — stock prices, metrics, audit logs
- Event sourcing — immutable append-only event ledgers
- Message history — WhatsApp, Slack-style chat archives
- Recommendation tracking — user interaction events at petabyte scale
The trade-off: you design your schema around your queries, not your domain. Flexible querying across dimensions is sacrificed for extreme throughput.
Apache Cassandra
Core Concepts
Cassandra is a distributed, peer-to-peer database. There is no master node — every node is equal. Data is partitioned across nodes using consistent hashing.
Ring topology: data is spread across N nodes
Replication factor 3: each row lives on 3 different nodes
Consistency level QUORUM: 2 of 3 nodes must agree → no single point of failure
The Golden Rule: Design Tables for Queries
In SQL you normalise first, then query. In Cassandra, you start with the queries and design a table for each access pattern.
-- The query we need:
-- "Get all sensor readings for device X in the last 24 hours, newest first"
-- The table design:
CREATE TABLE IF NOT EXISTS iot.sensor_readings (
    device_id   UUID,
    recorded_at TIMESTAMP,
    temperature FLOAT,
    humidity    FLOAT,
    pressure    FLOAT,
    battery_pct SMALLINT,
    PRIMARY KEY (device_id, recorded_at) -- device_id = partition key, recorded_at = clustering key
) WITH CLUSTERING ORDER BY (recorded_at DESC)
  AND default_time_to_live = 2592000; -- auto-delete after 30 days
Partition key (device_id) — determines which node stores the row. Keep partitions below ~100MB.
Clustering key (recorded_at) — determines sort order within a partition.
-- Efficient — uses partition key + clustering key range
SELECT * FROM iot.sensor_readings
WHERE device_id = 550e8400-e29b-41d4-a716-446655440000
  AND recorded_at > '2026-04-16 00:00:00'
ORDER BY recorded_at DESC
LIMIT 1000;
Modeling for Multiple Access Patterns
Unlike SQL (one table, many queries), Cassandra needs one table per query pattern — denormalise aggressively.
-- Access pattern 1: Get order by ID
CREATE TABLE orders.by_id (
    order_id UUID PRIMARY KEY,
    customer_id UUID,
    status TEXT,
    total_cents INT,
    created_at TIMESTAMP
);

-- Access pattern 2: Get all orders for a customer (newest first)
CREATE TABLE orders.by_customer (
    customer_id UUID,
    created_at TIMESTAMP,
    order_id UUID,
    status TEXT,
    total_cents INT,
    PRIMARY KEY (customer_id, created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC);

-- Access pattern 3: Get pending orders by region
CREATE TABLE orders.by_status_region (
    region TEXT,
    status TEXT,
    created_at TIMESTAMP,
    order_id UUID,
    customer_id UUID,
    PRIMARY KEY ((region, status), created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC);
Wide Partitions (Buckets)
If a single partition can grow without bound (e.g., all events for a popular user), bucket by time:
-- Bad: single partition can grow to GB
-- PRIMARY KEY (user_id, event_time)
-- Good: partition by user + month bucket
CREATE TABLE activity.user_events (
    user_id UUID,
    month TEXT, -- '2026-04', '2026-05' etc.
    event_id TIMEUUID,
    type TEXT,
    payload TEXT,
    PRIMARY KEY ((user_id, month), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);
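One consequence of bucketing: a query spanning several months must fan out to one partition per bucket. A JavaScript sketch (helper names hypothetical) of building bucket keys and enumerating the partitions a time-range query touches:

```javascript
// Hypothetical helper: build the (user_id, month) bucket key matching
// the 'month' column in activity.user_events above.
function bucketKey(userId, date) {
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return { userId, month: `${date.getUTCFullYear()}-${m}` };
}

// A range query spanning several months reads one partition per bucket.
function bucketsInRange(userId, from, to) {
  const buckets = [];
  // Start at the first day of the month containing `from`.
  const cursor = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), 1));
  while (cursor <= to) {
    buckets.push(bucketKey(userId, cursor));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return buckets;
}

const span = bucketsInRange("u_99", new Date("2026-02-10"), new Date("2026-04-16"));
// → buckets for 2026-02, 2026-03, 2026-04
```

The application issues one SELECT per bucket (usually in parallel) and merges the results, so keep bucket granularity coarse enough that common queries touch only a few partitions.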
-- Application computes the month bucket and binds it alongside user_id:
var month = DateTime.UtcNow.ToString("yyyy-MM"); // bound as the 'month' column
Lightweight Transactions (LWT)
Cassandra supports compare-and-set (conditional writes) via Paxos — use sparingly, as an LWT round is roughly 10× slower than a normal write.
-- Conditional insert — only if not exists
INSERT INTO users (user_id, email, username)
VALUES (uuid(), 'sarah@example.com', 'sarah_k')
IF NOT EXISTS;
-- Conditional update
UPDATE user_balances
SET balance = 950
WHERE user_id = :id
IF balance = 1000; -- optimistic concurrency check
Cloud Cassandra Services
Azure Managed Instance for Apache Cassandra
az managed-cassandra cluster create \
  --cluster-name myapp-cassandra \
  --resource-group myRG \
  --location eastus \
  --delegated-management-subnet-id $SUBNET_ID \
  --initial-cassandra-admin-password $CASS_PASS \
  --cassandra-version 4.0
AWS Keyspaces (Amazon Managed Cassandra)
aws keyspaces create-keyspace --keyspace-name myapp
aws keyspaces create-table \
  --keyspace-name myapp \
  --table-name sensor_readings \
  --schema-definition '...'
# Serverless — no cluster to manage, pay per read/write unit
GCP Bigtable (HBase-compatible, not CQL)
gcloud bigtable instances create myapp-bt \
  --cluster=myapp-bt-c1 \
  --cluster-zone=us-central1-a \
  --cluster-num-nodes=3 \
  --display-name="MyApp Bigtable"
Amazon DynamoDB
DynamoDB is AWS's flagship fully serverless key-value and document database. No cluster to manage — capacity scales automatically.
Core Concepts
Table → one DynamoDB table (often one per app — single-table design)
Item → a record (like a row/document)
Partition key → required, determines shard
Sort key → optional, enables range queries within a partition
Single-Table Design
The DynamoDB way: all entity types in one table, discriminated by key patterns.
// Key schema
// PK = partition key, SK = sort key

// User entity
{
  PK: "USER#u_99",
  SK: "PROFILE",
  name: "Sarah K.",
  email: "sarah@example.com",
  plan: "pro",
  createdAt: "2026-01-15T10:00:00Z"
}

// User's orders (sorted by date)
{
  PK: "USER#u_99",
  SK: "ORDER#2026-04-16T09:00:00Z#ORD-001",
  orderId: "ORD-001",
  status: "delivered",
  totalCents: 10997
}

// Order detail (standalone lookup)
{
  PK: "ORDER#ORD-001",
  SK: "DETAIL",
  userId: "u_99",
  items: [...]
}

// Query all orders for a user — efficient!
// KeyConditionExpression: PK = "USER#u_99" AND begins_with(SK, "ORDER#")
DynamoDB Operations
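In practice the key patterns from the single-table layout above are centralized in small helper functions rather than scattered as string literals. A sketch (helper names hypothetical, key formats taken from the examples above):

```javascript
// Hypothetical key builders for the single-table layout above.
const userKey = (userId) => ({ PK: `USER#${userId}`, SK: "PROFILE" });

const orderItemKey = (userId, createdAt, orderId) => ({
  PK: `USER#${userId}`,
  SK: `ORDER#${createdAt}#${orderId}`, // ISO timestamps sort chronologically
});

const orderDetailKey = (orderId) => ({ PK: `ORDER#${orderId}`, SK: "DETAIL" });

// Prefix for "all orders of a user": begins_with(SK, ORDER_PREFIX)
const ORDER_PREFIX = "ORDER#";

const k = orderItemKey("u_99", "2026-04-16T09:00:00Z", "ORD-001");
// k = { PK: "USER#u_99", SK: "ORDER#2026-04-16T09:00:00Z#ORD-001" }
```

Centralizing the patterns keeps every reader and writer agreeing on the exact key format, which is the whole contract in single-table design.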
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb")
const { DynamoDBDocumentClient, PutCommand, QueryCommand } = require("@aws-sdk/lib-dynamodb")
const client = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "us-east-1" }))

// Put item — the document client marshals plain JS values
await client.send(new PutCommand({
  TableName: "MyApp",
  Item: {
    PK: "USER#u_99",
    SK: "PROFILE",
    name: "Sarah K.",
    plan: "pro"
  },
  ConditionExpression: "attribute_not_exists(PK)" // fail if the item already exists
}))
// Query (efficient — key-based access, no scan)
const result = await client.send(new QueryCommand({
  TableName: "MyApp",
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
  ExpressionAttributeValues: {
    ":pk": "USER#u_99",
    ":prefix": "ORDER#"
  },
  ScanIndexForward: false, // newest first
  Limit: 20
}))
Global Secondary Indexes (GSI)
// Add a GSI to query orders by status across all users
// GSI: PK2 = status, SK2 = createdAt

// Item with GSI attributes
{
  PK: "USER#u_99",
  SK: "ORDER#...",
  PK2: "STATUS#shipped",      // GSI partition key
  SK2: "2026-04-16T09:00:00Z" // GSI sort key
}

// Query: all shipped orders today
{
  TableName: "MyApp",
  IndexName: "StatusDateIndex",
  KeyConditionExpression: "PK2 = :status AND SK2 BETWEEN :start AND :end",
  ExpressionAttributeValues: {
    ":status": { S: "STATUS#shipped" },
    ":start": { S: "2026-04-16T00:00:00Z" },
    ":end": { S: "2026-04-16T23:59:59Z" }
  }
}
DynamoDB Streams + Lambda
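Stream records deliver items in DynamoDB's AttributeValue wire format (the same `{ S: ... }` maps as the low-level API), which is why handlers read fields as `record.dynamodb.NewImage.PK.S`. The SDK ships a real converter, `unmarshall` from `@aws-sdk/util-dynamodb`; the hand-rolled sketch below (common types only) shows what it does:

```javascript
// Minimal sketch of converting a DynamoDB AttributeValue map (as seen in
// record.dynamodb.NewImage) into a plain object. In production, prefer
// unmarshall() from @aws-sdk/util-dynamodb.
function fromAttributeValue(av) {
  if ("S" in av) return av.S;
  if ("N" in av) return Number(av.N); // numbers arrive as strings
  if ("BOOL" in av) return av.BOOL;
  if ("NULL" in av) return null;
  if ("L" in av) return av.L.map(fromAttributeValue);
  if ("M" in av) return unmarshallImage(av.M);
  throw new Error(`unsupported type: ${Object.keys(av)}`);
}

function unmarshallImage(image) {
  return Object.fromEntries(
    Object.entries(image).map(([k, v]) => [k, fromAttributeValue(v)])
  );
}

const item = unmarshallImage({
  PK: { S: "ORDER#ORD-001" },
  totalCents: { N: "10997" },
  items: { L: [{ M: { sku: { S: "SKU-1" }, qty: { N: "2" } } }] },
});
// item → { PK: "ORDER#ORD-001", totalCents: 10997, items: [{ sku: "SKU-1", qty: 2 }] }
```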
// Automatically trigger Lambda on every change
exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === "INSERT") {
      const newItem = record.dynamodb.NewImage
      if (newItem.PK.S.startsWith("ORDER#")) {
        await notifyFulfillmentService(newItem)
      }
    }
  }
}
Comparison: Cassandra vs DynamoDB vs Bigtable
| Feature | Cassandra | DynamoDB | Bigtable |
|---|---|---|---|
| Model | Wide-column (CQL) | Key-value + document | Wide-column (HBase) |
| Ops model | Self-managed or managed | Fully serverless | Fully managed |
| Query language | CQL (SQL-like) | API + expressions | API |
| ACID | LWT only (slow) | Transactions (limited) | No |
| Best for | Time series, IoT, chat | Serverless apps, AWS-native | Google ecosystem, analytics |
| Cloud options | Azure Managed, AWS Keyspaces | AWS only | GCP only |
| Learning curve | High (data modeling) | High (single-table design) | High |
| Cost model | Per node/hour | Per RCU/WCU or on-demand | Per node + storage |
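The "Transactions (limited)" entry for DynamoDB refers to `TransactWriteItems`, which groups up to 100 write actions into an all-or-nothing unit (exposed as `TransactWriteCommand` in `@aws-sdk/lib-dynamodb`). A hedged sketch of the request shape — table and attribute names reuse the earlier examples, and `orderCount` is a hypothetical counter attribute:

```javascript
// Sketch: an atomic "create order + bump the user's order counter".
// Both actions succeed or both fail; there is no partial state.
const params = {
  TransactItems: [
    {
      Put: {
        TableName: "MyApp",
        Item: { PK: "ORDER#ORD-002", SK: "DETAIL", status: "pending" },
        ConditionExpression: "attribute_not_exists(PK)", // no double-create
      },
    },
    {
      Update: {
        TableName: "MyApp",
        Key: { PK: "USER#u_99", SK: "PROFILE" },
        UpdateExpression: "SET orderCount = orderCount + :one",
        ExpressionAttributeValues: { ":one": 1 },
      },
    },
  ],
};
// await client.send(new TransactWriteCommand(params)) — all-or-nothing
```

Transactions cost double the write units of plain writes and cannot span regions, so they are a targeted tool, not a default.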
Key Takeaways
- Cassandra and DynamoDB are query-first databases — design every table around a specific query.
- Wide partitions are dangerous — always bucket time-series data to prevent hot partitions.
- DynamoDB single-table design takes time to master, but co-locating related items under one partition key largely eliminates N+1 query patterns.
- AWS Keyspaces is the easiest Cassandra entry point — no cluster to manage.
- Bigtable dominates in the Google Cloud analytics world — used internally by Gmail, Google Maps, and YouTube.
- Choose wide-column databases only when you have proven write/scale requirements — the operational complexity and limited query flexibility are a real cost.
Enjoyed this article?
Explore the Data Engineering learning path for more.