Data Engineering · Intermediate

PostgreSQL: The Developer's Complete Guide

Master PostgreSQL from setup to production — data types, indexing, JSONB, window functions, partitioning, replication, and managed cloud options on Azure, AWS, and GCP.

Learnixo · April 17, 2026 · 8 min read
PostgreSQL · SQL · Database · JSONB · Indexing · Azure · AWS · GCP

Why PostgreSQL?

PostgreSQL is the world's most advanced open-source relational database and a default first choice for new production systems. It combines strict ACID compliance with features that blur the line between relational and NoSQL: native JSON, arrays, full-text search, geospatial queries (via PostGIS), and a powerful extension ecosystem.

Used by: Apple, Instagram, Reddit, Spotify, Twitch, GitHub, GitLab, Shopify.


Installation & Setup

Local Development

Bash
# macOS
brew install postgresql@16
brew services start postgresql@16

# Ubuntu / Debian (version 16 may require the PGDG apt repository)
sudo apt install postgresql-16
sudo systemctl start postgresql

# Docker (recommended for teams)
docker run -d \
  --name pg-dev \
  -e POSTGRES_PASSWORD=devpassword \
  -e POSTGRES_DB=myapp \
  -p 5432:5432 \
  postgres:16-alpine

First Connection

Bash
psql -U postgres -d myapp   # add -h localhost when connecting to the Docker container

# Or with connection string
psql "postgresql://postgres:devpassword@localhost:5432/myapp"
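Once connected, psql's backslash meta-commands are the quickest way to explore a database:

Bash
# Handy psql meta-commands (typed at the psql prompt):
#   \l          list databases
#   \dt         list tables in the current schema
#   \d users    describe a table: columns, indexes, constraints
#   \x          toggle expanded output for wide rows
#   \timing     show query execution time
#   \q          quit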

Core Data Types You Must Know

SQL
-- Integers
SMALLINT          -- 2 bytes, -32768 to 32767
INTEGER           -- 4 bytes
BIGINT            -- 8 bytes (use for IDs at scale)
SERIAL / BIGSERIAL  -- auto-increment (prefer IDENTITY in PG 10+)

-- Exact numeric
NUMERIC(10, 2)    -- money, measurements (never FLOAT for money)

-- Text
VARCHAR(255)      -- variable-length with limit
TEXT              -- unlimited (preferred in PostgreSQL)
CHAR(n)           -- fixed-length (rarely useful)

-- Date/Time
DATE              -- 2026-04-17
TIME              -- 14:30:00
TIMESTAMP         -- without timezone
TIMESTAMPTZ       -- WITH timezone (always prefer this)
INTERVAL          -- '3 hours', '2 days'

-- Boolean
BOOLEAN           -- true / false / null

-- Network
INET              -- IP address (192.168.1.1/24)
CIDR              -- network (192.168.1.0/24)
MACADDR           -- MAC address

-- Special
UUID              -- gen_random_uuid(); prefer over SERIAL for distributed systems
JSONB             -- binary JSON (indexable, fast)
JSON              -- text JSON (slower, preserves whitespace)
ARRAY             -- integer[], text[], etc.

Schema Design in Practice

SQL
-- Good schema for a SaaS app
CREATE TABLE tenants (
  id          UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  name        TEXT        NOT NULL,
  plan        TEXT        NOT NULL DEFAULT 'free',
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE users (
  id          UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id   UUID        NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
  email       TEXT        NOT NULL,
  display_name TEXT,
  metadata    JSONB       DEFAULT '{}',
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (tenant_id, email)
);

CREATE TABLE orders (
  id          BIGINT      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id     UUID        NOT NULL REFERENCES users(id),
  tenant_id   UUID        NOT NULL REFERENCES tenants(id),
  total_cents INTEGER     NOT NULL CHECK (total_cents >= 0),
  status      TEXT        NOT NULL DEFAULT 'pending'
                          CHECK (status IN ('pending','processing','shipped','delivered','cancelled')),
  metadata    JSONB       DEFAULT '{}',
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Indexing: The Most Impactful Skill

B-Tree (Default — 90% of cases)

SQL
-- Single column
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Composite: column order matters (equality first, then range)
CREATE INDEX idx_orders_tenant_status ON orders(tenant_id, status);

-- Partial: only index the rows you query
CREATE INDEX idx_orders_pending ON orders(tenant_id, created_at)
WHERE status = 'pending';

-- Covering: index contains all needed columns (avoids heap fetch)
CREATE INDEX idx_orders_covering ON orders(tenant_id, status)
INCLUDE (total_cents, created_at);
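An index only helps if the planner actually uses it, and EXPLAIN ANALYZE confirms that. A quick check against the orders table above (the UUID literal is a placeholder):

SQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT tenant_id, created_at FROM orders
WHERE tenant_id = '00000000-0000-0000-0000-000000000000'
  AND status = 'pending';
-- Expect "Index Scan using idx_orders_pending" (or a Bitmap Index Scan);
-- a "Seq Scan" means the index was skipped or cannot be used.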

GIN (Inverted — for JSONB and arrays)

SQL
-- Index all keys inside JSONB
CREATE INDEX idx_users_metadata ON users USING GIN(metadata);

-- Query uses the index:
SELECT * FROM users WHERE metadata @> '{"role": "admin"}';
SELECT * FROM users WHERE metadata ? 'phone_verified';

BRIN (Block Range — for time series)

SQL
-- Minimal storage, fast range scans on timestamp columns
CREATE INDEX idx_events_created_brin ON events USING BRIN(created_at);

Index Maintenance

SQL
-- Find unused indexes (wasted write overhead)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%_pkey';

-- Find slow queries needing indexes
-- (requires the pg_stat_statements extension, preloaded via shared_preload_libraries)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 20;

JSONB: NoSQL Inside PostgreSQL

SQL
-- Store flexible product attributes
CREATE TABLE products (
  id       UUID    PRIMARY KEY DEFAULT gen_random_uuid(),
  sku      TEXT    NOT NULL UNIQUE,
  name     TEXT    NOT NULL,
  category TEXT    NOT NULL,
  attrs    JSONB   DEFAULT '{}'  -- varies per category
);

INSERT INTO products (sku, name, category, attrs) VALUES
  ('LAPTOP-001', 'ThinkPad X1', 'laptop',
   '{"ram_gb": 32, "storage_gb": 1000, "screen_inch": 14, "weight_kg": 1.12}'),
  ('SHIRT-001', 'Oxford Shirt', 'clothing',
   '{"size": "L", "color": "navy", "material": "cotton", "care": ["machine wash"]}');

-- Query operators
SELECT * FROM products
WHERE attrs @> '{"ram_gb": 32}';           -- contains

SELECT * FROM products
WHERE attrs ->> 'color' = 'navy';          -- extract as text

SELECT attrs -> 'care' -> 0                -- navigate nested
FROM products WHERE sku = 'SHIRT-001';

-- Aggregate JSON data
SELECT category,
       AVG((attrs ->> 'ram_gb')::int) AS avg_ram
FROM products
WHERE attrs ? 'ram_gb'
GROUP BY category;
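JSONB also supports in-place updates with standard operators: || merges objects, jsonb_set writes a value at a path, and - removes a key. Against the products table above:

SQL
-- Merge a new key into attrs
UPDATE products
SET attrs = attrs || '{"warranty_years": 3}'
WHERE sku = 'LAPTOP-001';

-- Overwrite a single value at a path
UPDATE products
SET attrs = jsonb_set(attrs, '{ram_gb}', '64')
WHERE sku = 'LAPTOP-001';

-- Delete a key
UPDATE products
SET attrs = attrs - 'care'
WHERE sku = 'SHIRT-001';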

Window Functions

SQL
-- Running total per tenant
SELECT
  id,
  tenant_id,
  total_cents,
  SUM(total_cents) OVER (
    PARTITION BY tenant_id
    ORDER BY created_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  RANK() OVER (PARTITION BY tenant_id ORDER BY total_cents DESC) AS rank_in_tenant,
  LAG(total_cents, 1) OVER (PARTITION BY tenant_id ORDER BY created_at) AS prev_order_value
FROM orders;

-- Percentile bucketing
SELECT
  user_id,
  total_cents,
  NTILE(4) OVER (ORDER BY total_cents) AS quartile,
  PERCENT_RANK() OVER (ORDER BY total_cents) AS percentile
FROM orders;
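The same frame syntax gives moving aggregates; for example, a rolling average over each tenant's last seven orders:

SQL
SELECT
  id,
  tenant_id,
  total_cents,
  AVG(total_cents) OVER (
    PARTITION BY tenant_id
    ORDER BY created_at
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- current row plus the 6 before it
  ) AS rolling_avg_7
FROM orders;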

CTEs and Recursive Queries

SQL
-- Complex reporting with multiple CTEs
WITH
revenue_per_tenant AS (
  SELECT tenant_id, SUM(total_cents) AS total_revenue
  FROM orders WHERE status = 'delivered'
  GROUP BY tenant_id
),
top_tenants AS (
  SELECT tenant_id FROM revenue_per_tenant
  WHERE total_revenue > 1_000_000  -- underscore digit separators need PG 16+
)
SELECT t.name, r.total_revenue
FROM tenants t
JOIN revenue_per_tenant r ON r.tenant_id = t.id
WHERE t.id IN (SELECT tenant_id FROM top_tenants)
ORDER BY r.total_revenue DESC;

-- Recursive: org chart traversal
WITH RECURSIVE org_tree AS (
  SELECT id, name, manager_id, 1 AS depth
  FROM employees WHERE manager_id IS NULL

  UNION ALL

  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON ot.id = e.manager_id
)
SELECT * FROM org_tree ORDER BY depth, name;

Table Partitioning

SQL
-- Range partitioning for time series (each partition = one month)
CREATE TABLE events (
  id         BIGINT GENERATED ALWAYS AS IDENTITY,
  tenant_id  UUID NOT NULL,
  type       TEXT NOT NULL,
  payload    JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
  FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- ...automate this with pg_partman extension

-- List partitioning by tenant
CREATE TABLE orders_partitioned (
  id        BIGINT,
  tenant_id TEXT NOT NULL,
  total     INTEGER
) PARTITION BY LIST (tenant_id);

CREATE TABLE orders_tenant_a PARTITION OF orders_partitioned
  FOR VALUES IN ('tenant-a', 'tenant-b');
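Dropping old data then becomes a metadata operation: detach the partition and drop it instead of running a slow DELETE. (DETACH ... CONCURRENTLY requires PG 14+.)

SQL
-- Retire a month of events without bloating the parent table
ALTER TABLE events DETACH PARTITION events_2026_01 CONCURRENTLY;
DROP TABLE events_2026_01;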

Full-Text Search

SQL
-- Built-in FTS (no Elasticsearch needed for basic search)
ALTER TABLE products ADD COLUMN search_vector TSVECTOR;

UPDATE products SET search_vector =
  to_tsvector('english', name || ' ' || COALESCE(category, ''));

CREATE INDEX idx_products_fts ON products USING GIN(search_vector);

-- Search with ranking
SELECT name, ts_rank(search_vector, query) AS rank
FROM products, to_tsquery('english', 'laptop & SSD') AS query
WHERE search_vector @@ query
ORDER BY rank DESC;
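On PostgreSQL 12+, a stored generated column keeps the vector in sync on every write, replacing the manual UPDATE above (a sketch, assuming search_vector has not already been added):

SQL
ALTER TABLE products
  ADD COLUMN search_vector TSVECTOR
  GENERATED ALWAYS AS (
    to_tsvector('english', name || ' ' || category)
  ) STORED;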

Cloud Managed PostgreSQL

Azure Database for PostgreSQL Flexible Server

Bash
# Create with Azure CLI
az postgres flexible-server create \
  --resource-group myRG \
  --name myapp-pg \
  --location eastus \
  --admin-user pgadmin \
  --admin-password $PG_PASSWORD \
  --sku-name Standard_D4ds_v5 \
  --tier GeneralPurpose \
  --storage-size 128 \
  --version 16 \
  --high-availability ZoneRedundant

Key features: Zone-redundant HA, read replicas, point-in-time restore up to 35 days, Entra ID authentication, virtual network integration.

AWS RDS for PostgreSQL / Aurora PostgreSQL

Bash
# RDS PostgreSQL
aws rds create-db-instance \
  --db-instance-identifier myapp-pg \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --engine-version 16.2 \
  --master-username pgadmin \
  --master-user-password $PG_PASSWORD \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --backup-retention-period 7

Aurora vs RDS: AWS quotes up to 3x the throughput of stock PostgreSQL for Aurora, which also auto-scales storage to 128 TB and offers a Serverless v2 option. Aurora is a strong default for new production systems on AWS.

GCP Cloud SQL for PostgreSQL / AlloyDB

Bash
# Cloud SQL
gcloud sql instances create myapp-pg \
  --database-version=POSTGRES_16 \
  --region=us-central1 \
  --tier=db-custom-4-15360 \
  --availability-type=REGIONAL \
  --backup-start-time=02:00

# AlloyDB (PostgreSQL-compatible; Google quotes up to 4x faster for OLTP)
gcloud alloydb clusters create myapp-cluster \
  --region=us-central1 \
  --password=$PG_PASSWORD

Connection Pooling

Direct connections are expensive. Always use PgBouncer or your cloud provider's built-in pooler.

INI
# pgbouncer.ini
[databases]
myapp = host=db.internal port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction        # best for most apps; session features (LISTEN/NOTIFY, advisory locks) need session mode
max_client_conn = 1000
default_pool_size = 20
min_pool_size = 5
server_idle_timeout = 600

C#
// .NET — use Npgsql with connection pooling
var conn = "Host=localhost;Database=myapp;Username=app;Password=pass;Maximum Pool Size=20;";

Key Takeaways

  • Always use TIMESTAMPTZ — timezone-aware timestamps prevent subtle bugs in multi-region apps.
  • UUID over SERIAL for distributed IDs — gen_random_uuid() is fast and avoids coordination.
  • JSONB is a superpower — store variable attributes without sacrificing query performance.
  • Partial indexes can cut index size by 90% for filtered queries.
  • Use connection pooling — 20-50 pooled connections handle thousands of concurrent app connections.
  • Prefer managed (RDS, Flexible Server, Cloud SQL) over self-hosted — replication, backups, and failover are handled for you.

Enjoyed this article?

Explore the Data Engineering learning path for more.
