.NET & C# Development · Lesson 199 of 229
Case Study: 500 Tenants, One PostgreSQL Database
Case Study: 500 Tenants, One PostgreSQL Database
System: B2B project management SaaS — 500 business tenants, 85,000 users
Stack: ASP.NET Core 9, EF Core 9, PostgreSQL 16, Azure Database for PostgreSQL
Architecture: Shared database, shared schema (all tenants in the same tables)
Journey duration: 3 years (20 tenants → 500 tenants)
Three inflection points: 50 tenants, 200 tenants, 450 tenants
Starting Architecture (20 Tenants)
All tenants share one database, one schema.
Every table has a TenantId column.
Application code filters by TenantId on every query.
Schema:
projects (id, tenant_id, name, status, ...)
tasks (id, tenant_id, project_id, title, ...)
users (id, tenant_id, email, role, ...)
comments (id, tenant_id, task_id, body, ...)
Row count at 20 tenants:
projects: ~4,000 rows
tasks: ~40,000 rows
users: ~2,000 rows
This worked fine. Zero problems.Inflection Point 1: 50 Tenants — The Data Leak Bug
At 50 tenants, a developer forgot a .Where(x => x.TenantId == tenantId) on a LINQ query. For 3 hours, one enterprise customer could see another customer's projects.
// The bug — easy to write, hard to notice
public async Task<List<Project>> GetProjectsAsync(int tenantId)
{
return await db.Projects
// .Where(p => p.TenantId == tenantId) ← developer forgot this
.Where(p => p.Status == "Active")
.ToListAsync();
}Fix 1: Global Query Filters
// DbContext — apply TenantId filter to every query, automatically
public class AppDbContext(
DbContextOptions<AppDbContext> opts,
ITenantContext tenant)
: DbContext(opts)
{
protected override void OnModelCreating(ModelBuilder model)
{
// Applied to every query on these entities — impossible to forget
model.Entity<Project>().HasQueryFilter(p => p.TenantId == tenant.TenantId);
model.Entity<Task>().HasQueryFilter(t => t.TenantId == tenant.TenantId);
model.Entity<User>().HasQueryFilter(u => u.TenantId == tenant.TenantId);
model.Entity<Comment>().HasQueryFilter(c => c.TenantId == tenant.TenantId);
}
}
// Register tenant context per-request
public class TenantContext(IHttpContextAccessor http) : ITenantContext
{
public int TenantId => int.Parse(
http.HttpContext!.User.FindFirst("tenant_id")!.Value);
}// Program.cs
builder.Services.AddHttpContextAccessor();
builder.Services.AddScoped<ITenantContext, TenantContext>();
builder.Services.AddDbContext<AppDbContext>();After this change:
- Every EF Core query automatically includes WHERE tenant_id = ?
- Developers cannot forget the filter — it is not optional
- To deliberately bypass (for admin queries): use .IgnoreQueryFilters()
which is a conspicuous call, easy to spot in code reviewFix 2: PostgreSQL Row-Level Security (defence-in-depth)
-- Even if the application sends a query without a tenant filter,
-- PostgreSQL enforces the policy at the database level
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE comments ENABLE ROW LEVEL SECURITY;
-- Policy: only rows where tenant_id matches the session setting
CREATE POLICY tenant_isolation ON projects
USING (tenant_id = current_setting('app.tenant_id')::int);
CREATE POLICY tenant_isolation ON tasks
USING (tenant_id = current_setting('app.tenant_id')::int);// Set the session variable on every request
public class TenantRlsInterceptor(ITenantContext tenant)
: DbCommandInterceptor
{
public override async ValueTask<DbDataReader> ReaderExecutingAsync(
DbCommand command,
CommandEventData eventData,
InterceptionResult<DbDataReader> result,
CancellationToken ct = default)
{
// Set the PostgreSQL session variable before every command
await using var setCmd = command.Connection!.CreateCommand();
setCmd.Transaction = command.Transaction;
setCmd.CommandText = $"SET LOCAL app.tenant_id = {tenant.TenantId}";
await setCmd.ExecuteNonQueryAsync(ct);
return await base.ReaderExecutingAsync(command, eventData, result, ct);
}
}Inflection Point 2: 200 Tenants — The Noisy Neighbour
At 200 tenants, the database started showing P99 latency spikes. Investigation with pg_stat_activity revealed a single enterprise tenant (Tenant #47) running a nightly export job that scanned 2M rows — blocking connections for everyone.
Monitor: pg_stat_statements
SELECT
calls,
mean_exec_time,
total_exec_time,
(regexp_matches(query, 'tenant_id = (\d+)'))[1] AS tenant_id,
left(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Result:
Tenant 47 export query: 45 seconds mean exec time, 1,200 calls/day
Cost: 54,000 seconds of database time per day
All other tenants combined: ~8,000 secondsFix: Per-Tenant Statement Timeout + Connection Isolation
-- Apply a statement timeout per tenant (stored in tenants table)
-- Default: 30 seconds. Tenant 47 (enterprise): 300 seconds.
UPDATE tenants SET statement_timeout_ms = 300000 WHERE id = 47;
UPDATE tenants SET statement_timeout_ms = 30000 WHERE statement_timeout_ms IS NULL;// Apply the tenant's timeout on each connection
public class TenantConnectionInterceptor(ITenantContext tenant, TenantSettingsCache settings)
: DbConnectionInterceptor
{
public override async Task ConnectionOpenedAsync(
DbConnection connection,
ConnectionEndEventData eventData,
CancellationToken ct = default)
{
var cfg = await settings.GetAsync(tenant.TenantId, ct);
await using var cmd = connection.CreateCommand();
cmd.CommandText = $"SET statement_timeout = '{cfg.StatementTimeoutMs}';";
await cmd.ExecuteNonQueryAsync(ct);
}
}Fix: Separate Connection Pool for Background Jobs
// Tenant's export jobs go through a separate pool with lower priority
// The main API pool is never affected by long-running exports
// appsettings.json
{
"ConnectionStrings": {
"Api": "Host=db;Database=saas;Username=api_user;Maximum Pool Size=50",
"Background": "Host=db;Database=saas;Username=bg_user;Maximum Pool Size=10"
}
}
// bg_user PostgreSQL role has lower priority and a default statement_timeout of 120s
// CREATE ROLE bg_user WITH LOGIN PASSWORD '...' NOSUPERUSER;
// ALTER ROLE bg_user SET statement_timeout = '120s';Fix: Index Optimisation for Tenant-Specific Queries
-- Before: index on tenant_id alone — low cardinality, poor selectivity
CREATE INDEX idx_tasks_tenant ON tasks (tenant_id);
-- After: composite index — tenant_id first, then the common filter columns
-- Covers: WHERE tenant_id = 47 AND status = 'Active' AND created_at > '2026-01-01'
CREATE INDEX idx_tasks_tenant_status_created
ON tasks (tenant_id, status, created_at DESC)
WHERE deleted_at IS NULL; -- partial index excludes soft-deleted rows
-- Result: Tenant 47 export dropped from 45s to 2.1sInflection Point 3: 450 Tenants — Schema Separation for Enterprise
At 450 tenants, two enterprise clients (each with 5,000+ users) signed contracts requiring:
- Data isolation guarantees (auditable proof that their data is separate)
- Ability to run their own database backups
- Custom retention policies
Shared-schema RLS was not enough for their compliance teams.
Decision: Hybrid Model — Shared Schema for SMB, Separate Schema for Enterprise
Architecture after:
Shared schema: tenants 1-448 (SMB customers, default)
Schema per tenant: tenants 449, 450 (enterprise customers)
Routing logic:
- Check tenant configuration at startup
- If enterprise: connect to tenant-specific schema
- If SMB: connect to shared schema with RLS// Factory that returns the right DbContext based on tenant tier
public class TenantDbContextFactory(
IConfiguration config,
TenantSettingsCache settings,
IServiceProvider sp)
{
public async Task<AppDbContext> CreateAsync(int tenantId, CancellationToken ct)
{
var tenant = await settings.GetAsync(tenantId, ct);
if (tenant.Tier == TenantTier.Enterprise)
{
// Enterprise: dedicated connection string per tenant
var connStr = config[$"ConnectionStrings:Tenant_{tenantId}"];
var opts = new DbContextOptionsBuilder<AppDbContext>()
.UseNpgsql(connStr)
.Options;
return new AppDbContext(opts, sp.GetRequiredService<ITenantContext>());
}
// SMB: shared connection string (RLS enforces isolation)
return sp.GetRequiredService<AppDbContext>();
}
}Monitoring Setup That Prevented Further Incidents
// Background job: runs every minute, emits per-tenant metrics
public class TenantDatabaseMonitor(AppDbContext db, IMetrics metrics)
: BackgroundService
{
protected override async Task ExecuteAsync(CancellationToken ct)
{
while (!ct.IsCancellationRequested)
{
// Query pg_stat_statements for top-10 tenants by db time
var tenantCosts = await db.Database
.SqlQueryRaw<TenantDbCost>("""
SELECT
(regexp_matches(query, 'tenant_id = (\d+)'))[1]::int AS TenantId,
SUM(total_exec_time) AS TotalExecMs,
SUM(calls) AS TotalCalls
FROM pg_stat_statements
WHERE query LIKE '%tenant_id%'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
""")
.ToListAsync(ct);
foreach (var t in tenantCosts)
{
metrics.RecordGauge("db.exec_time_ms", t.TotalExecMs,
new Tag("tenant_id", t.TenantId.ToString()));
}
await System.Threading.Tasks.Task.Delay(TimeSpan.FromMinutes(1), ct);
}
}
}Alerts configured:
- Any single tenant consuming >20% of total database time → investigate
- P99 latency above 2000ms → page on-call
- Connection pool >80% utilized → add read replica
- Tenant with zero queries for 30 days → churn risk signalResults and Architecture Summary
Tenant count: 20 → 500 (25x growth)
Database CPU: Stays under 40% peak
P99 latency: Under 800ms (was spiking to 12,000ms at 200 tenants)
Data leak bugs: Zero since global query filters + RLS (3 years)
What worked:
1. Global query filters — eliminate entire class of data leak bugs
2. RLS — defence-in-depth; the database is the last line of defence
3. Per-tenant statement timeout — noisy neighbours get a shorter leash
4. Composite indexes (tenant_id, ...) — queries are narrow by design
5. Hybrid architecture — shared schema for SMB, dedicated for enterprise
What didn't work:
- Shared pool for API + background jobs — background jobs steal connections
- Index on tenant_id alone — low cardinality means full table scans
- Trusting application-level filtering alone — one bug exposes all tenantsWhen to Move Off a Shared Database
Stay on shared database while:
- All tenants fit in one database (under 500GB with good indexing)
- No compliance requirement for physical data separation
- P99 latency is acceptable with per-tenant timeouts and indexing
Move to per-tenant databases when:
- An enterprise client requires auditable physical isolation
- A single tenant's data volume dominates the database (>30% of rows)
- You need per-tenant backup/restore SLAs
- Different tenants need different database versions or extensions
Hybrid is the pragmatic answer:
- SMB tenants: shared schema, low operational cost
- Enterprise tenants: dedicated schema, premium tier pricing
- The routing layer adds one config lookup per request — negligible cost