
Scaling to 10 Million Users — A Developer's Complete Playbook

Everything a developer needs to know about scaling a system from hundreds to 10 million users. Horizontal and vertical scaling, caching strategies, connection pooling, database sharding, observability, and how to design APIs that survive growth.

Asma Hafeez · April 17, 2026 · 20 min read
system-design · scalability · caching · horizontal-scaling · observability · architecture


Your app works perfectly with 100 users. Then 1,000. Then suddenly you're at 50,000 and things start breaking. Queries time out. Deployments take the site down. Users hit errors at peak hours.

Most developers only think about scalability when it's already on fire. This guide is for building it in from the start — or adding it before the fire starts.

We'll cover every layer: vertical scaling, horizontal scaling, load balancing, caching, connection pooling, database patterns, async processing, observability, and how to design APIs that survive growth.


The Journey: How Systems Evolve

Before tactics, understand the stages. Every system at 10M users started at 1 user. The architecture that works at each stage is different:

| Stage | Users | Problem | Solution |
|-------|-------|---------|----------|
| MVP | 1–1k | Just ship it | Single server, single DB |
| Growing | 1k–100k | DB bottleneck | Read replicas, caching |
| Scaling | 100k–1M | Traffic spikes | Horizontal scaling, CDN |
| Mature | 1M–10M | Data volume | Sharding, microservices |
| Hyperscale | 10M+ | Everything | Distributed everything |

Don't build for 10M on day one. Build for the next stage, not five stages ahead. Over-engineering early is as costly as under-engineering.


Part 1 — Vertical Scaling

Vertical scaling (scale up) means giving your existing server more resources: more CPU, more RAM, faster disk.

Single server: 2 vCPU, 4 GB RAM
          ↓ vertical scale
Single server: 16 vCPU, 64 GB RAM

When to use it:

  • First response to performance problems
  • Stateful services that are hard to distribute (e.g., legacy apps)
  • Databases before sharding

Limits and trade-offs:

  • Has a ceiling — you can't add infinite CPU to one machine
  • Single point of failure — one server means one outage point
  • Downtime required to resize (usually)
  • Cost increases non-linearly at the high end

In practice: Always exhaust vertical scaling before horizontal. Adding RAM is cheaper and faster than rewriting for distribution.


Part 2 — Horizontal Scaling

Horizontal scaling (scale out) means adding more servers instead of bigger ones.

Server A (single, overloaded)
          ↓ horizontal scale
Server A + Server B + Server C (behind a load balancer)

The Load Balancer

A load balancer sits in front of your servers and distributes requests across them.

Client → Load Balancer → Server A
                       → Server B
                       → Server C

Load balancing algorithms:

  • Round robin — distribute evenly in sequence
  • Least connections — send to server with fewest active connections
  • IP hash — same client always goes to same server (sticky sessions)
  • Weighted round robin — servers with more capacity get more traffic
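
To make the first two concrete, here is a minimal sketch of round-robin and least-connections selection (a hypothetical ServerNode type; real load balancers also handle health checks, draining, and weights):

C#
// Selection logic only: illustrative, not a production load balancer
public class ServerNode
{
    public required string Host { get; init; }
    public int ActiveConnections; // maintained by (omitted) connection tracking
}

public class LoadBalancer
{
    private readonly ServerNode[] _servers;
    private int _next = -1;

    public LoadBalancer(ServerNode[] servers) => _servers = servers;

    // Round robin: rotate through servers in sequence
    public ServerNode RoundRobin()
    {
        var i = Interlocked.Increment(ref _next) & int.MaxValue; // stay non-negative
        return _servers[i % _servers.Length];
    }

    // Least connections: pick the server with the fewest active connections
    public ServerNode LeastConnections() => _servers.MinBy(s => s.ActiveConnections)!;
}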

Cloud load balancers:

  • Azure: Application Gateway (L7), Azure Load Balancer (L4)
  • AWS: ALB (L7), NLB (L4)
  • GCP: Cloud Load Balancing

The Session Problem

HTTP is stateless, but your app might store session data in memory. When you scale out, Server B doesn't have the session that Server A created for a user.

Solutions:

  1. Sticky sessions — load balancer always sends a user to the same server. Easy, but bad for reliability: if that server dies, its users lose their sessions.

  2. Shared session store (best practice) — store sessions in Redis, not in-process:

C#
// .NET — store sessions in Redis
builder.Services.AddStackExchangeRedisCache(options =>
    options.Configuration = "redis:6379");

builder.Services.AddSession(options =>
{
    options.IdleTimeout = TimeSpan.FromMinutes(30);
    options.Cookie.HttpOnly = true;
    options.Cookie.IsEssential = true;
});
  3. Stateless design (best) — use JWT tokens. The token carries all state, so any server can validate it.
C#
// JWT validation — works on any server instance
builder.Services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.TokenValidationParameters = new TokenValidationParameters
        {
            ValidateIssuerSigningKey = true,
            IssuerSigningKey = new SymmetricSecurityKey(
                Encoding.UTF8.GetBytes(config["Jwt:Secret"]!)),
            ValidateIssuer = true,
            ValidateAudience = true,
            ClockSkew = TimeSpan.Zero
        };
    });

Auto-Scaling

Manual scaling is slow. Auto-scaling adjusts capacity automatically based on metrics.

CPU > 70% for 3 minutes → add 2 servers
CPU < 30% for 10 minutes → remove 1 server

In Azure:

Bash
# Scale App Service based on CPU
az monitor autoscale create \
  --resource-group MyRG \
  --resource my-app-service-plan \
  --resource-type Microsoft.Web/serverfarms \
  --name autoscale-cpu \
  --min-count 2 \
  --max-count 10 \
  --count 2

az monitor autoscale rule create \
  --resource-group MyRG \
  --autoscale-name autoscale-cpu \
  --condition "Percentage CPU > 70 avg 5m" \
  --scale out 2

Part 3 — Caching

The most impactful single change you can make to a slow system is adding a cache.

The fundamental insight: most reads are for the same data, repeatedly. Why hit the database every time?

Without cache: 100 users load /products → 100 DB queries
With cache:    100 users load /products → 1 DB query + 99 cache hits

Caching Layers

There are four places you can cache, ordered from closest to the user to closest to the data:

1. Browser cache (fastest — no network at all)
2. CDN cache (fast — nearest edge server)
3. Application cache (in-process — no network hop)
4. Distributed cache / Redis (fast — single network hop to Redis)

In-Process Cache (.NET)

For data that doesn't change often and is small enough to live in app memory:

C#
builder.Services.AddMemoryCache();

public class ProductService
{
    private readonly IMemoryCache _cache;
    private readonly IProductRepository _products;

    public ProductService(IMemoryCache cache, IProductRepository products)
    {
        _cache = cache;
        _products = products;
    }

    public async Task<IReadOnlyList<Product>> GetFeaturedProductsAsync()
    {
        return await _cache.GetOrCreateAsync("featured-products", async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
            entry.SlidingExpiration = TimeSpan.FromMinutes(2);
            return await _products.GetFeaturedAsync();
        }) ?? [];
    }
}

Problem: In-process cache doesn't work with horizontal scaling — each server has its own cache and they can get out of sync.

Distributed Cache — Redis

Redis is the standard distributed cache. All servers share one Redis instance.

C#
// Cache-aside pattern — most common
public async Task<Product?> GetProductAsync(string id)
{
    var cacheKey = $"product:{id}";

    // 1. Check cache
    var cached = await _cache.GetStringAsync(cacheKey);
    if (cached is not null)
        return JsonSerializer.Deserialize<Product>(cached);

    // 2. Cache miss — hit the database
    var product = await _db.Products.FindAsync(id);
    if (product is null) return null;

    // 3. Populate cache
    await _cache.SetStringAsync(cacheKey, JsonSerializer.Serialize(product),
        new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(15)
        });

    return product;
}

// Cache invalidation — when data changes
public async Task UpdateProductAsync(Product updated)
{
    _db.Products.Update(updated);
    await _db.SaveChangesAsync();

    // Invalidate so the next read repopulates the cache...
    await _cache.RemoveAsync($"product:{updated.Id}");

    // ...or update the cache in place instead of removing:
    // await _cache.SetStringAsync($"product:{updated.Id}",
    //     JsonSerializer.Serialize(updated),
    //     new DistributedCacheEntryOptions
    //     {
    //         AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(15)
    //     });
}

Caching Strategies

| Pattern | How it works | Best for |
|---------|-------------|----------|
| Cache-aside | App checks cache, misses go to DB | Most use cases — default choice |
| Write-through | Every write goes to both DB and cache | Read-heavy data where stale reads are unacceptable |
| Write-behind | Writes go to cache first, async to DB | Very high write throughput |
| Read-through | Cache fetches from DB automatically | Simplifies app code |
| Refresh-ahead | Cache proactively refreshes before expiry | Predictable access patterns |
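
Cache-aside is shown above. For contrast, a minimal write-through sketch (reusing the hypothetical _db and _cache fields from the earlier examples): every write updates the database and the cache together, so subsequent reads don't miss on freshly written data.

C#
// Write-through: the write path keeps the cache populated
public async Task SaveProductAsync(Product product)
{
    // 1. Write to the source of truth first
    _db.Products.Update(product);
    await _db.SaveChangesAsync();

    // 2. Then write the same data to the cache
    await _cache.SetStringAsync($"product:{product.Id}",
        JsonSerializer.Serialize(product),
        new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(15)
        });
}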

What to Cache

✅ Cache:
  - Product listings, category pages
  - User profile data
  - Configuration and feature flags
  - Computed values (totals, summaries)
  - API responses from third-party services

❌ Don't cache:
  - Passwords, tokens, sensitive auth data
  - Real-time data (stock prices, live inventory)
  - User-specific financial data
  - Write-heavy data (causes constant invalidation)

CDN for Static Assets

A CDN (Content Delivery Network) caches your static files at edge nodes worldwide.

User in Oslo → CDN edge in Oslo (10ms) instead of Server in US (200ms)

For your API, use CDN for:

  • Static assets (JS, CSS, images)
  • Cacheable API responses (add Cache-Control: public, max-age=300)
C#
// Add cache headers to API responses
[HttpGet("products")]
[ResponseCache(Duration = 300, Location = ResponseCacheLocation.Any)]
public async Task<IActionResult> GetProducts()
{
    var products = await _productService.GetFeaturedProductsAsync();
    return Ok(products);
}

Part 4 — Connection Pooling

Every database connection is expensive to create. Opening a connection involves network handshake, authentication, and resource allocation — it takes 10–50ms.

Without pooling:

Request 1: Open connection (50ms) → Query (5ms) → Close connection
Request 2: Open connection (50ms) → Query (5ms) → Close connection
Request 3: Open connection (50ms) → Query (5ms) → Close connection

With pooling:

App startup: Create 10 connections, keep them open
Request 1: Borrow connection (0ms) → Query (5ms) → Return to pool
Request 2: Borrow connection (0ms) → Query (5ms) → Return to pool
Request 3: Borrow connection (0ms) → Query (5ms) → Return to pool

Connection Pooling in .NET

ADO.NET and EF Core pool connections automatically. The pool is controlled via the connection string:

Server=sql.example.com;Database=mydb;User=app;Password=xxx;
Min Pool Size=5;Max Pool Size=100;Connection Timeout=30;

Key settings:

  • Min Pool Size — connections kept open always (warm start)
  • Max Pool Size — maximum concurrent connections (default: 100)
  • Connection Timeout — seconds to wait for a free connection before throwing
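
The same settings can also be built in code with SqlConnectionStringBuilder (Microsoft.Data.SqlClient), which avoids typos in keyword names; a small sketch:

C#
// Pooled connection string built in code
var csb = new SqlConnectionStringBuilder
{
    DataSource = "sql.example.com",
    InitialCatalog = "mydb",
    UserID = "app",
    Password = "xxx",
    MinPoolSize = 5,     // warm connections kept open
    MaxPoolSize = 100,   // hard cap per app instance
    ConnectTimeout = 30  // seconds to wait before throwing
};
var connectionString = csb.ConnectionString;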

Monitoring Pool Exhaustion

When all connections are in use and a new request arrives, it waits. If Connection Timeout passes, it throws. This is connection pool exhaustion — one of the most common production incidents at scale.

C#
// Illustrative only: ADO.NET has no public "pool" object to query.
// In practice, read Microsoft.Data.SqlClient's EventCounters
// (e.g. "number-of-active-connections", "number-of-free-connections")
// and forward them as custom metrics:
_telemetry.TrackMetric("DbConnectionPoolUsed", connectionsInUse);
_telemetry.TrackMetric("DbConnectionPoolAvailable", connectionsAvailable);

Signs of pool exhaustion:

  • System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool
  • Database queries suddenly timing out at traffic spikes
  • High connection wait times in Application Insights

Fixes:

  1. Increase Max Pool Size (but check your DB's connection limit first)
  2. Find and fix slow queries that hold connections too long
  3. Add read replicas and route read queries to them
  4. Use async/await properly — don't block threads (reduces effective connections needed)
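
A quick illustration of the last fix: blocking on async calls ties up thread pool threads and stretches how long each connection is held, which makes exhaustion worse under load.

C#
// Bad: sync-over-async blocks a thread and holds the connection longer
public Product? GetProductBlocking(string id) =>
    _db.Products.FindAsync(id).AsTask().Result; // don't do this

// Good: the thread is released back to the pool while the query runs
public async Task<Product?> GetProductAsync(string id) =>
    await _db.Products.FindAsync(id);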

Database Connection Limits

Your database also has a maximum connection limit:

| Database | Default max connections |
|----------|------------------------|
| PostgreSQL | 100 |
| MySQL | 151 |
| SQL Server | 32,767 |
| Azure SQL (S1) | 30 |
| Azure SQL (P1) | 200 |

With horizontal scaling, each server instance has its own connection pool. 10 instances × 100 connections = 1000 DB connections. This can easily exhaust a smaller database tier.

Solution: PgBouncer or Azure SQL connection pooler

A connection pooler sits between your app and the database, multiplexing thousands of app connections into a smaller number of actual database connections.

App instances (1000 connections) → PgBouncer → PostgreSQL (100 connections)

Part 5 — Database Scaling

The database is almost always the bottleneck at scale. Here's the progression:

1. Indexes First

Before anything else, add the right indexes. A missing index on a high-traffic query can be 100× slower than needed.

SQL
-- Slow: full table scan
SELECT * FROM orders WHERE customer_id = 'cust-123' ORDER BY created_at DESC;

-- After adding index:
CREATE INDEX CONCURRENTLY idx_orders_customer_date 
ON orders(customer_id, created_at DESC);
-- Now: index seek, 100× faster
C#
// EF Core — add indexes via Fluent API
modelBuilder.Entity<Order>()
    .HasIndex(o => new { o.CustomerId, o.CreatedAt })
    .IsDescending(false, true);

2. Read Replicas

Write to the primary, read from replicas. Most apps are 80–90% reads.

C#
// Two DbContext registrations — primary + read replica
builder.Services.AddDbContext<WriteDbContext>(options =>
    options.UseSqlServer(config["DB:Primary"]));

builder.Services.AddDbContext<ReadDbContext>(options =>
    options.UseSqlServer(config["DB:ReadReplica"])
           .UseQueryTrackingBehavior(QueryTrackingBehavior.NoTracking));

// Usage: write context for commands, read context for queries
public async Task<List<OrderSummary>> GetOrderSummariesAsync(string customerId)
{
    return await _readDb.Orders
        .Where(o => o.CustomerId == customerId)
        .Select(o => new OrderSummary { ... })
        .ToListAsync();
}

3. Database Sharding

When a single database can't handle the volume even with replicas, shard: split data across multiple database instances.

Orders A-M → Database Shard 1
Orders N-Z → Database Shard 2

Sharding strategies:

  • Range sharding — by customer ID range, date range
  • Hash sharding — hash(customer_id) % N — distributes evenly
  • Geo sharding — European users → EU database, US users → US database

Trade-off: Sharding adds significant complexity. Cross-shard queries are hard. Only shard when you have to.
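
If you do shard, the routing layer itself can be small. A minimal hash-sharding sketch (hypothetical shard connection strings; note that plain modulo makes adding shards painful, which is why real systems often use consistent hashing or a directory service):

C#
// Hash sharding: route each customer to a stable shard
public class ShardRouter
{
    private readonly string[] _shards;

    public ShardRouter(string[] shardConnectionStrings) => _shards = shardConnectionStrings;

    public string GetShardFor(string customerId)
    {
        // Stable hash: string.GetHashCode() varies per process, so hash the bytes
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(customerId));
        var hash = BitConverter.ToUInt32(bytes, 0);
        return _shards[hash % (uint)_shards.Length];
    }
}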

4. CQRS — Separate Read and Write Models

CQRS (Command Query Responsibility Segregation) means separate optimised data models for reads and writes:

Write model: normalised SQL → ensures consistency, handles transactions
Read model:  denormalised views / Redis → fast, purpose-built for each query
C#
// Command handler — writes to normalised SQL
public async Task<Guid> Handle(CreateOrderCommand cmd)
{
    var order = Order.Create(cmd.CustomerId, cmd.Items);
    await _orderRepository.AddAsync(order);
    await _eventBus.PublishAsync(new OrderCreatedEvent(order));
    return order.Id;
}

// Event handler — builds the read model
public async Task Handle(OrderCreatedEvent @event)
{
    await _readDb.ExecuteAsync("""
        INSERT INTO order_summaries (id, customer_id, total, status, created_at)
        VALUES (@Id, @CustomerId, @Total, @Status, @CreatedAt)
        """, @event);
}

// Query handler — reads the denormalised model
public async Task<OrderSummaryDto?> Handle(GetOrderSummaryQuery query)
{
    return await _readDb.QuerySingleOrDefaultAsync<OrderSummaryDto>(
        "SELECT * FROM order_summaries WHERE id = @Id", query);
}

Part 6 — Async Processing and Message Queues

Never make the user wait for things they don't need to wait for.

When an order is placed, the user needs confirmation immediately. But sending the email, updating inventory, generating the PDF receipt — these can happen asynchronously.

WITHOUT async:
User places order → Wait for email → Wait for PDF → Wait for inventory → Response (4 seconds)

WITH async:
User places order → Response (50ms) → Background: email, PDF, inventory

Message Queue Pattern

C#
// Place order API — fast, just saves and publishes
[HttpPost("orders")]
public async Task<IActionResult> PlaceOrder(PlaceOrderRequest request)
{
    var order = await _orderService.CreateAsync(request);
    
    // Publish event — returns immediately
    await _bus.PublishAsync(new OrderPlacedEvent
    {
        OrderId = order.Id,
        CustomerId = order.CustomerId,
        Items = order.Items,
        Total = order.Total
    });

    return Created($"/orders/{order.Id}", new { order.Id });
}

// Background worker — processes events from the queue
public class OrderProcessingWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        await _consumer.ConsumeAsync<OrderPlacedEvent>(async @event =>
        {
            await _emailService.SendOrderConfirmationAsync(@event);
            await _inventoryService.ReserveItemsAsync(@event.Items);
            await _pdfService.GenerateReceiptAsync(@event.OrderId);
        }, ct);
    }
}

Message queue options:

  • Azure Service Bus — enterprise, dead-letter queues, scheduled delivery
  • RabbitMQ — open source, powerful routing
  • Kafka — event streaming at massive scale, replay support
  • AWS SQS — simple managed queue

Part 7 — API Design for Scalability

How you design your API affects how well it scales. These patterns matter.

Pagination — Never Return Unbounded Results

C#
// Bad: returns all orders (could be millions)
[HttpGet("orders")]
public async Task<IActionResult> GetOrders()
{
    return Ok(await _db.Orders.ToListAsync());
}

// Good: cursor-based pagination
[HttpGet("orders")]
public async Task<IActionResult> GetOrders(
    [FromQuery] string? cursor = null,
    [FromQuery] int limit = 20)
{
    limit = Math.Clamp(limit, 1, 100);

    var query = _db.Orders.OrderByDescending(o => o.CreatedAt).AsQueryable();

    if (cursor is not null)
    {
        var cursorDate = DateTimeOffset.Parse(cursor);
        query = query.Where(o => o.CreatedAt < cursorDate);
    }

    var items = await query.Take(limit + 1).ToListAsync();
    var hasMore = items.Count > limit;
    var results = items.Take(limit).ToList();

    return Ok(new
    {
        data = results,
        nextCursor = hasMore ? results.Last().CreatedAt.ToString("O") : null,
        hasMore
    });
}

Rate Limiting — Protect Against Abuse

At scale, a few bad actors or misbehaving clients can overwhelm your system.

C#
// .NET 7+ built-in rate limiting
builder.Services.AddRateLimiter(options =>
{
    // Per-IP: 100 requests per minute
    options.AddPolicy("per-ip", context =>
        RateLimitPartition.GetFixedWindowLimiter(
            context.Connection.RemoteIpAddress?.ToString() ?? "unknown",
            _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 100,
                Window = TimeSpan.FromMinutes(1),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 10
            }));

    // Authenticated users: 1000 requests per minute
    options.AddPolicy("per-user", context =>
    {
        var userId = context.User.FindFirst(ClaimTypes.NameIdentifier)?.Value;
        if (userId is null)
            return RateLimitPartition.GetNoLimiter("anonymous");

        return RateLimitPartition.GetSlidingWindowLimiter(userId,
            _ => new SlidingWindowRateLimiterOptions
            {
                PermitLimit = 1000,
                Window = TimeSpan.FromMinutes(1),
                SegmentsPerWindow = 6
            });
    });
});

app.UseRateLimiter();

Idempotency — Safe to Retry

At scale, network failures and retries are common. Make operations safe to retry.

C#
// Idempotency key header — client generates a unique ID per request
[HttpPost("orders")]
public async Task<IActionResult> PlaceOrder(
    PlaceOrderRequest request,
    [FromHeader(Name = "Idempotency-Key")] string? idempotencyKey)
{
    if (idempotencyKey is not null)
    {
        // Check if we've already processed this request
        var existing = await _idempotencyStore.GetAsync(idempotencyKey);
        if (existing is not null)
            return Ok(existing); // Return the same response
    }

    var order = await _orderService.CreateAsync(request);

    if (idempotencyKey is not null)
        await _idempotencyStore.SetAsync(idempotencyKey, order,
            TimeSpan.FromHours(24));

    return Created($"/orders/{order.Id}", order);
}

Circuit Breakers — Isolate Failures

If your order service depends on a payment service and the payment service goes down, don't let it take down orders. Isolate the dependency with retries, a circuit breaker, and timeouts.

C#
// Polly — circuit breaker + retry
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddResilienceHandler("payment", pipeline =>
    {
        // Retry 3 times with exponential backoff
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromMilliseconds(200),
            BackoffType = DelayBackoffType.Exponential
        });

        // Circuit breaker: open when ≥50% of calls fail (min 5 calls) in a 30s window
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 5,
            BreakDuration = TimeSpan.FromSeconds(15)
        });

        // 2-second timeout per attempt
        pipeline.AddTimeout(TimeSpan.FromSeconds(2));
    });

Part 8 — Observability: How to Know Your System Is Healthy

At 10 million users, you can't watch logs manually. You need observability — the ability to understand your system's internal state from its external outputs.

The three pillars of observability:

Metrics  → What is happening now (latency, error rate, throughput)
Logs     → What happened and why
Traces   → How a request flowed through the system

The Four Golden Signals

Coined by Google's SRE team, these four metrics tell you almost everything you need to know:

  1. Latency — how long requests take (p50, p95, p99)
  2. Traffic — requests per second
  3. Error rate — percentage of requests returning errors
  4. Saturation — how full the system is (CPU%, memory%, queue depth)

Implementing Observability in .NET

Bash
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.Prometheus.AspNetCore
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
C#
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        metrics
            .AddAspNetCoreInstrumentation()      // HTTP request metrics
            .AddHttpClientInstrumentation()       // Outbound HTTP metrics
            .AddRuntimeInstrumentation()          // CPU, memory, GC
            .AddPrometheusExporter();             // Expose /metrics endpoint
    })
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation()        // SQL query tracing
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri("http://jaeger:4317");
            });
    });

// Custom metrics for business events
public class OrderMetrics
{
    private readonly Counter<long> _ordersCreated;
    private readonly Histogram<double> _orderTotal;
    private readonly Gauge<int> _activeOrders;

    public OrderMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("Orders");
        _ordersCreated = meter.CreateCounter<long>("orders.created", description: "Total orders created");
        _orderTotal = meter.CreateHistogram<double>("orders.total", unit: "USD");
        _activeOrders = meter.CreateGauge<int>("orders.active", description: "Currently active orders");
    }

    public void OrderCreated(double total, string region)
    {
        _ordersCreated.Add(1, new TagList { { "region", region } });
        _orderTotal.Record(total, new TagList { { "region", region } });
    }
}

What to Measure — The Developer Checklist

Infrastructure:

  • CPU usage per instance
  • Memory usage and GC pressure
  • Disk I/O and network throughput
  • DB connection pool usage

Application:

  • HTTP request latency (p50, p95, p99)
  • HTTP error rate (4xx, 5xx)
  • Cache hit rate
  • Queue depth and processing lag

Business:

  • Orders per minute
  • Checkout success rate
  • Payment failure rate
  • User registration rate

Alerting

Metrics only help if you're alerted when things go wrong. Define SLOs (Service Level Objectives):

YAML
# Example SLO definitions
- name: "API latency"
  condition: p99 > 500ms
  for: 5 minutes
  severity: warning

- name: "Error rate"
  condition: error_rate > 1%
  for: 2 minutes
  severity: critical

- name: "DB connections"
  condition: pool_usage > 80%
  for: 3 minutes
  severity: warning

Alerting tools:

  • Azure: Azure Monitor + Action Groups
  • Open source: Prometheus + Alertmanager + PagerDuty
  • Grafana: unified dashboards across all sources

Distributed Tracing — Follow a Request Across Services

At microservices scale, a single user request might touch 10 services. Distributed tracing gives you a complete picture:

User → API Gateway (5ms)
         → Auth Service (3ms)
         → Order Service (45ms)
              → Database query (30ms)
              → Cache lookup (0.5ms)
              → Payment Service (12ms)
              → Service Bus publish (2ms)

With OpenTelemetry, this trace is collected automatically for HTTP, SQL, and gRPC calls. In Jaeger or Azure Application Insights, you can see the full waterfall.
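
For work the automatic instrumentation can't see (business steps, in-process operations), you can open your own spans with ActivitySource. A short sketch, assuming the source name "Orders" is registered via AddSource in the tracing setup above; the inner methods are hypothetical:

C#
// Custom span: appears as a child of the current request's trace
private static readonly ActivitySource Source = new("Orders");

public async Task FulfilOrderAsync(Order order)
{
    using var activity = Source.StartActivity("order.fulfil");
    activity?.SetTag("order.id", order.Id);
    activity?.SetTag("order.total", order.Total);

    await ReserveStockAsync(order);
    await SchedulePickingAsync(order);
}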


Part 9 — What Developers Should Think About When Designing APIs

Before writing a single line of code, ask these questions:

1. What's the read/write ratio?

80% reads → invest in caching. 50/50 → invest in async writes and CQRS.

2. What data can be stale?

Product listings: 5-minute stale is fine. Payment status: never stale. Design TTLs accordingly.

3. What's the worst-case query?

The query that runs when the most data exists and filtering is least effective. Add an index for it now.

4. What happens when a downstream service is down?

Design for graceful degradation. Circuit breakers, fallbacks, queue-based retry.
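
A sketch of what graceful degradation can look like in code, assuming a hypothetical recommendations client and the Redis cache from earlier: if the downstream call fails, serve stale cached data instead of failing the request.

C#
// Fallback: degrade to cached data rather than failing the whole request
public async Task<IReadOnlyList<Product>> GetRecommendationsAsync(string userId)
{
    try
    {
        var fresh = await _recommendationClient.GetForUserAsync(userId);
        await _cache.SetStringAsync($"recs:{userId}",
            JsonSerializer.Serialize(fresh),
            new DistributedCacheEntryOptions
            {
                AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1)
            });
        return fresh;
    }
    catch (HttpRequestException)
    {
        // Downstream is unavailable: serve the last known data, or nothing
        var cached = await _cache.GetStringAsync($"recs:{userId}");
        return cached is null
            ? []
            : JsonSerializer.Deserialize<List<Product>>(cached) ?? [];
    }
}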

5. What happens at 10× traffic?

Identify the bottleneck:

  • CPU-bound → horizontal scale
  • DB-bound → cache, read replicas, sharding
  • I/O-bound → async, queues
  • Memory-bound → vertical scale, reduce allocations

6. How do you know when it's broken?

Add metrics and health checks from day one. Don't build a monitoring strategy after the fire.

C#
// Health checks — register all dependencies
builder.Services.AddHealthChecks()
    .AddDbContextCheck<AppDbContext>("database")
    .AddRedis(config["Redis:ConnectionString"]!, "redis")
    .AddAzureServiceBusTopic(config["ServiceBus:ConnectionString"]!, "orders", "service-bus")
    .AddCheck<CriticalDependencyHealthCheck>("payment-gateway");

app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = WriteMinimalHealthCheckResponse
});
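
WriteMinimalHealthCheckResponse above is left undefined; a bare-bones version might look like this (ResponseWriter takes an HttpContext and a HealthReport):

C#
// Minimal response writer: status code plus the status as plain text
static Task WriteMinimalHealthCheckResponse(HttpContext context, HealthReport report)
{
    context.Response.StatusCode = report.Status == HealthStatus.Healthy
        ? StatusCodes.Status200OK
        : StatusCodes.Status503ServiceUnavailable;
    return context.Response.WriteAsync(report.Status.ToString());
}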

7. Are your writes idempotent?

Distributed systems have retries. Make all write operations safe to retry.

8. Can you deploy without downtime?

Use zero-downtime deployments: blue/green, rolling, or deployment slots. Blue/green is easiest in Azure:

Bash
# Azure App Service deployment slots
az webapp deployment slot create --name my-api --resource-group MyRG --slot staging

# Deploy to staging
az webapp deploy --resource-group MyRG --name my-api --slot staging --src-path ./app.zip

# Test staging
curl https://my-api-staging.azurewebsites.net/health

# Swap to production (zero downtime)
az webapp deployment slot swap --name my-api --resource-group MyRG \
  --slot staging --target-slot production

The Scaling Checklist

Use this when reviewing a system or designing a new API:

Performance:

  • [ ] All high-traffic queries have indexes
  • [ ] N+1 query problems identified and fixed
  • [ ] Pagination on all list endpoints (cursor-based for large datasets)
  • [ ] Database queries have time bounds (query timeout)

Caching:

  • [ ] Cacheable data identified and TTLs defined
  • [ ] Cache invalidation strategy documented
  • [ ] Cache hit rate monitored
  • [ ] No sensitive data in cache

Scaling:

  • [ ] Application is stateless (no in-memory session)
  • [ ] Sessions/tokens use distributed store
  • [ ] Auto-scaling configured with tested scale-in/out conditions
  • [ ] Database connection pooling configured and monitored

Resilience:

  • [ ] Circuit breakers on all downstream HTTP calls
  • [ ] Retry with exponential backoff on transient failures
  • [ ] Dead-letter queues for failed messages
  • [ ] Graceful shutdown handles in-flight requests

Observability:

  • [ ] Health check endpoint (/health)
  • [ ] Request latency (p50, p95, p99) tracked
  • [ ] Error rate tracked and alerted
  • [ ] Business KPIs tracked as custom metrics
  • [ ] Distributed tracing enabled

Security:

  • [ ] Rate limiting per user and per IP
  • [ ] Request size limits (prevent large payload attacks)
  • [ ] Input validation on all endpoints
  • [ ] Secrets in Key Vault, not config files

Key Takeaways

  1. Profile before you scale — understand what's slow before adding servers
  2. Cache aggressively — most read performance problems are a cache problem
  3. Design for statelessness — it's what makes horizontal scaling possible
  4. Queue the non-critical work — never make users wait for background tasks
  5. Observe everything — you can't fix what you can't see
  6. Design for failure — every external dependency will fail at some point
  7. Scale the database last — it's the hardest to scale; exhaust other options first
  8. Connection pooling is free performance — configure it correctly from the start

The system that serves 10 million users isn't 10,000× more complex than the one serving 1,000. It's the same patterns applied layer by layer as load grows. Start with good fundamentals, add each layer when you need it, and measure everything.
