Design a High-Throughput Event Ingestion API in .NET

The Interview Question

"Design an API that ingests 50,000 events per second from IoT devices and webhooks. Events must be persisted reliably with low latency on the accept path. How would you build this in .NET?"

This separates candidates who know ASP.NET request handling from those who understand backpressure, batching, and when EF Core is the wrong tool.

Step 1: Requirements

Functional

POST /events — accept JSON event batch (1–100 events per request)
Events queryable within 30 seconds for dashboards
Dead-letter queue for malformed payloads

Non-functional

50,000 events/sec sustained, bursts to 80K
Accept path p99 under 20ms (return 202 quickly)
No event loss on process crash (durability after ACK)
Horizontal scale across 10 API instances

Event size: ~500 bytes average → ~25 MB/sec raw ingress.
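The sizing numbers can be checked with a few lines of arithmetic. This is a rough sketch that assumes load is spread evenly across the 10 instances, which a real load balancer will only approximate; the variable names are illustrative.

```csharp
using System;

const int eventsPerSecond = 50_000;
const int burstEventsPerSecond = 80_000;
const int averageEventBytes = 500;
const int apiInstances = 10;

// Raw ingress in decimal megabytes per second.
double sustainedMBps = eventsPerSecond * (double)averageEventBytes / 1_000_000;
double burstMBps = burstEventsPerSecond * (double)averageEventBytes / 1_000_000;

// Per-instance share, assuming an even spread.
int eventsPerInstance = eventsPerSecond / apiInstances;

// Request rate depends on batch size: 100 events per request vs 1 event per request.
int minRequestsPerSecond = eventsPerSecond / 100;
int maxRequestsPerSecond = eventsPerSecond / 1;

Console.WriteLine($"{sustainedMBps} MB/s sustained, {burstMBps} MB/s burst");
Console.WriteLine($"{eventsPerInstance} events/s per instance");
Console.WriteLine($"{minRequestsPerSecond} to {maxRequestsPerSecond} requests/s cluster-wide");
```

The spread between 500 and 50,000 requests per second is why batch size matters as much as event rate when sizing the accept path.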

Step 2: Wrong Approaches

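Synchronous EF SaveChanges per request. Every HTTP request pays for a database round-trip plus change-tracking overhead. With small batches, 50K events/sec can mean tens of thousands of requests per second, which the database cannot absorb one transaction at a time. Even at 1K RPS, connection pool exhaustion is likely.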

Unbounded in-memory queue. Process crash = lost events. OOM under burst traffic.

Single auto-increment ID generator. Bottleneck; use Snowflake-style IDs or DB-independent UUIDs.
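A Snowflake-style generator can be sketched as follows. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence), the 2024 epoch, and the class name are assumptions for illustration; how each instance obtains a unique worker ID is out of scope here.

```csharp
using System;

// Hypothetical Snowflake-style generator: timestamp | worker | sequence packed into a long.
public sealed class SnowflakeIdGenerator
{
    private const int WorkerBits = 10;
    private const int SequenceBits = 12;
    private const long MaxSequence = (1L << SequenceBits) - 1;

    // Assumed custom epoch; any fixed past instant works.
    private static readonly long Epoch =
        new DateTimeOffset(2024, 1, 1, 0, 0, 0, TimeSpan.Zero).ToUnixTimeMilliseconds();

    private readonly long _workerId;
    private readonly object _gate = new();
    private long _lastTimestamp = -1;
    private long _sequence;

    public SnowflakeIdGenerator(int workerId)
    {
        if (workerId < 0 || workerId >= (1 << WorkerBits))
            throw new ArgumentOutOfRangeException(nameof(workerId));
        _workerId = workerId;
    }

    public long NextId()
    {
        lock (_gate)
        {
            long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
            if (now < _lastTimestamp) now = _lastTimestamp; // clock moved backwards: reuse last timestamp

            if (now == _lastTimestamp)
            {
                _sequence = (_sequence + 1) & MaxSequence;
                if (_sequence == 0)
                {
                    // Sequence exhausted for this millisecond: spin until the clock advances.
                    while ((now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()) <= _lastTimestamp) { }
                }
            }
            else
            {
                _sequence = 0;
            }

            _lastTimestamp = now;
            return ((now - Epoch) << (WorkerBits + SequenceBits))
                 | (_workerId << SequenceBits)
                 | _sequence;
        }
    }
}
```

IDs from one generator are strictly increasing, which keeps B-tree inserts append-mostly. On .NET 9 and later, Guid.CreateVersion7() is a simpler way to get time-ordered, DB-independent IDs if 128 bits are acceptable.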

Step 3: Accept Fast, Process Async

POST /events
  1. Validate schema (minimal: type, timestamp, tenantId)
  2. Try to write to a bounded Channel
  3. On success → return 202 Accepted immediately
  4. On channel full → return 503 with Retry-After

Background consumer:
  5. Drain channel in batches of 500
  6. Bulk INSERT to PostgreSQL (COPY or multi-row INSERT)

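using System.Threading.Channels;
using Microsoft.AspNetCore.Mvc;

public class EventIngestionService
{
    // Capacity counts batches, not events: 10,000 batches of up to 100 events each.
    public const int Capacity = 10_000;

    private readonly Channel<EventBatch> _channel;

    public EventIngestionService()
    {
        _channel = Channel.CreateBounded<EventBatch>(new BoundedChannelOptions(Capacity)
        {
            // With Wait, TryWrite returns false when the channel is full.
            // The Drop* modes would accept the write and silently discard an event.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = false,
            SingleWriter = false,
        });
    }

    // Exposed so the background writer and health check can observe the channel.
    public ChannelReader<EventBatch> Reader => _channel.Reader;

    // Non-blocking: a full channel is reported to the caller instead of stalling the request.
    // (WaitToWriteAsync would wait for free space and never produce the 503.)
    public bool TryEnqueue(EventBatch batch) => _channel.Writer.TryWrite(batch);
}

[ApiController]
[Route("events")]
public class EventsController(EventIngestionService ingestion) : ControllerBase
{
    [HttpPost]
    public IActionResult Ingest([FromBody] EventBatchDto dto)
    {
        var batch = EventBatch.From(dto);
        if (!ingestion.TryEnqueue(batch))
        {
            // Send the hint as a real header so standard HTTP clients honour it.
            Response.Headers["Retry-After"] = "5";
            return StatusCode(StatusCodes.Status503ServiceUnavailable);
        }

        return Accepted(new { batchId = batch.Id });
    }
}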
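The 503 path relies on how a bounded channel behaves when it is full. The following stand-alone sketch uses a capacity of 2 and plain integers in place of event batches to show that behaviour.

```csharp
using System;
using System.Threading.Channels;

var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(2)
{
    FullMode = BoundedChannelFullMode.Wait
});

bool first = channel.Writer.TryWrite(1);   // fits
bool second = channel.Writer.TryWrite(2);  // fits, channel is now full
bool third = channel.Writer.TryWrite(3);   // rejected: this is where the API would return 503

channel.Reader.TryRead(out _);             // the consumer frees one slot
bool retry = channel.Writer.TryWrite(3);   // accepted again

Console.WriteLine($"{first} {second} {third} {retry}");
```

TryWrite only signals a full channel under FullMode.Wait. With DropWrite, DropNewest, or DropOldest it returns true and an item is discarded, which would turn backpressure into silent data loss.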

Step 4: Batched Persistence

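using System.Diagnostics;
using System.Threading.Channels;
using Dapper;
using Microsoft.Extensions.Hosting;
using Npgsql;

// Event is the assumed element type of EventBatch.Events.
public class EventBatchWriter(ChannelReader<EventBatch> reader, NpgsqlDataSource dataSource)
    : BackgroundService
{
    private const int MaxBatchSize = 500;
    private static readonly TimeSpan MaxWait = TimeSpan.FromMilliseconds(100);
    private readonly List<Event> _buffer = new(MaxBatchSize);

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        // Block until data arrives, then collect for up to 100 ms or 500 events,
        // so a quiet channel never leaves events sitting in the buffer.
        while (await reader.WaitToReadAsync(ct))
        {
            var elapsed = Stopwatch.StartNew();
            while (_buffer.Count < MaxBatchSize && elapsed.Elapsed < MaxWait)
            {
                if (reader.TryRead(out var batch))
                    _buffer.AddRange(batch.Events);
                else
                    await Task.Delay(5, ct); // brief pause while more events accumulate
            }
            await FlushAsync(ct);
        }
    }

    private async Task FlushAsync(CancellationToken ct)
    {
        if (_buffer.Count == 0) return;

        // One multi-row INSERT per flush; UNNEST turns each array parameter into a column.
        // This avoids per-row round-trips and EF change tracking.
        const string sql = """
            INSERT INTO events (id, tenant_id, type, payload, received_at)
            SELECT * FROM UNNEST(@ids, @tenantIds, @types, @payloads::jsonb[], @receivedAt)
            """;

        await using var connection = await dataSource.OpenConnectionAsync(ct);
        await connection.ExecuteAsync(
            new CommandDefinition(sql, new { /* arrays */ }, cancellationToken: ct));
        _buffer.Clear();
    }
}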

EF Core vs Dapper vs COPY — Decision Matrix

| Tool | Throughput | When to use |
|------|------------|-------------|
| EF Core AddRange + SaveChanges | Low | Complex domain logic per event |
| Dapper multi-row INSERT | High | Simple event log, no change tracking |
| PostgreSQL COPY / BinaryImporter | Highest | 50K+ rows/sec, append-only log |

Interview answer: "EF Core is wrong for the hot ingestion path — I'd use Dapper or Npgsql COPY. EF stays on the read/admin side if needed."
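For the COPY row of the matrix, a minimal sketch using Npgsql's binary importer could look like this. The EventRow record, the EventCopyWriter name, and the text column types for tenant_id and type are assumptions; running it needs a reachable PostgreSQL database with a matching events table.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;
using NpgsqlTypes;

// Hypothetical row shape matching the events table columns.
public sealed record EventRow(
    Guid Id, string TenantId, string Type, string PayloadJson, DateTime ReceivedAtUtc);

public static class EventCopyWriter
{
    public const string CopyCommand =
        "COPY events (id, tenant_id, type, payload, received_at) FROM STDIN (FORMAT BINARY)";

    public static async Task<ulong> WriteAsync(
        NpgsqlDataSource dataSource, IReadOnlyCollection<EventRow> rows, CancellationToken ct)
    {
        await using var connection = await dataSource.OpenConnectionAsync(ct);
        await using var importer = await connection.BeginBinaryImportAsync(CopyCommand, ct);

        foreach (var row in rows)
        {
            await importer.StartRowAsync(ct);
            await importer.WriteAsync(row.Id, NpgsqlDbType.Uuid, ct);
            await importer.WriteAsync(row.TenantId, NpgsqlDbType.Text, ct);
            await importer.WriteAsync(row.Type, NpgsqlDbType.Text, ct);
            await importer.WriteAsync(row.PayloadJson, NpgsqlDbType.Jsonb, ct);
            // timestamptz expects a DateTime with Kind = Utc.
            await importer.WriteAsync(row.ReceivedAtUtc, NpgsqlDbType.TimestampTz, ct);
        }

        // Nothing is committed until CompleteAsync; disposing without it cancels the COPY.
        return await importer.CompleteAsync(ct);
    }
}
```

COPY streams rows without building a parameterised statement, so it tends to pull ahead of multi-row INSERT as batch sizes grow. The trade-off is that one bad row fails the whole batch.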

Step 5: Architecture

  Devices / Webhooks
         │
         ▼
┌─────────────────┐   bounded    ┌──────────────────┐
│  Kestrel (×10)  │───Channel───▶│  BatchWriter     │
│  POST /events   │              │  (per instance)  │
│  → 202 Accepted │              └────────┬─────────┘
└─────────────────┘                       │
                                          ▼
                          ┌───────────────────────────────┐
                          │  PostgreSQL                   │
                          │  events (partitioned by day)  │
                          └───────────────────────────────┘

Partitioning: PARTITION BY RANGE (received_at) — drop old partitions instead of DELETE.
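Daily partitions have to be created ahead of time and dropped on a schedule. The helper below is a hypothetical sketch that only builds the DDL strings; the events_yyyy_MM_dd naming scheme is an assumption, and the parent table is assumed to be declared with PARTITION BY RANGE (received_at).

```csharp
using System;
using System.Globalization;

public static class EventPartitions
{
    private static string Name(DateOnly day) =>
        "events_" + day.ToString("yyyy_MM_dd", CultureInfo.InvariantCulture);

    private static string Literal(DateOnly day) =>
        day.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

    // Range partitions include the lower bound and exclude the upper bound,
    // so consecutive days do not overlap.
    public static string CreateSql(DateOnly day) =>
        $"CREATE TABLE IF NOT EXISTS {Name(day)} PARTITION OF events " +
        $"FOR VALUES FROM ('{Literal(day)}') TO ('{Literal(day.AddDays(1))}');";

    // Dropping a partition removes a day of data without row-by-row DELETE.
    public static string DropSql(DateOnly day) =>
        $"DROP TABLE IF EXISTS {Name(day)};";
}
```

A scheduled job would typically run CreateSql for the next few days and DropSql for days past the retention window.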

Alternative at higher scale: Accept path writes to Kafka / Azure Event Hubs, separate consumer group does DB writes. Adds operational complexity but decouples ingest from persistence completely.

Step 6: Backpressure & Health

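// Health check: fail readiness if the channel is more than 90% full
public class IngestionHealthCheck(ChannelReader<EventBatch> reader, int capacity) : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext ctx, CancellationToken ct = default)
    {
        // Channel<T> has no Count or Capacity of its own: the reader exposes Count,
        // and the capacity is the value that was passed to CreateBounded.
        var utilization = (double)reader.Count / capacity;

        // Unhealthy maps to HTTP 503 by default. Degraded would still return 200,
        // so the pod would stay in rotation.
        return Task.FromResult(utilization > 0.9
            ? HealthCheckResult.Unhealthy("Ingestion backlog high")
            : HealthCheckResult.Healthy());
    }
}

With this check behind the readiness probe, Kubernetes removes overloaded pods from rotation and load shifts to healthy instances.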

Rate limiting: Per-tenant token bucket in Redis — prevents one tenant from starving others.
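The token bucket logic itself is small. The sketch below keeps the buckets in process memory purely to show the algorithm; a Redis-backed version would hold the same two values (tokens and last refill time) per tenant key and update them atomically, for example in a Lua script. The class name and the injectable clock are assumptions.

```csharp
using System;
using System.Collections.Concurrent;

// In-process stand-in for a per-tenant token bucket.
public sealed class TenantTokenBucket
{
    private sealed class Bucket
    {
        public double Tokens;
        public DateTimeOffset LastRefill;
    }

    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _clock;
    private readonly ConcurrentDictionary<string, Bucket> _buckets = new();

    public TenantTokenBucket(double capacity, double refillPerSecond, Func<DateTimeOffset>? clock = null)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Returns true and deducts `cost` tokens if the tenant has enough budget.
    public bool TryConsume(string tenantId, int cost = 1)
    {
        var now = _clock();
        var bucket = _buckets.GetOrAdd(tenantId, _ => new Bucket { Tokens = _capacity, LastRefill = now });

        lock (bucket)
        {
            // Refill in proportion to elapsed time, capped at the bucket capacity.
            double elapsedSeconds = (now - bucket.LastRefill).TotalSeconds;
            if (elapsedSeconds > 0)
            {
                bucket.Tokens = Math.Min(_capacity, bucket.Tokens + elapsedSeconds * _refillPerSecond);
                bucket.LastRefill = now;
            }

            if (bucket.Tokens < cost) return false;
            bucket.Tokens -= cost;
            return true;
        }
    }
}
```

Charging cost as the number of events in the batch, rather than 1 per request, keeps a tenant from bypassing the limit by sending maximum-size batches.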

Step 7: Durability Trade-off

| ACK timing | Durability | Latency |
|------------|------------|---------|
| 202 after Channel write | Lost if process crashes before flush | Lowest |
| 202 after DB commit | Fully durable | Higher (defeats purpose) |
| 202 after Channel + WAL spill to disk | Durable with local file | Medium |

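Production pattern: Write to Channel, background flusher persists to DB. This accepts losing whatever is still in memory if the process crashes, so on its own it does not meet the "no event loss after ACK" requirement from Step 1. For stricter durability, spill to a local append-only file (or Redis Stream) before ACK.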
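The local spill can be sketched as a length-prefixed append-only file that is flushed to disk before the 202 is sent. The AppendOnlyLog name and the 4-byte little-endian length prefix are assumptions; log rotation and replay-on-startup wiring are left out.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;

// Hypothetical write-ahead spill file: [int32 length][payload] repeated.
public sealed class AppendOnlyLog : IDisposable
{
    private readonly FileStream _stream;
    private readonly object _gate = new();

    public AppendOnlyLog(string path) =>
        _stream = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);

    // Returns only after the record has been flushed to disk.
    public void Append(ReadOnlySpan<byte> payload)
    {
        Span<byte> header = stackalloc byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(header, payload.Length);

        lock (_gate)
        {
            _stream.Write(header);
            _stream.Write(payload);
            _stream.Flush(flushToDisk: true);
        }
    }

    // Replays complete records; a torn record at the tail (crash mid-write) is skipped.
    public static IEnumerable<byte[]> ReadAll(string path)
    {
        using var stream = File.OpenRead(path);
        var header = new byte[4];
        while (stream.ReadAtLeast(header, 4, throwOnEndOfStream: false) == 4)
        {
            int length = BinaryPrimitives.ReadInt32LittleEndian(header);
            var payload = new byte[length];
            if (stream.ReadAtLeast(payload, length, throwOnEndOfStream: false) != length)
                yield break;
            yield return payload;
        }
    }

    public void Dispose() => _stream.Dispose();
}
```

Flushing to disk on every request is the expensive part. A common refinement is group commit, where several pending requests share one flush, which trades a little accept latency for far fewer disk syncs.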

What Interviewers Are Testing

Separation of accept and persist paths — never block HTTP on DB
Channel<T> — bounded, backpressure, in-process producer-consumer
Batching — amortise DB round-trips
Tool choice — articulate why not EF on hot path
Horizontal scale — stateless API, partition-aware DB
Observability — channel depth metric, flush latency, events/sec
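The three metrics in the last bullet map directly onto System.Diagnostics.Metrics instruments. This is a minimal sketch; the meter and instrument names are assumptions, and exporting them (OpenTelemetry, Prometheus, dotnet-counters) is configured separately.

```csharp
using System;
using System.Diagnostics.Metrics;

public sealed class IngestionMetrics : IDisposable
{
    private readonly Meter _meter = new("EventIngestion");
    private readonly Counter<long> _eventsPersisted;
    private readonly Histogram<double> _flushDurationMs;

    // readChannelDepth is polled by the metrics pipeline, e.g. () => reader.Count.
    public IngestionMetrics(Func<int> readChannelDepth)
    {
        _meter.CreateObservableGauge("ingestion.channel.depth", readChannelDepth);
        _eventsPersisted = _meter.CreateCounter<long>("ingestion.events.persisted");
        _flushDurationMs = _meter.CreateHistogram<double>("ingestion.flush.duration", unit: "ms");
    }

    // Call once per successful flush; events/sec is derived from the counter's rate.
    public void RecordFlush(int eventCount, TimeSpan duration)
    {
        _eventsPersisted.Add(eventCount);
        _flushDurationMs.Record(duration.TotalMilliseconds);
    }

    public void Dispose() => _meter.Dispose();
}
```

Channel depth is the leading indicator: it climbs before 503s start, so alerting on it gives time to scale out or investigate slow flushes.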

Strong closing: "I'd load-test with NBomber or k6, watch Gen2 GC and pool exhaustion, and only add Kafka when single-region PostgreSQL batching stops meeting SLO — not before."