Interview Prep: Principal & Staff Level — System Design & Architecture

.NET / C# Interview Questions: Principal & Staff Level

Principal and staff engineer interviews test architectural judgement, not just technical knowledge. You are expected to discuss trade-offs, justify decisions, and demonstrate that you have designed systems under real constraints.


How to Use This Guide

These questions assume 7+ years of experience. For each:

  • State your recommendation first
  • Explain the trade-offs of alternatives
  • Describe how you would decide given different constraints
  • Use specific examples from real systems where possible

System Design

SD1: Design a rate limiter for a public API in .NET.

Start by clarifying requirements: per-user or per-IP? Sliding window or fixed window? Hard reject or soft queue? Global (across pods) or local?

Algorithm options:
  Fixed window:   simple, but burst at window boundary (100 req in last second of window + 100 in first second of next = 200 in 2 seconds)
  Sliding window: a log of per-request timestamps is exact but memory-hungry; a weighted counter over two adjacent windows is a cheap approximation
  Token bucket:   smooth bursting — refills at a rate, allows short bursts
  Leaky bucket:   constant output rate — good for queue-based systems

For a public API running on several pods → the counters need shared state, for example Redis:
C#
// Token bucket in Redis using Lua script (atomic read-modify-write)
public class RedisRateLimiter(IConnectionMultiplexer redis)
{
    private const string LuaScript = """
        local key        = KEYS[1]
        local rate       = tonumber(ARGV[1])   -- tokens per second
        local capacity   = tonumber(ARGV[2])   -- max burst
        local now        = tonumber(ARGV[3])   -- current timestamp (ms)
        local requested  = tonumber(ARGV[4])   -- tokens requested

        local last_tokens = tonumber(redis.call('hget', key, 'tokens') or capacity)
        local last_time   = tonumber(redis.call('hget', key, 'time')   or now)

        local elapsed  = math.max(0, now - last_time) / 1000
        local tokens   = math.min(capacity, last_tokens + elapsed * rate)

        if tokens >= requested then
            tokens = tokens - requested
            redis.call('hmset', key, 'tokens', tokens, 'time', now)
            redis.call('expire', key, math.ceil(capacity / rate) + 1)
            return 1   -- allowed
        end

        return 0   -- rejected
        """;

    public async Task<bool> IsAllowedAsync(string clientId, int rate = 100, int capacity = 200)
    {
        var db     = redis.GetDatabase();
        var key    = $"ratelimit:{clientId}";
        var now    = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        var result = (int)await db.ScriptEvaluateAsync(LuaScript,
            keys: [key],
            values: [rate, capacity, now, 1]);
        return result == 1;
    }
}

// In ASP.NET Core middleware: reject before hitting controllers
// Also consider: built-in .NET 7+ rate limiting middleware (AddRateLimiter / UseRateLimiter)
// with AddTokenBucketLimiter / AddSlidingWindowLimiter (in-memory, per process)

Trade-offs to discuss:

  • Local rate limiting (no Redis): faster, but each pod has independent limits — a client can hit N×limit across N pods
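  • Redis with Lua: atomic and accurate across pods, adds roughly one network round trip (~1ms) per request, and Redis becomes a shared dependency that can bottleneck or fail
  • Sliding window log: accurate but O(requests) memory per client
  • For most cases: the built-in .NET 7+ rate limiting middleware is the right starting point. It keeps state in process memory, so limits that must hold across pods still need a shared store such as the Redis script above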
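
The refill arithmetic in the Lua script can be easier to reason about in-process. The following is a minimal sketch, not the article's implementation: `TokenBucket` and `TryConsume` are hypothetical names, and the caller passes the clock in so the behaviour is deterministic. It is the local variant from the first trade-off above, so each pod would hold its own bucket.

```csharp
using System;

// In-process token bucket using the same arithmetic as the Lua script.
// Not shared across pods: every process keeps its own bucket.
public sealed class TokenBucket
{
    private readonly double _ratePerSecond;   // refill rate (tokens per second)
    private readonly double _capacity;        // maximum burst size
    private readonly object _gate = new();
    private double _tokens;
    private long _lastMs;

    public TokenBucket(double ratePerSecond, double capacity, long nowMs)
    {
        _ratePerSecond = ratePerSecond;
        _capacity = capacity;
        _tokens = capacity;   // a new client starts with a full bucket
        _lastMs = nowMs;
    }

    // nowMs would normally be DateTimeOffset.UtcNow.ToUnixTimeMilliseconds().
    public bool TryConsume(long nowMs, double requested = 1)
    {
        lock (_gate)
        {
            // Refill for the time elapsed since the last call, capped at capacity.
            double elapsedSeconds = Math.Max(0, nowMs - _lastMs) / 1000.0;
            _tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _ratePerSecond);
            _lastMs = Math.Max(_lastMs, nowMs);

            if (_tokens < requested) return false;   // rejected
            _tokens -= requested;
            return true;                             // allowed
        }
    }
}
```

With a rate of 1 token per second and a capacity of 2, two immediate calls succeed, a third is rejected, and one more succeeds a second later once a token has been refilled.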

SD2: Design the OrderFlow system to handle 100,000 orders per minute.

100,000 orders/min ≈ 1,667 orders/sec on average; peaks will be higher, so size with headroom

Bottlenecks in sequence:
1. API → order validation + DB write = synchronous path
2. Inventory check → external or internal service call
3. Payment processing → external, slow (200ms–2s)
4. Notifications → email/SMS — fire-and-forget

Architecture:
  [Client]
     │  REST POST /orders (idempotency-key header)
     ▼
  [API pods × 10] — stateless, horizontal
     │  validate, write to DB (command)
     │  publish to message queue
     ▼
  [Message Queue] — RabbitMQ / Azure Service Bus
     │
     ├─► [InventoryWorker pods × 5] — reserve stock
     ├─► [PaymentWorker pods × 5]   — charge card, retry on failure
     └─► [NotificationWorker × 3]   — email/SMS (low priority)

Database strategy:
  - Write: PostgreSQL primary (write path only)
  - Read:  PostgreSQL replica (order history queries)
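  - Connection pool: 100 connections per API pod (the Npgsql default maximum) × 10 pods = 1,000, far above PostgreSQL's default max_connections of 100; cap the pool per pod or put a pooler such as PgBouncer in front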
  - Alternative: CockroachDB or Cosmos DB for global distribution

Throughput math:
  1,667 writes/sec → 1 DB primary can handle 5,000–10,000 simple writes/sec
  → Single primary is fine; add sharding only if > 50k/sec sustained
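
The same arithmetic also sizes the workers. As a rough sketch (the `CapacityMath` helper is hypothetical), Little's law says the number of requests in flight equals the arrival rate multiplied by the time each request takes:

```csharp
using System;

public static class CapacityMath
{
    // Average arrival rate in requests per second.
    public static double PerSecond(double perMinute) => perMinute / 60.0;

    // Little's law: in-flight work = arrival rate x time each item spends in the system.
    public static double InFlight(double perSecond, double latencySeconds) =>
        perSecond * latencySeconds;
}
```

At 100,000 orders per minute (about 1,667 per second), a payment call taking 200ms to 2s implies roughly 333 to 3,333 payments in flight at once. Spread over 5 PaymentWorker pods, that is about 67 to 667 concurrent calls per pod, so each worker has to process messages concurrently rather than one at a time.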

SD3: Design a distributed lock in .NET.

Use case: prevent duplicate processing in a distributed system
(two workers picking up the same job simultaneously)

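Option 1: Redis SET with NX (only if Not eXists) + expiry, issued as one atomic command
  - Simple, fast, widely used
  - Holder crashes → the expiry releases the lock, so no deadlock
  - Risk: lock expires while holder is still working → two holders (mitigate with a TTL well above the expected work time, lock renewal, or fencing tokens)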

Option 2: Redlock (multi-node Redis)
  - Acquire lock on majority of N Redis nodes
  - Safer for high-stakes operations
  - More complex, still has edge cases under network partition
  - Martin Kleppmann's critique: Redlock relies on timing assumptions; when correctness depends on the lock, use a consensus system (ZooKeeper, etcd) with fencing tokens

Option 3: Database advisory locks (PostgreSQL pg_try_advisory_lock)
  - If you already have PostgreSQL → no extra infrastructure
  - Tied to DB connection lifetime — lock released on connection close
C#
// Redis distributed lock with StackExchange.Redis
public class RedisDistributedLock(IConnectionMultiplexer redis)
{
    public async Task<bool> TryAcquireAsync(string resource, TimeSpan ttl, string lockValue)
    {
        var db = redis.GetDatabase();
        return await db.StringSetAsync(
            $"lock:{resource}",
            lockValue,          // unique per holder (e.g., Guid)
            ttl,
            When.NotExists);    // SETNX — only set if not already locked
    }

    public async Task ReleaseAsync(string resource, string lockValue)
    {
        // Lua: only delete if we own it (prevents releasing another holder's lock)
        const string lua = """
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
            """;
        var db = redis.GetDatabase();
        await db.ScriptEvaluateAsync(lua,
            keys: [$"lock:{resource}"],
            values: [lockValue]);
    }
}

// Usage with automatic release
public async Task ProcessJobAsync(int jobId)
{
    var lockValue = Guid.NewGuid().ToString();
    var resource  = $"job:{jobId}";

    if (!await _lock.TryAcquireAsync(resource, TimeSpan.FromSeconds(30), lockValue))
    {
        logger.LogInformation("Job {JobId} already locked — skipping", jobId);
        return;
    }

    try { await DoWorkAsync(jobId); }
    finally { await _lock.ReleaseAsync(resource, lockValue); }
}
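
The "lock expires while holder is still working" risk is usually handled on the resource side with a fencing token: a number that grows with every lock acquisition and travels with each write. The sketch below is an illustration under assumptions, not part of the lock class above. `FencedStore` is a hypothetical in-memory resource, and the token is assumed to come from something monotonic, such as a Redis INCR performed when the lock is acquired.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// The protected resource remembers the highest token it has accepted per key
// and refuses writes that carry an older one.
public sealed class FencedStore
{
    private readonly Dictionary<string, long> _highestToken = new();
    private readonly Dictionary<string, string> _data = new();
    private readonly object _gate = new();

    public bool TryWrite(string key, string value, long fencingToken)
    {
        lock (_gate)
        {
            // A lower token means this writer's lock expired and a newer holder has written.
            if (_highestToken.TryGetValue(key, out var seen) && fencingToken < seen)
                return false;

            _highestToken[key] = fencingToken;
            _data[key] = value;
            return true;
        }
    }

    public string? Read(string key)
    {
        lock (_gate) return _data.TryGetValue(key, out var value) ? value : null;
    }
}
```

If worker A holds token 33, stalls past its TTL, and worker B acquires the lock with token 34 and writes, A's late write with token 33 is rejected instead of silently overwriting B's result.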

SD4: Design an event sourcing system in .NET.

Event sourcing: store state as a sequence of events, not current state.
  Traditional:   Orders table has { Id, Status, Total } — current snapshot
  Event sourced: EventStore has { OrderId, EventType, Payload, Timestamp }
                 Current state = replay all events for an OrderId

When to use event sourcing:
  ✓ Full audit trail is a business requirement (financial, healthcare)
  ✓ Time travel — reconstruct state at any point in history
  ✓ Event-driven architecture — events are already the core abstraction
  ✓ Debugging — replay events to reproduce bugs exactly

When NOT to use:
  ✗ Simple CRUD with no audit requirements — massive over-engineering
  ✗ Team unfamiliar with the pattern — steep learning curve
  ✗ Ad-hoc queries over current state: every query needs a projection/read model, which adds eventual consistency and upkeep
C#
// Event store — append-only
public interface IEventStore
{
    Task AppendAsync(string streamId, IEnumerable<IDomainEvent> events, int expectedVersion, CancellationToken ct);
    Task<IReadOnlyList<IDomainEvent>> LoadAsync(string streamId, CancellationToken ct);
}

// Aggregate rebuilt from events
public class Order
{
    public int    Id      { get; private set; }
    public string Status  { get; private set; } = "";
    private int   _version = 0;

    public static Order Rehydrate(IEnumerable<IDomainEvent> events)
    {
        var order = new Order();
        foreach (var e in events)
            order.Apply(e);
        return order;
    }

    private void Apply(IDomainEvent e) => _ = e switch
    {
        OrderCreatedEvent   created  => Apply(created),
        OrderPaidEvent      paid     => Apply(paid),
        OrderShippedEvent   shipped  => Apply(shipped),
        _                            => this,
    };

    private Order Apply(OrderCreatedEvent e) { Id = e.OrderId; Status = "Pending"; _version++; return this; }
    private Order Apply(OrderPaidEvent    e) { Status = "Paid";     _version++; return this; }
    private Order Apply(OrderShippedEvent e) { Status = "Shipped";  _version++; return this; }
}

// Read model (projection) — built asynchronously from events
public class OrderReadModelProjection
{
    public async Task HandleAsync(OrderCreatedEvent e)
    {
        await _readDb.UpsertAsync(new OrderReadModel(e.OrderId, "Pending", e.CustomerId));
    }
}
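
The `expectedVersion` parameter on `AppendAsync` is what gives an event store optimistic concurrency. A minimal in-memory sketch of that check follows; `InMemoryEventStore` and `ConcurrencyException` are hypothetical names, events are plain `object`s, and the methods are synchronous to keep the example short. A real store would enforce the same rule with a unique constraint on (stream id, version).

```csharp
using System;
using System.Collections.Generic;

public sealed class ConcurrencyException : Exception
{
    public ConcurrencyException(string message) : base(message) { }
}

// Append-only store keyed by stream id. expectedVersion is the number of events
// the writer saw when it loaded the stream.
public sealed class InMemoryEventStore
{
    private readonly Dictionary<string, List<object>> _streams = new();
    private readonly object _gate = new();

    public void Append(string streamId, IReadOnlyCollection<object> events, int expectedVersion)
    {
        lock (_gate)
        {
            if (!_streams.TryGetValue(streamId, out var stream))
                _streams[streamId] = stream = new List<object>();

            // Another writer appended since this writer loaded the stream: reject.
            if (stream.Count != expectedVersion)
                throw new ConcurrencyException(
                    $"Stream '{streamId}' is at version {stream.Count}, expected {expectedVersion}.");

            stream.AddRange(events);
        }
    }

    public IReadOnlyList<object> Load(string streamId)
    {
        lock (_gate)
            return _streams.TryGetValue(streamId, out var stream)
                ? stream.ToArray()            // copy, so callers cannot mutate the stream
                : Array.Empty<object>();
    }
}
```

Two handlers that both load an order at version 1 and both try to append will see one succeed and one throw; the loser reloads, re-applies its decision, and retries.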

Distributed Systems

DS1: What is the CAP theorem and how does it apply to .NET architecture decisions?

CAP theorem states a distributed system can guarantee at most two of three: Consistency, Availability, Partition tolerance. Since network partitions happen in any distributed system, the real choice is CP vs AP.

CP (Consistent + Partition-tolerant):
  - On partition: refuse writes to stay consistent
  - Examples: ZooKeeper, etcd, CockroachDB
  - .NET use case: distributed lock, leader election, configuration store

AP (Available + Partition-tolerant):
  - On partition: continue accepting writes, risk inconsistency
  - Examples: Cassandra, DynamoDB, Redis (asynchronous replication)
  - .NET use case: shopping cart, session state, caching

In practice for a .NET microservice system:
  - Order state: CP — you cannot oversell inventory (prefer consistency)
  - Product catalogue: AP — a stale cache is fine; availability matters more
  - User sessions: AP — better to show stale data than log the user out

DS2: Explain the Saga pattern and when to use it over a distributed transaction.

Distributed transactions (2PC — Two-Phase Commit):
  - All services must agree before any commit
  - One slow/down service blocks all others
  - Tight coupling — all services must support 2PC
  - Almost never correct for microservices

Saga pattern:
  - Each service executes a local transaction and publishes an event
  - If a step fails, compensating transactions undo prior steps
  - Two implementations:

  Choreography (event-driven):
    OrderService → publishes OrderCreated
    InventoryService → consumes, reserves stock, publishes StockReserved
    PaymentService → consumes, charges card, publishes PaymentProcessed
    OrderService → consumes, marks order Confirmed
    If the charge fails: PaymentService → publishes PaymentFailed
    InventoryService → consumes, releases reservation (compensating transaction)

  Orchestration (central coordinator):
    OrderOrchestrator sends commands to each service
    Easier to track state, but orchestrator can become a bottleneck

When to use:
  Saga: cross-service business workflows with eventual consistency tolerance
  Local DB transaction (not 2PC): multi-table operations inside a single database
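
The orchestration variant can be sketched in a few lines. This is an illustration only: `SagaStep` and `SagaOrchestrator` are hypothetical names, the steps are in-process delegates rather than commands sent to other services, and compensations are assumed to be idempotent.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// One saga step: a local transaction plus the action that undoes it.
public sealed record SagaStep(string Name, Func<Task> Execute, Func<Task> Compensate);

public static class SagaOrchestrator
{
    // Runs the steps in order. On failure, compensates the completed steps
    // in reverse order and returns false.
    public static async Task<bool> RunAsync(IReadOnlyList<SagaStep> steps)
    {
        var completed = new Stack<SagaStep>();
        foreach (var step in steps)
        {
            try
            {
                await step.Execute();
                completed.Push(step);
            }
            catch (Exception)
            {
                while (completed.Count > 0)
                    await completed.Pop().Compensate();
                return false;
            }
        }
        return true;
    }
}
```

A production orchestrator would also persist which steps have completed, so that compensation can resume after a crash. That persisted state is the main thing a saga library provides beyond this loop.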

DS3: How do you handle exactly-once message processing in .NET?

Most message brokers guarantee at-least-once delivery, so duplicates are inevitable.
Exactly-once processing is achieved by making consumers idempotent, not by the broker.

Strategy 1: Natural idempotency
  UPSERT instead of INSERT — running twice produces same result
  State machines — transitioning from Pending→Paid twice is a no-op

Strategy 2: Deduplication table
  CREATE TABLE ProcessedMessages (MessageId UUID PRIMARY KEY, ProcessedAt TIMESTAMP)
  In the same transaction as the business change:
    INSERT INTO ProcessedMessages ... ON CONFLICT DO NOTHING
  If 0 rows inserted: message already processed, skip it and acknowledge
  If 1 row inserted: first time, process and commit

Strategy 3: Idempotency key (for API calls)
  Client sends Idempotency-Key header
  Server stores result by key for 24h
  Duplicate request → return cached result without re-executing

The key insight: an at-least-once broker plus an idempotent, deduplicating consumer gives effectively-once processing
  without any changes to the broker or protocol
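
Strategy 2 can be sketched with an in-memory set standing in for the ProcessedMessages table. `IdempotentConsumer` is a hypothetical name. In a real consumer the dedup insert and the business write would share one database transaction, which the `TryRemove` in the catch block only approximates.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class IdempotentConsumer
{
    // In-memory stand-in for the ProcessedMessages table.
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _processed = new();

    // Returns true if the handler ran, false if the message was a duplicate.
    public async Task<bool> HandleAsync(Guid messageId, Func<Task> handler)
    {
        // TryAdd plays the role of INSERT ... ON CONFLICT DO NOTHING.
        if (!_processed.TryAdd(messageId, DateTimeOffset.UtcNow))
            return false;

        try
        {
            await handler();
            return true;
        }
        catch
        {
            // Stand-in for a rollback: forget the id so a redelivery is retried.
            _processed.TryRemove(messageId, out _);
            throw;
        }
    }
}
```

Delivering the same message id twice runs the handler once; the second delivery returns false and can simply be acknowledged.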

Architecture Decisions

AD1: When would you choose a monolith over microservices?

Start with a monolith when:
  - Team is small (< 10 engineers) — coordination overhead of microservices exceeds benefit
  - Domain is not yet well understood — premature service boundaries are expensive to undo
  - Time-to-market matters — one deployment, one repo, one test suite
  - You don't have DevOps maturity: microservices typically need container orchestration, distributed tracing, and often a service mesh

Move to microservices when:
  - Different services have genuinely different scaling needs (payment vs. catalogue)
  - Different teams own different services and need independent deploys
  - A monolith is causing deployment bottlenecks (everyone blocked by one release)
  - You have a clear bounded context that could be extracted without cross-cutting concerns

The trap: "distributed monolith" — microservices that are still tightly coupled
  at the database or API level, with all the microservice complexity but none of the benefits

Practical path:
  1. Modular monolith — strict module boundaries in one codebase
  2. Extract services only when specific bounded contexts have demonstrated need
  3. Never split just because it "feels" like microservices

AD2: How would you migrate a large .NET monolith to microservices without downtime?

The Strangler Fig pattern:

Phase 1 — Introduce a routing layer (YARP or Nginx):
  All traffic still goes to the monolith
  The proxy can re-route specific paths to new services as they're ready

Phase 2 — Extract one bounded context at a time:
  Choose the context with the clearest boundaries and lowest coupling
  Build the new service alongside the monolith (not instead of it)
  Run both simultaneously, switch traffic via the proxy
  Keep a kill switch — can re-route back to monolith immediately

Phase 3 — Event bridge for shared state:
  New service and monolith may need to share data during transition
  Change Data Capture (CDC) on the monolith DB publishes events
  New service consumes events to build its own read model

Phase 4 — Remove monolith code once new service is stable:
  Only delete code after 2+ weeks of stable traffic to the new service
  Keep DB tables until all consumers have migrated
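
The routing decision behind Phases 1 and 2 is small enough to sketch. In practice it would live in the proxy's route configuration (YARP or Nginx) rather than in hand-written code. `StranglerRouter` and its members are hypothetical, and the prefix matching is deliberately naive.

```csharp
using System;
using System.Collections.Generic;

// Sends extracted path prefixes to their new service and everything else to the monolith.
// Not thread-safe: a real proxy reloads its route table atomically.
public sealed class StranglerRouter
{
    private readonly string _monolithUrl;
    private readonly Dictionary<string, string> _extracted = new(StringComparer.OrdinalIgnoreCase);
    private readonly HashSet<string> _disabled = new(StringComparer.OrdinalIgnoreCase);

    public StranglerRouter(string monolithUrl) => _monolithUrl = monolithUrl;

    // Phase 2: register a new service for a path prefix.
    public void Extract(string pathPrefix, string serviceUrl) => _extracted[pathPrefix] = serviceUrl;

    // Kill switch: route a prefix back to the monolith without removing the registration.
    public void SetKillSwitch(string pathPrefix, bool sendToMonolith)
    {
        if (sendToMonolith) _disabled.Add(pathPrefix);
        else _disabled.Remove(pathPrefix);
    }

    public string Resolve(string requestPath)
    {
        foreach (var (prefix, serviceUrl) in _extracted)
        {
            if (requestPath.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                && !_disabled.Contains(prefix))
                return serviceUrl;
        }
        return _monolithUrl;   // Phase 1 default: everything goes to the monolith
    }
}
```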

Timeline reality:
  Each service extraction: 2–6 months for a well-scoped bounded context
  Full migration of a large system: 2–4 years
  Most teams stop at 3–5 services — that is fine

AD3: You inherit a .NET system with severe performance problems. How do you approach it?

Never optimise without measuring.

Step 1: Measure (before touching code)
  - APM tool: Application Insights, Datadog, or Jaeger traces
  - Identify the slowest endpoints by p95/p99 latency
  - Identify highest-frequency queries in the database (pg_stat_statements)
  - Check memory allocation: dotnet-counters, EventPipe, PerfView

Step 2: Find the actual bottleneck (it's almost always one of these)
  - N+1 queries — most common; fix with EF Core Include or a JOIN
  - Missing database index — check query plans (EXPLAIN ANALYZE)
  - Synchronous I/O blocking the thread pool — blocking .Result or .Wait()
  - Memory pressure — large allocations, LOH, GC pauses
  - Chatty service calls — N HTTP calls inside a loop

Step 3: Fix and verify
  - Fix one thing at a time
  - Re-measure after each fix — never assume a fix helped
  - Load test with k6 or NBomber — prove the improvement holds under load

Step 4: Structural improvements (if needed)
  - Add caching at the right layer (Redis for shared state, HybridCache for L1+L2)
  - Add CQRS read models if the query access pattern differs from the write model
  - Move expensive operations to background jobs (async + message queue)
  - Consider read replicas if DB reads are the bottleneck
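
Step 1 asks for p95/p99 rather than averages because a mean hides tail latency. A small sketch of a nearest-rank percentile makes the difference concrete; `LatencyStats` is a hypothetical helper, and an APM tool computes this for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    public static double Percentile(IEnumerable<double> samples, int percentile)
    {
        if (percentile < 1 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        double[] sorted = samples.OrderBy(x => x).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));

        int rank = (percentile * sorted.Length + 99) / 100;   // integer ceil(p/100 * n)
        return sorted[rank - 1];
    }
}
```

For 98 requests at 10ms and 2 requests at 1,000ms, the mean is 29.8ms and p95 is 10ms, but p99 is 1,000ms. Only the p99 figure reveals that 1 in 50 users waits a full second.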

Technical Leadership

TL1: How do you decide when to use a third-party library vs. building in-house?

Default: use the library. Build in-house only when:
  - The library does not exist for your specific need
  - The library's API is fundamentally incompatible with your model
  - Licensing prevents use (e.g., GPL copyleft in closed-source software you distribute)
  - The library is unmaintained and has unpatched CVEs

Evaluation criteria for any library:
  - NuGet download count and GitHub stars (proxy for community support)
  - Last commit date — unmaintained = future security debt
  - Issue tracker — how are critical bugs handled?
  - Breaking change policy — does the maintainer respect semver?
  - Does it ship its own transitive dependencies? (DLL hell risk)

For .NET specifically:
  ORM:          EF Core (Microsoft-backed, excellent for most cases)
                Dapper (high-performance, lightweight — not competing with EF Core)
  Validation:   FluentValidation (widely used, active)
  Messaging:    MassTransit (abstracts RabbitMQ/Azure SB), not raw client SDK
  Mapping:      Mapster (fast; can generate mapping code at build time) as an alternative to AutoMapper (convention-based; misconfigured maps surface at runtime)
  Testing:      xUnit + NSubstitute + FluentAssertions (proven combination)

What not to use:
  AutoMapper for complex mappings — use explicit mapping methods instead
  MediatR for every use case — overkill for simple CRUD, right tool for CQRS

TL2: How do you conduct an architecture review on a teammate's design?

Good architecture reviews are collaborative, not adversarial.

Framework: PRISM
  P — Problem: Is the problem statement clear? Does the design solve the right problem?
  R — Requirements: Does it meet functional and non-functional requirements?
       (Latency, throughput, SLA, data retention, compliance)
  I — Interfaces: Are service boundaries clean? Do APIs make sense to consumers?
  S — Scalability: What is the bottleneck? How does it behave at 10× load?
  M — Maintainability: Can the team operate and evolve this in 2 years?

Questions to always ask:
  "What happens when [dependency X] is down?"
  "What happens at 10× current load?"
  "How does this get deployed with zero downtime?"
  "How do we debug a problem in production?"
  "What is the rollback plan if this goes wrong?"

What to avoid:
  Bike-shedding on naming or style in a design review
  Requiring your preferred pattern when the proposed one also works
  Making it personal — critique the design, not the designer

TL3: A junior developer on your team is committing large, hard-to-review PRs. How do you address it?

Never criticise someone's process without helping them understand why it matters
and giving them a specific, actionable alternative.

1. First: understand the root cause
   - Are they intimidated by multiple small PRs?
   - Are they unclear on what "one concern" means for a PR?
   - Is the task itself poorly scoped (too large)?

2. Teach the principle, not just the rule
   "A PR should answer one question: did this change make the system better in one specific way?
    A 2,000-line PR makes that question unanswerable."

3. Pair on breaking it down
   Work with them to identify the seams in their next feature:
   - Schema migration → PR 1
   - Repository layer → PR 2
   - Application layer → PR 3
   - API endpoint + tests → PR 4

4. Acknowledge the difficulty
   "Breaking work into small, always-releasable units is a skill that takes time to develop.
    It's one of the hardest things to learn as an engineer."

5. Set clear expectations going forward
   "PRs over 400 lines need a decomposition plan in the description.
    Let's agree on that as a team standard."

.NET-Specific Deep Dives

DN1: How does the .NET garbage collector work and how do you tune it for a high-throughput API?

.NET GC generations:
  Gen 0: short-lived objects — collected frequently, fast (< 1ms)
  Gen 1: survived Gen 0 — medium lifetime
  Gen 2: long-lived objects — collected infrequently, can pause 10s of ms
  LOH:   Large Object Heap, for objects of 85,000 bytes or more; collected with Gen 2, not compacted by default, so it can fragment

For a high-throughput API:
  Goal: minimise Gen 2 and LOH collections

Tuning options:
  1. Server GC mode (default for ASP.NET Core on multi-core hosts, so usually already on)
     One heap and GC thread per logical core; tuned for throughput rather than the shortest pauses; Gen 2 runs as a background GC

  2. Reduce allocations on hot paths
     Use ArrayPool<T>.Shared instead of new T[]
     Use MemoryPool<T> for byte buffers
     Use Span<T> / ReadOnlySpan<char> for string parsing: slicing does not allocate
     Use struct for short-lived data (avoid boxing)

  3. Avoid large allocations
     Objects of 85,000 bytes or more go to the LOH, so pre-allocate and reuse large buffers
     JSON serialisation of large responses: use Utf8JsonWriter with a rented array

  4. GC settings in runtimeconfig.json (under "configProperties")
     "System.GC.HeapHardLimitPercent": 75   (cap the GC heap at 75% of the memory available to the process or container)
     "System.GC.ConserveMemory": 5          (0–9 scale; higher = GC works harder to keep the heap small)

  5. Profile first
     dotnet-counters monitor --process-id <PID> --counters System.Runtime
     Watch: gen-0-gc-count, gen-1-gc-count, gen-2-gc-count, loh-size
     A spike in gen-2-gc-count with high latency = GC pauses are your bottleneck
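
Options 2 and 3 usually come down to renting buffers instead of allocating them. A minimal sketch with `ArrayPool<byte>.Shared` follows; the `Checksums` class and its `fill` callback are hypothetical, chosen only to give the buffer something to do.

```csharp
using System;
using System.Buffers;

public static class Checksums
{
    // Sums `length` bytes produced by `fill`, using a pooled scratch buffer
    // instead of allocating a new array on every call.
    public static long SumWithPooledBuffer(int length, Action<byte[], int> fill)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(length);   // may be larger than requested
        try
        {
            fill(buffer, length);

            long sum = 0;
            foreach (byte b in buffer.AsSpan(0, length))       // only the first `length` bytes are ours
                sum += b;
            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);             // always return, even on exceptions
        }
    }
}
```

A 100,000-byte `new byte[]` would land on the LOH on every call, whereas the rented array is reused across calls. The two usual mistakes are reading past `length` (rented arrays can be longer and may contain stale data) and using the buffer after returning it.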

DN2: Explain how EF Core's change tracker works and when it causes problems.

EF Core tracks every loaded entity in an identity map keyed by primary key, alongside a snapshot of its original values.
On SaveChanges, it scans tracked entities, compares current vs original values,
and generates SQL for Added/Modified/Deleted entries.

Problems:
  1. Memory: long-lived DbContext (Singleton scope) accumulates tracked entities
     Fix: Scoped DbContext (default with AddDbContext)

  2. Performance: tracking overhead for read-only queries
     Fix: AsNoTracking() or UseQueryTrackingBehavior(QueryTrackingBehavior.NoTracking)

  3. Update behaviour: context.Update(entity) marks ALL properties Modified
     → generates UPDATE with ALL columns, even unchanged ones
     Fix: load tracked entity → modify → SaveChanges
     → EF Core generates UPDATE with only changed columns

  4. Graph tracking: loading an entity with Includes tracks ALL included entities
     Fix: use projection (Select) instead of Include for read-only data

  5. DetectChanges performance: SaveChanges calls DetectChanges, which scans every tracked entity
     The cost is O(n) in tracked entities, so a context holding 10,000 entities pays for all of them on each call
     Fix: set context.ChangeTracker.AutoDetectChangesEnabled = false for the bulk operation, call DetectChanges() once before SaveChanges, then re-enable it
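
Problems 3 and 5 both follow from snapshot comparison, which a toy model makes visible. The sketch below is not EF Core's implementation: `SnapshotTracker` is hypothetical and represents an entity as a dictionary of property values.

```csharp
using System;
using System.Collections.Generic;

public sealed class SnapshotTracker
{
    private readonly List<(Dictionary<string, object> Current, Dictionary<string, object> Original)> _entries = new();

    public int TrackedCount => _entries.Count;

    // Tracking copies the current values: that copy is the original-values snapshot.
    public void Track(Dictionary<string, object> entity) =>
        _entries.Add((entity, new Dictionary<string, object>(entity)));

    // Scans every tracked entity (O(n)) and returns the changed property names for each.
    public List<List<string>> DetectChanges()
    {
        var result = new List<List<string>>();
        foreach (var (current, original) in _entries)
        {
            var changed = new List<string>();
            foreach (var (name, value) in current)
            {
                if (!original.TryGetValue(name, out var before) || !Equals(before, value))
                    changed.Add(name);
            }
            result.Add(changed);
        }
        return result;
    }
}
```

Loading an order and changing only its status yields a single changed property, which is why load, modify, SaveChanges produces a narrow UPDATE. An entity attached without a snapshot has nothing to compare against, so every column has to be treated as modified.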

DN3: How would you implement a multi-region .NET deployment with data sovereignty?

Data sovereignty: customer data must stay in a specific geography (EU, US, APAC)
  Driven by: GDPR restrictions on transfers outside the EU/EEA, data residency contracts, government mandates

Architecture:

  Global tier (no PII):
    Azure Front Door / Cloudflare — global load balancer
    Routes requests to the correct regional deployment based on:
      a) User's JWT tenant_id claim (maps to region in tenant store)
      b) Subdomain (eu.api.example.com, us.api.example.com)

  Regional tier (per region, e.g., EU):
    .NET API pods in West Europe AKS cluster
    Azure Database for PostgreSQL Flexible Server in West Europe (no geo-replication for PII)
    Redis Cache in West Europe
    Azure Service Bus in West Europe

  Tenant routing service (global, no PII):
    Maps tenant_id → region
    Cached heavily, small dataset

Implementation in .NET:
  1. API receives request
  2. JWT decoded → tenant_id extracted
  3. TenantStore.GetRegionAsync(tenantId) → "eu" | "us" | "apac"
  4. If request landed on wrong region: return 307 redirect to correct region endpoint
  5. If correct region: process normally — all data stays local
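
Steps 3 to 5 reduce to one decision per request. A sketch under assumptions: `RegionRouter` and `RoutingDecision` are hypothetical, the tenant map is an in-memory dictionary standing in for the cached TenantStore, and in ASP.NET Core the redirect would be issued from middleware with status 307 so the HTTP method and body are preserved.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed record RoutingDecision(bool ProcessLocally, string? RedirectUrl);

public sealed class RegionRouter
{
    private readonly string _localRegion;
    private readonly IReadOnlyDictionary<string, string> _tenantRegions;    // tenant_id -> region
    private readonly IReadOnlyDictionary<string, string> _regionBaseUrls;   // region -> base URL

    public RegionRouter(
        string localRegion,
        IReadOnlyDictionary<string, string> tenantRegions,
        IReadOnlyDictionary<string, string> regionBaseUrls)
    {
        _localRegion = localRegion;
        _tenantRegions = tenantRegions;
        _regionBaseUrls = regionBaseUrls;
    }

    public RoutingDecision Decide(string tenantId, string pathAndQuery)
    {
        if (!_tenantRegions.TryGetValue(tenantId, out var region))
            throw new KeyNotFoundException($"Unknown tenant '{tenantId}'.");

        // Correct region: process normally, all data stays local.
        if (region == _localRegion)
            return new RoutingDecision(true, null);

        // Wrong region: hand back the URL for a 307 redirect.
        return new RoutingDecision(false, _regionBaseUrls[region] + pathAndQuery);
    }
}
```

The decision needs only the tenant id and the region map, so it can run before any PII is read or written in the wrong region.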

Cross-region concerns:
  - Analytics/reporting: aggregate anonymised data only — never PII
  - Backups: encrypted, stored in same region (no cross-region replication of PII)
  - Disaster recovery: within-region replica only

Interview Answer Templates

The "Tell me about a technically complex system you designed" answer structure:

1. Context: What was the business problem? Why did it matter?
2. Constraints: What made it hard? (Scale, latency, team size, timeline, existing system)
3. Options considered: What alternatives did you evaluate? Why did you rule them out?
4. Decision: What did you choose and why? What were the acknowledged trade-offs?
5. Outcome: What was the result? What would you do differently?

Example structure (30–60 seconds):
  "We needed to process 50,000 webhook deliveries per minute with at-least-once
   guarantee and per-tenant isolation. The constraint was that we couldn't change
   the database schema because 20 other services depended on it.

   I considered three approaches: [list them briefly]. We ruled out approach A because
   [specific reason]. Approach B was viable but added [trade-off].

   We chose C — [specific design] — accepting [specific trade-off] because [business reason].

   The result: [measurable outcome]. If I did it again, I would [specific improvement]."

The "How do you approach a bug you can't reproduce" answer structure:

1. Gather evidence before touching code: logs, traces, metrics, error rates
2. Form a hypothesis based on the evidence — don't guess randomly
3. Instrument the code to gather more evidence if needed
4. Reproduce in a lower environment using production data shape (Testcontainers)
5. Fix the smallest possible change that addresses the root cause
6. Add a regression test before deploying
7. Monitor after deploy — confirm the metric changed

Saying "I would add more logging and traces first" scores higher than
"I would reproduce it locally" — principal engineers think observability-first.