.NET / C# Interview Questions: Principal & Staff Level
Principal and staff engineer .NET interview questions: system design, distributed systems, architectural trade-offs, migration strategy, team-level decisions, and deep technical leadership questions.
Principal and staff engineer interviews test architectural judgement, not just technical knowledge. You are expected to discuss trade-offs, justify decisions, and demonstrate that you have designed systems under real constraints.
How to Use This Guide
These questions assume 7+ years of experience. For each:
- State your recommendation first
- Explain the trade-offs of alternatives
- Describe how you would decide given different constraints
- Use specific examples from real systems where possible
System Design
SD1: Design a rate limiter for a public API in .NET.
Start by clarifying requirements: per-user or per-IP? Sliding window or fixed window? Hard reject or soft queue? Global (across pods) or local?
Algorithm options:
- Fixed window: simple, but allows a burst at the window boundary (100 requests in the last second of one window + 100 in the first second of the next = 200 in 2 seconds)
- Sliding window: accurate but expensive, because it stores per-request timestamps
- Token bucket: smooth bursting; refills at a fixed rate and allows short bursts
- Leaky bucket: constant output rate; good for queue-based systems
For a public API running on multiple pods, the limiter state must be shared, for example in Redis:
// Token bucket in Redis using a Lua script (atomic read-modify-write)
using StackExchange.Redis;
public class RedisRateLimiter(IConnectionMultiplexer redis)
{
    private const string LuaScript = """
        local key = KEYS[1]
        local rate = tonumber(ARGV[1])      -- tokens per second
        local capacity = tonumber(ARGV[2])  -- max burst
        local now = tonumber(ARGV[3])       -- current timestamp (ms)
        local requested = tonumber(ARGV[4]) -- tokens requested
        local last_tokens = tonumber(redis.call('hget', key, 'tokens') or capacity)
        local last_time = tonumber(redis.call('hget', key, 'time') or now)
        local elapsed = math.max(0, now - last_time) / 1000
        local tokens = math.min(capacity, last_tokens + elapsed * rate)
        if tokens >= requested then
            tokens = tokens - requested
            redis.call('hmset', key, 'tokens', tokens, 'time', now)
            redis.call('expire', key, math.ceil(capacity / rate) + 1)
            return 1 -- allowed
        end
        return 0 -- rejected
        """;
    public async Task<bool> IsAllowedAsync(string clientId, int rate = 100, int capacity = 200)
    {
        var db = redis.GetDatabase();
        var key = $"ratelimit:{clientId}";
        // The timestamp comes from the calling pod, so clock skew between pods affects refill accuracy
        var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        var result = (int)await db.ScriptEvaluateAsync(LuaScript,
            keys: [key],
            values: [rate, capacity, now, 1]);
        return result == 1;
    }
}
// In ASP.NET Core middleware: reject before hitting controllers
// Also consider: the built-in ASP.NET Core rate limiting middleware in .NET 7+
// (AddRateLimiter / UseRateLimiter) with AddTokenBucketLimiter / AddSlidingWindowLimiter
Trade-offs to discuss:
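- Local rate limiting (no Redis): faster, but each pod has independent limits, so a client can reach N × limit across N pods
- Redis with Lua: atomic and accurate, but adds a network round trip per request (roughly 1 ms inside a data centre), and Redis becomes a shared dependency and a potential bottleneck
- Sliding window log: accurate but O(requests) memory per client
- For most cases: start with the built-in ASP.NET Core rate limiting middleware (.NET 7+). Its limiters keep their state in process, so a limit that must hold globally across pods still needs a shared store such as Redis.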
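The refill arithmetic in the Lua script above is easier to reason about, and to unit test, when it is written in process. The following is a minimal sketch under stated assumptions: `TokenBucket` and `TryTake` are illustrative names rather than a library API, and the caller passes in the clock so the behaviour is deterministic.

```csharp
using System;

// In-process token bucket mirroring the refill arithmetic of the Lua script.
// Illustrative only: per-pod state, so it does not enforce a global limit.
public sealed class TokenBucket
{
    private readonly double _ratePerSecond; // tokens added per second
    private readonly double _capacity;      // maximum burst size
    private readonly object _gate = new();
    private double _tokens;
    private long _lastMs;

    public TokenBucket(double ratePerSecond, double capacity, long nowMs)
    {
        _ratePerSecond = ratePerSecond;
        _capacity = capacity;
        _tokens = capacity; // a new client starts with a full bucket
        _lastMs = nowMs;
    }

    // Returns true when the request is allowed. nowMs is a Unix timestamp in milliseconds.
    public bool TryTake(long nowMs, int requested = 1)
    {
        lock (_gate)
        {
            var elapsedSeconds = Math.Max(0, nowMs - _lastMs) / 1000.0;
            var tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _ratePerSecond);
            if (tokens < requested)
                return false; // rejected: stored state is left untouched, as in the script

            _tokens = tokens - requested;
            _lastMs = nowMs;
            return true;
        }
    }
}
```

With a rate of 10 tokens per second and a capacity of 2, two immediate calls succeed, a third is rejected, and a call 150 ms later succeeds again because 1.5 tokens have been refilled.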
SD2: Design the OrderFlow system to handle 100,000 orders per minute.
100,000 orders/min ≈ 1,667 orders/sec on average; size for peaks above that
Bottlenecks in sequence:
1. API: order validation + DB write = synchronous path
2. Inventory check: external or internal service call
3. Payment processing: external, slow (200 ms to 2 s)
4. Notifications: email/SMS, fire-and-forget
Architecture:
[Client]
   │ REST POST /orders (idempotency-key header)
   ▼
[API pods × 10] (stateless, horizontal)
   │ validate, write to DB (command)
   │ publish to message queue
   ▼
[Message Queue] (RabbitMQ / Azure Service Bus)
   │
   ├─► [InventoryWorker pods × 5]: reserve stock
   ├─► [PaymentWorker pods × 5]: charge card, retry on failure
   └─► [NotificationWorker × 3]: email/SMS (low priority)
Database strategy:
- Write: PostgreSQL primary (write path only)
- Read: PostgreSQL replica (order history queries)
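- Connection pool: at most 100 connections per API pod × 10 pods = 1,000 connections. That exceeds PostgreSQL's default max_connections of 100, so lower the per-pod pool size or put a pooler such as PgBouncer in front.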
- Alternative: CockroachDB or Cosmos DB for global distribution
Throughput math:
1,667 writes/sec: a single DB primary can typically handle 5,000 to 10,000 simple writes/sec, depending on hardware and schema
→ A single primary is fine; add sharding only if sustained load exceeds 50k writes/sec
SD3: Design a distributed lock in .NET.
Use case: prevent duplicate processing in a distributed system
(two workers picking up the same job simultaneously)
Option 1: Redis SETNX (SET if Not eXists) + expiry
- Simple, fast, widely used
- Risk: lock holder crashes; the expiry prevents a permanent deadlock
- Risk: lock expires while the holder is still working, so two holders run at once
Option 2: Redlock (multi-node Redis)
- Acquire lock on majority of N Redis nodes
- Safer for high-stakes operations
- More complex, still has edge cases under network partition
- Martin Kleppmann's critique: use ZooKeeper or etcd for true distributed locks
Option 3: Database advisory locks (PostgreSQL pg_try_advisory_lock)
- If you already have PostgreSQL: no extra infrastructure
- Session-level advisory locks are tied to the DB connection lifetime: the lock is released when the connection closes
// Redis distributed lock with StackExchange.Redis
public class RedisDistributedLock(IConnectionMultiplexer redis)
{
    public async Task<bool> TryAcquireAsync(string resource, TimeSpan ttl, string lockValue)
    {
        var db = redis.GetDatabase();
        return await db.StringSetAsync(
            $"lock:{resource}",
            lockValue,        // unique per holder (e.g., Guid)
            ttl,
            When.NotExists);  // SET ... NX: only set if not already locked
    }
    public async Task ReleaseAsync(string resource, string lockValue)
    {
        // Lua: only delete if we own it (prevents releasing another holder's lock)
        const string lua = """
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
            """;
        var db = redis.GetDatabase();
        await db.ScriptEvaluateAsync(lua,
            keys: [$"lock:{resource}"],
            values: [lockValue]);
    }
}
// Usage with automatic release
// (_lock is a RedisDistributedLock and logger an ILogger, both injected; declarations not shown)
public async Task ProcessJobAsync(int jobId)
{
    var lockValue = Guid.NewGuid().ToString();
    var resource = $"job:{jobId}";
    if (!await _lock.TryAcquireAsync(resource, TimeSpan.FromSeconds(30), lockValue))
    {
        logger.LogInformation("Job {JobId} already locked, skipping", jobId);
        return;
    }
    // If the work can outlive the 30-second TTL, the lock must be extended or a second holder may start
    try { await DoWorkAsync(jobId); }
    finally { await _lock.ReleaseAsync(resource, lockValue); }
}
SD4: Design an event sourcing system in .NET.
Event sourcing: store state as a sequence of events, not current state.
Traditional: Orders table has { Id, Status, Total } (current snapshot)
Event sourced: EventStore has { OrderId, EventType, Payload, Timestamp }
Current state = replay all events for an OrderId
When to use event sourcing:
- Full audit trail is a business requirement (financial, healthcare)
- Time travel: reconstruct state at any point in history
- Event-driven architecture: events are already the core abstraction
- Debugging: replay events to reproduce bugs exactly
When NOT to use:
- Simple CRUD with no audit requirements: massive over-engineering
- Team unfamiliar with the pattern: steep learning curve
- High read throughput: queries require projections / read models that must be kept up to date
// Event store: append-only
public interface IEventStore
{
    Task AppendAsync(string streamId, IEnumerable<IDomainEvent> events, int expectedVersion, CancellationToken ct);
    Task<IReadOnlyList<IDomainEvent>> LoadAsync(string streamId, CancellationToken ct);
}
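The `expectedVersion` parameter implies optimistic concurrency: an append succeeds only if nobody else has written to the stream since it was loaded. A minimal in-memory sketch of that check follows. `InMemoryEventStore` and `WrongExpectedVersionException` are hypothetical names, the version is assumed to equal the number of stored events, and the event types are redeclared here only to keep the block self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public interface IDomainEvent { } // stand-in marker interface for this sketch
public sealed record OrderCreatedEvent(int OrderId) : IDomainEvent;
public sealed record OrderPaidEvent(int OrderId) : IDomainEvent;

public sealed class WrongExpectedVersionException(string message) : Exception(message);

public sealed class InMemoryEventStore
{
    private readonly Dictionary<string, List<IDomainEvent>> _streams = new();
    private readonly object _gate = new();

    public Task AppendAsync(string streamId, IEnumerable<IDomainEvent> events,
                            int expectedVersion, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        lock (_gate)
        {
            if (!_streams.TryGetValue(streamId, out var stream))
                _streams[streamId] = stream = new List<IDomainEvent>();

            // Assumed convention: version == number of events already stored.
            // A mismatch means another writer appended first.
            if (stream.Count != expectedVersion)
                return Task.FromException(new WrongExpectedVersionException(
                    $"Stream {streamId} is at version {stream.Count}, expected {expectedVersion}"));

            stream.AddRange(events);
            return Task.CompletedTask;
        }
    }

    public Task<IReadOnlyList<IDomainEvent>> LoadAsync(string streamId, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        lock (_gate)
        {
            // Return a copy so callers cannot mutate the stored stream.
            IReadOnlyList<IDomainEvent> snapshot = _streams.TryGetValue(streamId, out var stream)
                ? stream.ToList()
                : new List<IDomainEvent>();
            return Task.FromResult(snapshot);
        }
    }
}
```

A caller that loses the race catches the exception, reloads the stream, re-applies its decision against the new state, and retries the append.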
// Aggregate rebuilt from events
public class Order
{
    public int Id { get; private set; }
    public string Status { get; private set; } = "";
    private int _version = 0;
    public static Order Rehydrate(IEnumerable<IDomainEvent> events)
    {
        var order = new Order();
        foreach (var e in events)
            order.Apply(e);
        return order;
    }
    // Dispatch on the runtime event type; unknown event types are ignored
    private void Apply(IDomainEvent e) => _ = e switch
    {
        OrderCreatedEvent created => Apply(created),
        OrderPaidEvent paid => Apply(paid),
        OrderShippedEvent shipped => Apply(shipped),
        _ => this,
    };
    private Order Apply(OrderCreatedEvent e) { Id = e.OrderId; Status = "Pending"; _version++; return this; }
    private Order Apply(OrderPaidEvent e) { Status = "Paid"; _version++; return this; }
    private Order Apply(OrderShippedEvent e) { Status = "Shipped"; _version++; return this; }
}
// Read model (projection): built asynchronously from events
public class OrderReadModelProjection
{
    // _readDb is the injected read-model store (declaration not shown)
    public async Task HandleAsync(OrderCreatedEvent e)
    {
        await _readDb.UpsertAsync(new OrderReadModel(e.OrderId, "Pending", e.CustomerId));
    }
}
Distributed Systems
DS1: What is the CAP theorem and how does it apply to .NET architecture decisions?
CAP theorem states a distributed system can guarantee at most two of three: Consistency, Availability, Partition tolerance. Since network partitions happen in any distributed system, the real choice is CP vs AP.
CP (Consistent + Partition-tolerant):
- On partition: refuse writes to stay consistent
- Examples: ZooKeeper, etcd, CockroachDB (strong consistency mode)
- .NET use case: distributed lock, leader election, configuration store
AP (Available + Partition-tolerant):
- On partition: continue accepting writes, risk inconsistency
- Examples: Cassandra, DynamoDB, Redis (asynchronous replication)
- .NET use case: shopping cart, session state, caching
In practice for a .NET microservice system:
- Order state: CP. You cannot oversell inventory, so prefer consistency
- Product catalogue: AP. A stale cache is fine; availability matters more
- User sessions: AP. Better to show stale data than log the user out
DS2: Explain the Saga pattern and when to use it over a distributed transaction.
Distributed transactions (2PC, Two-Phase Commit):
- All services must agree before any commit
- One slow/down service blocks all others
- Tight coupling: all services must support 2PC
- Rarely a good fit for microservices
Saga pattern:
- Each service executes a local transaction and publishes an event
- If a step fails, compensating transactions undo prior steps
- Two implementations:
Choreography (event-driven):
OrderService → publishes OrderCreated
InventoryService → consumes, reserves stock, publishes StockReserved
PaymentService → consumes, charges card, publishes PaymentProcessed
OrderService → consumes, marks order Confirmed
If payment fails: PaymentService → publishes PaymentFailed
InventoryService → consumes, releases reservation (compensating transaction)
Orchestration (central coordinator):
OrderOrchestrator sends commands to each service
Easier to track state, but orchestrator can become a bottleneck
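The orchestration variant can be sketched as a loop that runs each step and, when one fails, runs the compensations of the completed steps in reverse order. This is a minimal in-process illustration; `SagaStep` and `SagaOrchestrator` are hypothetical names, and a production orchestrator would also persist its progress and retry compensations.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// One saga step: a local action plus the compensation that undoes it.
public sealed record SagaStep(string Name, Func<Task> Execute, Func<Task> Compensate);

public static class SagaOrchestrator
{
    // Runs the steps in order. On the first failure, compensates the steps that
    // already completed, newest first. Returns true when every step succeeded.
    public static async Task<bool> RunAsync(IReadOnlyList<SagaStep> steps, Action<string>? log = null)
    {
        var completed = new Stack<SagaStep>();
        foreach (var step in steps)
        {
            try
            {
                await step.Execute();
                completed.Push(step);
                log?.Invoke($"done:{step.Name}");
            }
            catch (Exception ex)
            {
                log?.Invoke($"failed:{step.Name}:{ex.Message}");
                while (completed.Count > 0)
                {
                    var undo = completed.Pop();
                    await undo.Compensate(); // no retry here; a real system would need one
                    log?.Invoke($"compensated:{undo.Name}");
                }
                return false;
            }
        }
        return true;
    }
}
```

For an order flow, the steps would be "reserve stock" (compensation: release the reservation) followed by "charge card" (compensation: refund). If the charge throws, only the reservation is compensated, because the failed step itself never completed.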
When to use:
Saga: cross-service business workflows that can tolerate eventual consistency
2PC: rarely; when all the data lives in one database, use an ordinary DB transaction instead
DS3: How do you handle exactly-once message processing in .NET?
Most message brokers guarantee at-least-once delivery, so duplicates are inevitable.
Exactly-once processing is achieved by making consumers idempotent, not by the broker.
Strategy 1: Natural idempotency
UPSERT instead of INSERT: running twice produces the same result
State machines: transitioning from Pending → Paid twice is a no-op
Strategy 2: Deduplication table
CREATE TABLE ProcessedMessages (MessageId UUID PRIMARY KEY, ProcessedAt TIMESTAMP)
Before processing: INSERT ... ON CONFLICT DO NOTHING RETURNING MessageId
If 0 rows returned: message already processed, skip
If 1 row returned: first time, process and commit in the same transaction
Strategy 3: Idempotency key (for API calls)
Client sends Idempotency-Key header
Server stores result by key for 24h
Duplicate request → return cached result without re-executing
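The deduplication-table strategy can be illustrated without a database. In the sketch below a `ConcurrentDictionary` stands in for the ProcessedMessages table, and `TryAdd` plays the role of `INSERT ... ON CONFLICT DO NOTHING`. `IdempotentConsumer` is a hypothetical name, and removing the marker on failure only approximates the rollback a real transaction would give.

```csharp
using System;
using System.Collections.Concurrent;

// In-memory stand-in for the ProcessedMessages table.
public sealed class IdempotentConsumer
{
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _processed = new();

    // Returns true if the handler ran, false if the message was a duplicate.
    public bool Handle(Guid messageId, Action handler)
    {
        // Only the first caller for a given id wins the insert.
        if (!_processed.TryAdd(messageId, DateTimeOffset.UtcNow))
            return false;

        try
        {
            handler();
            return true;
        }
        catch
        {
            // Mimic a transaction rollback so a redelivery can try again.
            _processed.TryRemove(messageId, out _);
            throw;
        }
    }
}
```

Delivering the same message id twice runs the handler once; a handler that throws leaves no marker behind, so the broker's redelivery is processed normally.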
The key insight: idempotent handlers + deduplication together give effectively-once processing
without any changes to the broker or protocol
Architecture Decisions
AD1: When would you choose a monolith over microservices?
Start with a monolith when:
- Team is small (< 10 engineers): the coordination overhead of microservices exceeds the benefit
- Domain is not yet well understood: premature service boundaries are expensive to undo
- Time-to-market matters: one deployment, one repo, one test suite
- You don't have DevOps maturity: microservices typically need container orchestration (e.g. Kubernetes), distributed tracing, and often a service mesh
Move to microservices when:
- Different services have genuinely different scaling needs (payment vs. catalogue)
- Different teams own different services and need independent deploys
- A monolith is causing deployment bottlenecks (everyone blocked by one release)
- You have a clear bounded context that could be extracted without cross-cutting concerns
The trap: the "distributed monolith", i.e. microservices that are still tightly coupled
at the database or API level, with all the microservice complexity but none of the benefits
Practical path:
1. Modular monolith: strict module boundaries in one codebase
2. Extract services only when specific bounded contexts have demonstrated need
3. Never split just because it "feels" like microservices
AD2: How would you migrate a large .NET monolith to microservices without downtime?
The Strangler Fig pattern:
Phase 1: introduce a routing layer (YARP or Nginx)
All traffic still goes to the monolith
The proxy can re-route specific paths to new services as they're ready
Phase 2: extract one bounded context at a time
Choose the context with the clearest boundaries and lowest coupling
Build the new service alongside the monolith (not instead of it)
Run both simultaneously, switch traffic via the proxy
Keep a kill switch: traffic can be re-routed back to the monolith immediately
Phase 3: event bridge for shared state
New service and monolith may need to share data during transition
Change Data Capture (CDC) on the monolith DB publishes events
New service consumes events to build its own read model
Phase 4: remove monolith code once the new service is stable
Only delete code after 2+ weeks of stable traffic to the new service
Keep DB tables until all consumers have migrated
Timeline reality:
Each service extraction: 2 to 6 months for a well-scoped bounded context
Full migration of a large system: 2 to 4 years
Most teams stop at 3 to 5 services, and that is fine
AD3: You inherit a .NET system with severe performance problems. How do you approach it?
Never optimise without measuring.
Step 1: Measure (before touching code)
- APM tool: Application Insights, Datadog, or Jaeger traces
- Identify the slowest endpoints by p95/p99 latency
- Identify highest-frequency queries in the database (pg_stat_statements)
- Check memory allocation: dotnet-counters, EventPipe, PerfView
Step 2: Find the actual bottleneck (it's almost always one of these)
- N+1 queries: the most common; fix with EF Core Include or a JOIN
- Missing database index: check query plans (EXPLAIN ANALYZE)
- Sync-over-async blocking the thread pool: calling .Result or .Wait() on tasks
- Memory pressure: large allocations, LOH, GC pauses
- Chatty service calls β N HTTP calls inside a loop
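The N+1 and chatty-call bottlenecks share one shape: a lookup per item inside a loop instead of one batched lookup. The sketch below makes the difference measurable with a fake data source that counts round trips. All names (`FakeCustomerSource`, `OrderEnricher`) are illustrative; a real system would count SQL statements or HTTP calls in its traces instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Customer(int Id, string Name);

// Fake data source that counts how many times it is called.
public sealed class FakeCustomerSource
{
    private readonly Dictionary<int, Customer> _rows;
    public int RoundTrips { get; private set; }

    public FakeCustomerSource(IEnumerable<Customer> rows) => _rows = rows.ToDictionary(c => c.Id);

    public Customer GetById(int id)
    {
        RoundTrips++;
        return _rows[id];
    }

    public IReadOnlyDictionary<int, Customer> GetByIds(IEnumerable<int> ids)
    {
        RoundTrips++;
        return ids.Distinct().ToDictionary(id => id, id => _rows[id]);
    }
}

public static class OrderEnricher
{
    // N+1 shape: one round trip per order.
    public static List<string> NamesOneByOne(FakeCustomerSource source, IReadOnlyList<int> customerIds)
        => customerIds.Select(id => source.GetById(id).Name).ToList();

    // Batched shape: one round trip for the whole page.
    public static List<string> NamesBatched(FakeCustomerSource source, IReadOnlyList<int> customerIds)
    {
        var byId = source.GetByIds(customerIds);
        return customerIds.Select(id => byId[id].Name).ToList();
    }
}
```

Both methods return the same names; for a page of four orders the first makes four round trips and the second makes one, which is the difference that Include, a JOIN, or a batch endpoint buys.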
Step 3: Fix and verify
- Fix one thing at a time
- Re-measure after each fix β never assume a fix helped
- Load test with k6 or NBomber β prove the improvement holds under load
Step 4: Structural improvements (if needed)
- Add caching at the right layer (Redis for shared state, HybridCache for L1+L2)
- Add CQRS read models if the query access pattern differs from the write model
- Move expensive operations to background jobs (async + message queue)
- Consider read replicas if DB reads are the bottleneck
Technical Leadership
TL1: How do you decide when to use a third-party library vs. building in-house?
Default: use the library. Build in-house only when:
- The library does not exist for your specific need
- The library's API is fundamentally incompatible with your model
- Licensing prevents use (GPL in commercial software)
- The library is unmaintained and has unpatched CVEs
Evaluation criteria for any library:
- NuGet download count and GitHub stars (proxy for community support)
- Last commit date: unmaintained = future security debt
- Issue tracker: how are critical bugs handled?
- Breaking change policy: does the maintainer respect semver?
- Does it ship its own transitive dependencies? (DLL hell risk)
For .NET specifically:
ORM: EF Core (Microsoft-backed, a good default for most cases); Dapper (lightweight micro-ORM for hand-written SQL; complements EF Core rather than competing with it)
Validation: FluentValidation (widely used, active)
Messaging: MassTransit (abstracts RabbitMQ/Azure SB), not raw client SDK
Mapping: Mapster or Mapperly (compile-time code generation available) over AutoMapper (convention-based runtime mapping that is harder to trace and debug)
Testing: xUnit + NSubstitute + FluentAssertions (proven combination)
What not to use:
AutoMapper for complex mappings: use explicit mapping methods instead
MediatR for every use case: overkill for simple CRUD, right tool for CQRS
TL2: How do you conduct an architecture review on a teammate's design?
Good architecture reviews are collaborative, not adversarial.
Framework: PRISM
P (Problem): Is the problem statement clear? Does the design solve the right problem?
R (Requirements): Does it meet functional and non-functional requirements?
(Latency, throughput, SLA, data retention, compliance)
I (Interfaces): Are service boundaries clean? Do APIs make sense to consumers?
S (Scalability): What is the bottleneck? How does it behave at 10× load?
M (Maintainability): Can the team operate and evolve this in 2 years?
Questions to always ask:
"What happens when [dependency X] is down?"
"What happens at 10× current load?"
"How does this get deployed with zero downtime?"
"How do we debug a problem in production?"
"What is the rollback plan if this goes wrong?"
What to avoid:
Bike-shedding on naming or style in a design review
Requiring your preferred pattern when the proposed one also works
Making it personal: critique the design, not the designer
TL3: A junior developer on your team is committing large, hard-to-review PRs. How do you address it?
Never criticise someone's process without helping them understand why it matters
and giving them a specific, actionable alternative.
1. First: understand the root cause
- Are they intimidated by multiple small PRs?
- Are they unclear on what "one concern" means for a PR?
- Is the task itself poorly scoped (too large)?
2. Teach the principle, not just the rule
"A PR should answer one question: did this change make the system better in one specific way?
A 2,000-line PR makes that question unanswerable."
3. Pair on breaking it down
Work with them to identify the seams in their next feature:
- Schema migration → PR 1
- Repository layer → PR 2
- Application layer → PR 3
- API endpoint + tests → PR 4
4. Acknowledge the difficulty
"Breaking work into small, always-releasable units is a skill that takes time to develop.
It's one of the hardest things to learn as an engineer."
5. Set clear expectations going forward
"PRs over 400 lines need a decomposition plan in the description.
Let's agree on that as a team standard."
.NET-Specific Deep Dives
DN1: How does the .NET garbage collector work and how do you tune it for a high-throughput API?
.NET GC generations:
Gen 0: short-lived objects; collected frequently, usually fast (often under 1 ms)
Gen 1: objects that survived a Gen 0 collection; medium lifetime
Gen 2: long-lived objects; collected infrequently, and a blocking collection can pause for tens of milliseconds
LOH: Large Object Heap, for objects of 85,000 bytes or more; not compacted by default, so it can fragment
For a high-throughput API:
Goal: minimise Gen 2 and LOH collections
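As one illustration of that goal, a hot path can rent its scratch buffer from `ArrayPool<byte>.Shared` instead of allocating a new array per request. A minimal sketch follows; `NewlineCounter` and the 16 KB buffer size are illustrative choices, not taken from a specific system.

```csharp
using System;
using System.Buffers;
using System.IO;

public static class NewlineCounter
{
    // Counts '\n' bytes in a stream using a pooled buffer, so repeated calls
    // do not allocate a fresh array each time.
    public static long Count(Stream input)
    {
        // Rent may return an array larger than requested; never assume its exact length.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(16 * 1024);
        try
        {
            long count = 0;
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Only inspect the slice that was actually filled.
                foreach (var b in buffer.AsSpan(0, read))
                {
                    if (b == (byte)'\n')
                        count++;
                }
            }
            return count;
        }
        finally
        {
            // Always return the buffer, even on exceptions, or the pool slowly drains.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The two details that matter are slicing to the number of bytes actually read, because a rented array can be larger than requested and may hold stale data, and returning the buffer in a `finally` block.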
Tuning options:
1. Server GC mode (the ASP.NET Core default on multi-core machines)
One heap per logical core, optimised for throughput rather than the shortest possible pauses; background (concurrent) Gen 2 collection is on by default
2. Reduce allocations on hot paths
Use ArrayPool<T>.Shared instead of new T[]
Use MemoryPool<T> for byte buffers
Use Span<T> / ReadOnlySpan<char> for string parsing: slicing does not allocate
Use structs for short-lived data (but avoid boxing them)
3. Avoid large allocations
Objects of 85,000 bytes or more go to the LOH: pre-allocate and reuse buffers
JSON serialisation of large responses: use Utf8JsonWriter over a pooled buffer
4. GC settings in runtimeconfig.json (under configProperties)
"System.GC.HeapHardLimitPercent": 75 -- cap the GC heap at 75% of container memory
"System.GC.ConserveMemory": 5 -- 0-9 scale; higher = more aggressive about keeping the heap small
5. Profile first
dotnet-counters monitor System.Runtime
Watch: gen-0-gc-count, gen-1-gc-count, gen-2-gc-count, loh-size
A spike in gen-2-gc-count together with high latency suggests GC pauses are your bottleneck
DN2: Explain how EF Core's change tracker works and when it causes problems.
EF Core tracks every loaded entity in a Dictionary
DN3: How would you implement a multi-region .NET deployment with data sovereignty?
Data sovereignty: customer data must stay in a specific geography (EU, US, APAC)
Required by: GDPR (EU), data residency contracts, government mandates
Architecture:
Global tier (no PII):
Azure Front Door / Cloudflare: global load balancer
Routes requests to the correct regional deployment based on:
a) User's JWT tenant_id claim (maps to region in tenant store)
b) Subdomain (eu.api.example.com, us.api.example.com)
Regional tier (per region, e.g., EU):
.NET API pods in West Europe AKS cluster
Azure Database for PostgreSQL Flexible Server in West Europe (no geo-replication for PII)
Redis Cache in West Europe
Azure Service Bus in West Europe
Tenant routing service (global, no PII):
Maps tenant_id → region
Cached heavily, small dataset
Implementation in .NET:
1. API receives request
2. JWT decoded → tenant_id extracted
3. TenantStore.GetRegionAsync(tenantId) → "eu" | "us" | "apac"
4. If request landed on wrong region: return 307 redirect to correct region endpoint
5. If correct region: process normally; all data stays local
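The routing steps above can be sketched as a pure decision function that middleware would call after decoding the JWT. This is a minimal sketch under stated assumptions: `RegionRouter`, the `RoutingDecision` records, and the two lookup dictionaries are hypothetical stand-ins for the tenant store and regional endpoint configuration.

```csharp
using System;
using System.Collections.Generic;

public abstract record RoutingDecision;
public sealed record ProcessLocally : RoutingDecision;
public sealed record RedirectTo(Uri Location) : RoutingDecision; // caller answers 307 with this Location
public sealed record UnknownTenant : RoutingDecision;            // caller answers with a client error

public sealed class RegionRouter(
    string localRegion,
    IReadOnlyDictionary<string, string> tenantRegions, // tenant_id -> region (no PII)
    IReadOnlyDictionary<string, Uri> regionEndpoints)  // region -> base URL of that deployment
{
    public RoutingDecision Decide(string tenantId, string pathAndQuery)
    {
        if (!tenantRegions.TryGetValue(tenantId, out var region))
            return new UnknownTenant();

        if (string.Equals(region, localRegion, StringComparison.OrdinalIgnoreCase))
            return new ProcessLocally();

        // The request landed in the wrong region: point the client at the right one.
        return new RedirectTo(new Uri(regionEndpoints[region], pathAndQuery));
    }
}
```

Keeping the decision free of HTTP types makes it straightforward to unit test; the middleware only has to translate `RedirectTo` into a 307 response and let `ProcessLocally` fall through to the next handler.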
Cross-region concerns:
- Analytics/reporting: aggregate anonymised data only, never PII
- Backups: encrypted, stored in same region (no cross-region replication of PII)
- Disaster recovery: within-region replica only
Interview Answer Templates
The "Tell me about a technically complex system you designed" answer structure:
1. Context: What was the business problem? Why did it matter?
2. Constraints: What made it hard? (Scale, latency, team size, timeline, existing system)
3. Options considered: What alternatives did you evaluate? Why did you rule them out?
4. Decision: What did you choose and why? What were the acknowledged trade-offs?
5. Outcome: What was the result? What would you do differently?
Example structure (30 to 60 seconds):
"We needed to process 50,000 webhook deliveries per minute with at-least-once
guarantee and per-tenant isolation. The constraint was that we couldn't change
the database schema because 20 other services depended on it.
I considered three approaches: [list them briefly]. We ruled out approach A because
[specific reason]. Approach B was viable but added [trade-off].
We chose C ([specific design]), accepting [specific trade-off] because [business reason].
The result: [measurable outcome]. If I did it again, I would [specific improvement]."
The "How do you approach a bug you can't reproduce" answer structure:
1. Gather evidence before touching code: logs, traces, metrics, error rates
2. Form a hypothesis based on the evidence; don't guess randomly
3. Instrument the code to gather more evidence if needed
4. Reproduce in a lower environment using production data shape (Testcontainers)
5. Make the smallest possible change that addresses the root cause
6. Add a regression test before deploying
7. Monitor after deploy: confirm the metric changed
Saying "I would add more logging and traces first" scores higher than
"I would reproduce it locally": principal engineers think observability-first.