System Design: Delivery Platform in .NET — Multi-Tenant Routing, SLA Monitoring, and Driver Assignment
System Design: Multi-Tenant Delivery Platform in .NET
Building a delivery platform is deceptively hard. On the surface it looks simple: a customer places an order, a driver picks it up and delivers it. Done. But the moment you add multiple tenants, each with their own restaurant fleet, SLA commitments, driver pools, and throughput characteristics, you have a distributed systems problem disguised as a logistics problem.
This case study walks through the full design of a multi-tenant delivery platform. We cover the order state machine, real-time driver tracking with geospatial data in Redis, route batching across nearby pickups, per-tenant isolation, and two production-grade challenges: tenant starvation and mid-delivery driver dropout. Every design decision is explained, not just described.
System Overview
The platform serves two kinds of tenants: restaurants (like Deliveroo partners) and retailers (like a grocery chain using the platform as white-label infrastructure). Each tenant has:
- A configured driver pool (dedicated drivers, or shared with spill-over)
- An SLA contract (e.g., "95% of orders delivered within 45 minutes")
- Rate limits on the API (to prevent runaway order floods from degrading neighbors)
- A geofenced operating region
The request lifecycle looks like this:
Customer App
     │
     ▼
API Gateway (per-tenant routing, rate limiting)
     │
     ▼
Order Service ──► OrderCreated event ──► Driver Assignment Service
     │                                          │
     ▼                                          ▼
Order DB (EF Core)                    Redis GEO (driver locations)
     │                                          │
     ▼                                          ▼
SLA Monitor Worker                    Route Batching Engine
     │                                          │
     ▼                                          ▼
Escalation Events                     Driver App (push notification)
Each component is deployed as a separate service but shares the same SQL database (per-tenant row isolation with EF Core global query filters). Redis handles all ephemeral, high-frequency data: driver GPS positions, feed queues, and session state.
Data Model
The core entities and their relationships:
// Tenant configuration — loaded at startup and cached
public class Tenant
{
public Guid Id { get; private set; }
public string Name { get; private set; } = default!;
public string Slug { get; private set; } = default!;
public TenantSlaConfig Sla { get; private set; } = default!;
public TenantDriverConfig DriverConfig { get; private set; } = default!;
public int ApiRateLimitPerMinute { get; private set; }
public GeoRegion OperatingRegion { get; private set; } = default!;
}
public record TenantSlaConfig(
int TargetDeliveryMinutes,
int EscalationThresholdPercent, // e.g. 80% of time elapsed → alert
int BreachPenaltyThresholdCount); // breach count before auto-incident
public record TenantDriverConfig(
int DedicatedPoolSize,
bool AllowSpillOver,
int SpillOverMaxDrivers);
// The core aggregate
public class Order
{
public Guid Id { get; private set; }
public Guid TenantId { get; private set; }
public Guid CustomerId { get; private set; }
public Guid? AssignedDriverId { get; private set; }
public OrderStatus Status { get; private set; }
public DateTimeOffset CreatedAt { get; private set; }
public DateTimeOffset? SlaDeadline { get; private set; }
public DateTimeOffset? PickedUpAt { get; private set; }
public DateTimeOffset? DeliveredAt { get; private set; }
public int ReassignmentCount { get; private set; }
public List<OrderItem> Items { get; private set; } = [];
public DeliveryAddress DropoffAddress { get; private set; } = default!;
public PickupAddress RestaurantAddress { get; private set; } = default!;
private readonly List<DomainEvent> _events = [];
public IReadOnlyList<DomainEvent> Events => _events.AsReadOnly();
// factory — SLA deadline set at creation time
public static Order Create(
Guid tenantId,
Guid customerId,
List<OrderItem> items,
PickupAddress pickup,
DeliveryAddress dropoff,
TenantSlaConfig sla)
{
var order = new Order
{
Id = Guid.NewGuid(),
TenantId = tenantId,
CustomerId = customerId,
Items = items,
RestaurantAddress = pickup,
DropoffAddress = dropoff,
Status = OrderStatus.Created,
CreatedAt = DateTimeOffset.UtcNow
};
order.SlaDeadline = order.CreatedAt.AddMinutes(sla.TargetDeliveryMinutes);
order._events.Add(new OrderCreatedEvent(order.Id, tenantId, order.SlaDeadline.Value));
return order;
}
}
public record DeliveryAddress(string Street, string City, double Latitude, double Longitude);
public record PickupAddress(Guid RestaurantId, string Street, double Latitude, double Longitude);
public enum OrderStatus
{
Created,
Confirmed,
PickupAssigned,
PickedUp,
InTransit,
Delivered,
Failed,
Cancelled
}
The Driver entity is lightweight because the GPS truth lives in Redis, not SQL:
public class Driver
{
public Guid Id { get; private set; }
public Guid TenantId { get; private set; } // home tenant
public string Name { get; private set; } = default!;
public bool IsActive { get; private set; }
public DriverCapacity Capacity { get; private set; } = default!;
}
public record DriverCapacity(int MaxConcurrentOrders, VehicleType VehicleType);
public enum VehicleType { Bicycle, Motorcycle, Car, Van }
Key Design Decisions
1. Shared Database with Global Query Filters Over Schema-Per-Tenant
We evaluated three multi-tenancy models:
| Model | Isolation | Cost | Complexity |
|---|---|---|---|
| Database-per-tenant | Highest | Very high | Very high |
| Schema-per-tenant | High | Medium | High |
| Shared DB + row filter | Medium | Low | Low |
For this platform, tenants share infrastructure but have no need to see each other's data at all. Schema-per-tenant is operationally painful — running 200 tenants means 200 migration targets. We chose shared DB with EF Core global query filters. The TenantId column appears on every relevant entity, and the filter ensures it is always scoped to the current tenant context.
The ICurrentTenantAccessor is resolved from the HTTP context (extracted from the JWT claim) and injected into DbContext:
public class DeliveryDbContext : DbContext
{
private readonly ICurrentTenantAccessor _tenantAccessor;
public DeliveryDbContext(
DbContextOptions<DeliveryDbContext> options,
ICurrentTenantAccessor tenantAccessor) : base(options)
{
_tenantAccessor = tenantAccessor;
}
public DbSet<Order> Orders => Set<Order>();
public DbSet<Driver> Drivers => Set<Driver>();
public DbSet<Tenant> Tenants => Set<Tenant>();
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
// Global filter applied automatically to every query
modelBuilder.Entity<Order>()
.HasQueryFilter(o => o.TenantId == _tenantAccessor.TenantId);
modelBuilder.Entity<Driver>()
.HasQueryFilter(d => d.TenantId == _tenantAccessor.TenantId);
// Composite index for the most common query: active orders by tenant
modelBuilder.Entity<Order>()
.HasIndex(o => new { o.TenantId, o.Status, o.CreatedAt })
.HasDatabaseName("IX_Orders_Tenant_Status_Created");
// Filtered index for in-flight orders only (Status IN (2, 3, 4))
modelBuilder.Entity<Order>()
.HasIndex(o => new { o.TenantId, o.AssignedDriverId })
.HasDatabaseName("IX_Orders_Tenant_Driver")
.HasFilter("[Status] IN (2, 3, 4)"); // PickupAssigned, PickedUp, InTransit
}
}
// Accessor resolves tenant from ambient context
public interface ICurrentTenantAccessor
{
Guid TenantId { get; }
}
public class HttpContextTenantAccessor : ICurrentTenantAccessor
{
private readonly IHttpContextAccessor _httpContextAccessor;
public HttpContextTenantAccessor(IHttpContextAccessor httpContextAccessor)
{
_httpContextAccessor = httpContextAccessor;
}
public Guid TenantId
{
get
{
var claim = _httpContextAccessor.HttpContext?.User
.FindFirst("tenant_id")?.Value;
if (claim is null || !Guid.TryParse(claim, out var id))
throw new UnauthorizedAccessException("No tenant context on request.");
return id;
}
}
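}
The filtered index (SQL Server's term for a partial index) on in-flight orders is important. The driver assignment query runs on every incoming order and only needs rows with active statuses. Without the filter, at 10M orders an index scan would still touch millions of delivered rows. Note that the literals 2, 3, and 4 depend on the declaration order of OrderStatus, so reordering the enum would silently change which rows the index covers.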
2. Order State Machine as a Closed Aggregate
The order lifecycle has 8 states. Any invalid transition must throw — not silently succeed. We model this directly in the aggregate rather than using a library like Stateless, because we want domain events fired on each transition and the logic is simple enough to own explicitly.
public class Order
{
// ... fields from data model above ...
public void Confirm()
{
EnsureStatus(OrderStatus.Created);
Status = OrderStatus.Confirmed;
_events.Add(new OrderConfirmedEvent(Id, TenantId));
}
public void AssignDriver(Guid driverId)
{
EnsureStatus(OrderStatus.Confirmed);
AssignedDriverId = driverId;
Status = OrderStatus.PickupAssigned;
_events.Add(new DriverAssignedEvent(Id, TenantId, driverId));
}
public void MarkPickedUp()
{
EnsureStatus(OrderStatus.PickupAssigned);
PickedUpAt = DateTimeOffset.UtcNow;
Status = OrderStatus.PickedUp;
_events.Add(new OrderPickedUpEvent(Id, TenantId, AssignedDriverId!.Value));
}
public void MarkInTransit()
{
EnsureStatus(OrderStatus.PickedUp);
Status = OrderStatus.InTransit;
}
public void MarkDelivered()
{
EnsureStatus(OrderStatus.InTransit);
DeliveredAt = DateTimeOffset.UtcNow;
Status = OrderStatus.Delivered;
_events.Add(new OrderDeliveredEvent(Id, TenantId, DeliveredAt.Value, SlaDeadline!.Value));
}
public void Cancel(string reason)
{
if (Status is OrderStatus.Delivered or OrderStatus.Failed or OrderStatus.Cancelled)
throw new InvalidOrderTransitionException(Id, Status, OrderStatus.Cancelled);
Status = OrderStatus.Cancelled;
_events.Add(new OrderCancelledEvent(Id, TenantId, reason));
}
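public void MarkFailed(string reason)
{
// Terminal states cannot fail; without this guard a delivered order could be marked Failed
if (Status is OrderStatus.Delivered or OrderStatus.Failed or OrderStatus.Cancelled)
throw new InvalidOrderTransitionException(Id, Status, OrderStatus.Failed);
Status = OrderStatus.Failed;
// A failure is not a reassignment, so the counter is reported but not incremented
_events.Add(new OrderFailedEvent(Id, TenantId, reason, ReassignmentCount));
}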
public void Reassign(Guid newDriverId)
{
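// InTransit is included because a driver can drop out after pickup as well as before it
if (Status is not (OrderStatus.PickupAssigned or OrderStatus.PickedUp or OrderStatus.InTransit))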
throw new InvalidOrderTransitionException(Id, Status, OrderStatus.PickupAssigned);
var previousDriver = AssignedDriverId;
AssignedDriverId = newDriverId;
ReassignmentCount++;
Status = OrderStatus.PickupAssigned;
_events.Add(new OrderReassignedEvent(Id, TenantId, previousDriver, newDriverId));
}
private void EnsureStatus(OrderStatus expected)
{
if (Status != expected)
throw new InvalidOrderTransitionException(Id, Status, expected);
}
}
public class InvalidOrderTransitionException : Exception
{
public InvalidOrderTransitionException(Guid orderId, OrderStatus current, OrderStatus attempted)
: base($"Order {orderId}: cannot transition from {current} to {attempted}.")
{
}
}
Why not Stateless or another library? The state machine here has business logic in each transition (setting timestamps, incrementing counters, firing domain events). A library buys you a nice DSL but you still need to attach side effects to each trigger. The explicit if-guard pattern is more readable for the team and avoids a dependency. If the state count grew past 15 or the transitions became non-trivial (guards with multiple conditions), the library would justify itself.
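If the guards ever outgrow per-method checks, the same rules can be written down once as a transition table without taking on a library. The following is a minimal sketch rather than the platform's code: OrderTransitions and CanTransition are hypothetical names, the happy path and the cancellation rule are taken from the methods above, allowing Failed from every non-terminal state is an assumption, and reassignment is left out for brevity.

```csharp
using System;
using System.Collections.Generic;

public enum OrderStatus
{
    Created, Confirmed, PickupAssigned, PickedUp, InTransit, Delivered, Failed, Cancelled
}

// Hypothetical helper: one table that lists every legal move.
public static class OrderTransitions
{
    private static readonly Dictionary<OrderStatus, OrderStatus[]> Allowed = new()
    {
        [OrderStatus.Created]        = new[] { OrderStatus.Confirmed, OrderStatus.Cancelled, OrderStatus.Failed },
        [OrderStatus.Confirmed]      = new[] { OrderStatus.PickupAssigned, OrderStatus.Cancelled, OrderStatus.Failed },
        [OrderStatus.PickupAssigned] = new[] { OrderStatus.PickedUp, OrderStatus.Cancelled, OrderStatus.Failed },
        [OrderStatus.PickedUp]       = new[] { OrderStatus.InTransit, OrderStatus.Cancelled, OrderStatus.Failed },
        [OrderStatus.InTransit]      = new[] { OrderStatus.Delivered, OrderStatus.Cancelled, OrderStatus.Failed },
        // Terminal states: nothing leaves them.
        [OrderStatus.Delivered]      = Array.Empty<OrderStatus>(),
        [OrderStatus.Failed]         = Array.Empty<OrderStatus>(),
        [OrderStatus.Cancelled]      = Array.Empty<OrderStatus>(),
    };

    // True when the table lists `to` as a legal successor of `from`.
    public static bool CanTransition(OrderStatus from, OrderStatus to) =>
        Array.IndexOf(Allowed[from], to) >= 0;
}
```

A single EnsureCanTransition helper in the aggregate could consult this table, while each transition method keeps its own side effects (timestamps, events).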
3. Per-Tenant Rate Limiting at the API Gateway
Each tenant has an ApiRateLimitPerMinute configured. We use ASP.NET Core's built-in rate limiting middleware (introduced in .NET 7) with a fixed-window policy per tenant:
builder.Services.AddRateLimiter(options =>
{
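options.AddPolicy("per-tenant", httpContext =>
{
var claim = httpContext.User.FindFirst("tenant_id")?.Value;
// Requests without a valid tenant claim share one small bucket instead of
// crashing in Guid.Parse (10 per minute is an arbitrary placeholder)
if (!Guid.TryParse(claim, out var tenantId))
{
return RateLimitPartition.GetFixedWindowLimiter(
partitionKey: "anonymous",
factory: _ => new FixedWindowRateLimiterOptions
{
PermitLimit = 10,
Window = TimeSpan.FromMinutes(1)
});
}
// Read the cached tenant config up front; the factory below runs only
// when the limiter for this partition is first created
var tenantConfig = httpContext.RequestServices
.GetRequiredService<ITenantConfigCache>()
.GetConfig(tenantId);
return RateLimitPartition.GetFixedWindowLimiter(
partitionKey: tenantId.ToString(),
factory: _ => new FixedWindowRateLimiterOptions
{
PermitLimit = tenantConfig.ApiRateLimitPerMinute,
Window = TimeSpan.FromMinutes(1),
QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
QueueLimit = 20
});
});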
options.OnRejected = async (context, cancellationToken) =>
{
context.HttpContext.Response.StatusCode = StatusCodes.Status429TooManyRequests;
context.HttpContext.Response.Headers.RetryAfter = "60";
await context.HttpContext.Response.WriteAsJsonAsync(
new ProblemDetails
{
Title = "Rate limit exceeded",
Status = 429,
Detail = "Your tenant has exceeded its configured request rate. Retry after 60 seconds."
}, cancellationToken);
};
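});
The ITenantConfigCache is an in-memory cache populated at startup and refreshed every 5 minutes. We explicitly do not hit the database on every rate-limit evaluation, because that would add latency to the hot path. Note that a partition's limiter keeps the options it was created with, so a refreshed limit takes effect only once that limiter is recreated.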
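The cache implementation itself is not shown. One way to get lock-free reads with periodic refresh is to build a fresh dictionary and swap the reference. This is a sketch under assumptions: SnapshotTenantConfigCache, the trimmed-down TenantConfig record, and the load delegate are all hypothetical names, not the platform's ITenantConfigCache.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for the tenant configuration.
public record TenantConfig(Guid TenantId, int ApiRateLimitPerMinute);

public sealed class SnapshotTenantConfigCache
{
    private readonly Func<IReadOnlyList<TenantConfig>> _load;

    // Readers always see one complete dictionary; Refresh swaps the reference.
    private volatile IReadOnlyDictionary<Guid, TenantConfig> _snapshot =
        new Dictionary<Guid, TenantConfig>();

    public SnapshotTenantConfigCache(Func<IReadOnlyList<TenantConfig>> load)
    {
        _load = load;
        Refresh(); // populate at startup
    }

    // Intended to be called from a timer, for example every 5 minutes.
    public void Refresh()
    {
        var next = new Dictionary<Guid, TenantConfig>();
        foreach (var config in _load())
            next[config.TenantId] = config;
        _snapshot = next;
    }

    // Lookup never touches the data source, so it is safe on the hot path.
    public bool TryGetConfig(Guid tenantId, out TenantConfig? config) =>
        _snapshot.TryGetValue(tenantId, out config);
}
```

Because a refresh replaces the whole snapshot, a request never observes a half-updated set of tenants; the cost is that changes become visible only after the next refresh.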
4. Driver Location Tracking: Redis GEOADD with Per-Tenant Namespacing
Driver GPS updates arrive every 5 seconds from the driver mobile app. At 500 active drivers per tenant and 10 tenants, that is 1,000 location updates per second (5,000 drivers ÷ 5 s), which is trivial for Redis but expensive for any relational DB.
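The load estimate is easy to re-run for other fleet sizes. A small sketch (HeartbeatLoad is a hypothetical helper name):

```csharp
public static class HeartbeatLoad
{
    // Location updates per second = total drivers / seconds between heartbeats.
    public static double UpdatesPerSecond(int driversPerTenant, int tenants, double heartbeatSeconds) =>
        driversPerTenant * tenants / heartbeatSeconds;
}
```

With the figures above, UpdatesPerSecond(500, 10, 5) gives 1,000. The update method shown in this section issues two Redis commands per heartbeat (one geo write and one sorted-set write), so the command rate is twice the update rate.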
We use Redis GEOADD to store driver positions. The key is namespaced per tenant to enforce isolation: drivers:{tenantId}. This means a geospatial query for one tenant never touches another tenant's driver data:
public class TenantDriverPool
{
private readonly IConnectionMultiplexer _redis;
public TenantDriverPool(IConnectionMultiplexer redis)
{
_redis = redis;
}
private static string GeoKey(Guid tenantId) => $"drivers:{tenantId}";
private static string ActiveKey(Guid tenantId) => $"drivers:active:{tenantId}";
// Called every 5 seconds by driver heartbeat
public async Task UpdateLocationAsync(
Guid tenantId,
Guid driverId,
double latitude,
double longitude,
CancellationToken ct = default)
{
var db = _redis.GetDatabase();
var batch = db.CreateBatch();
// Store GPS position
var geoTask = batch.GeoAddAsync(
GeoKey(tenantId),
new GeoEntry(longitude, latitude, driverId.ToString()));
// Store an expiry timestamp 30 seconds ahead as the driver's score.
// Redis does not delete the entry; stale scores are filtered out at query time.
var activeTask = batch.SortedSetAddAsync(
ActiveKey(tenantId),
driverId.ToString(),
DateTimeOffset.UtcNow.AddSeconds(30).ToUnixTimeSeconds());
batch.Execute();
await Task.WhenAll(geoTask, activeTask);
}
// Find available drivers within radius, sorted by distance
public async Task<List<NearbyDriver>> FindNearbyDriversAsync(
Guid tenantId,
double latitude,
double longitude,
double radiusKm,
int maxResults = 10)
{
var db = _redis.GetDatabase();
var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
// Get active driver IDs (heartbeat within last 30s)
var activeDriverIds = await db.SortedSetRangeByScoreAsync(
ActiveKey(tenantId),
start: now,
stop: double.MaxValue);
if (activeDriverIds.Length == 0)
return [];
// Geo radius search
var geoResults = await db.GeoRadiusAsync(
GeoKey(tenantId),
longitude, latitude,
radiusKm,
GeoUnit.Kilometers,
maxResults,
StackExchange.Redis.Order.Ascending, // fully qualified: the domain Order class shares the name
GeoRadiusOptions.WithCoordinates | GeoRadiusOptions.WithDistance);
var activeSet = activeDriverIds.Select(x => x.ToString()).ToHashSet();
return geoResults
.Where(r => activeSet.Contains(r.Member.ToString()))
.Select(r => new NearbyDriver(
DriverId: Guid.Parse(r.Member.ToString()),
DistanceKm: r.Distance ?? 0,
Latitude: r.Position?.Latitude ?? 0,
Longitude: r.Position?.Longitude ?? 0))
.ToList();
}
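public async Task RemoveDriverAsync(Guid tenantId, Guid driverId)
{
var db = _redis.GetDatabase();
var batch = db.CreateBatch();
var geoTask = batch.GeoRemoveAsync(GeoKey(tenantId), driverId.ToString());
var activeTask = batch.SortedSetRemoveAsync(ActiveKey(tenantId), driverId.ToString());
batch.Execute();
// Await both commands so failures surface instead of being silently dropped
await Task.WhenAll(geoTask, activeTask);
}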
}
public record NearbyDriver(Guid DriverId, double DistanceKm, double Latitude, double Longitude);
The active set is a ZSET where the score is the expiry timestamp. We cannot put a TTL on individual drivers because Redis EXPIRE applies to a whole key, and every driver of a tenant lives in the same geo key. Instead the active set acts as a filter layer: we query geo results, then intersect them with the active ZSET. Drivers who stop sending heartbeats drop out of query results within 30 seconds because the range query only returns members whose score is at or after the current time. Redis does not delete those expired members on its own, and the code above does not prune them either, so a periodic cleanup of both keys is still needed to keep them from growing.
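The expiry-as-score pattern is easier to reason about away from Redis. Below is an in-memory model of it, offered as a sketch: ExpiryScoredSet and its method names are hypothetical, and the comments name the Redis commands each method stands in for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory model of a sorted set whose score is the member's expiry time.
public sealed class ExpiryScoredSet
{
    private readonly Dictionary<string, long> _expiryByMember = new();

    public int Count => _expiryByMember.Count;

    // Stands in for ZADD: sets or refreshes the member's expiry score.
    public void Touch(string member, long nowUnixSeconds, int ttlSeconds = 30) =>
        _expiryByMember[member] = nowUnixSeconds + ttlSeconds;

    // Stands in for ZRANGEBYSCORE key now +inf: members that have not expired yet.
    public IReadOnlyList<string> Active(long nowUnixSeconds) =>
        _expiryByMember.Where(kv => kv.Value >= nowUnixSeconds)
                       .Select(kv => kv.Key)
                       .OrderBy(m => m, StringComparer.Ordinal)
                       .ToList();

    // Removes expired members and returns them, so the caller can delete
    // the same members from the geo set as well.
    public IReadOnlyList<string> Prune(long nowUnixSeconds)
    {
        var expired = _expiryByMember.Where(kv => kv.Value < nowUnixSeconds)
                                     .Select(kv => kv.Key)
                                     .ToList();
        foreach (var member in expired)
            _expiryByMember.Remove(member);
        return expired;
    }
}
```

Against Redis, the prune step would read the expired members with ZRANGEBYSCORE, delete them with ZREMRANGEBYSCORE, and remove the same members from the geo key with ZREM (a geo key is a sorted set underneath). Pruning also keeps stale drivers from occupying slots in the count-limited radius query.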
5. Route Batching: Multiple Orders, One Driver Trip
Batching reduces total distance driven and improves delivery economics. The algorithm groups nearby pickup addresses and assigns them to a single driver if the detour cost is within a threshold.
public class RouteBatchingEngine
{
private readonly TenantDriverPool _driverPool;
private const double MaxBatchDetourKm = 0.8; // max extra km per order added
private const int MaxOrdersPerBatch = 3;
public RouteBatchingEngine(TenantDriverPool driverPool)
{
_driverPool = driverPool;
}
public async Task<List<OrderBatch>> BuildBatchesAsync(
Guid tenantId,
List<PendingOrder> pendingOrders)
{
// Group orders whose pickups lie within 0.5 km of the cluster's first order
var clusters = ClusterByRestaurantProximity(pendingOrders, radiusKm: 0.5);
var batches = new List<OrderBatch>();
var usedDrivers = new HashSet<Guid>(); // drivers already given a batch in this pass
foreach (var cluster in clusters)
{
// Limit batch size: too many pickups and the delivery window blows out.
// Chunk splits large clusters so orders beyond the cap are not dropped.
foreach (var chunk in cluster.Chunk(MaxOrdersPerBatch))
{
var batchOrders = chunk.ToList();
var centroid = ComputeCentroid(batchOrders);
var drivers = await _driverPool.FindNearbyDriversAsync(
tenantId,
centroid.Latitude,
centroid.Longitude,
radiusKm: 3.0,
maxResults: 5);
// Results are sorted by distance; skip drivers that already have a batch
var bestDriver = drivers.FirstOrDefault(d => !usedDrivers.Contains(d.DriverId));
if (bestDriver is null)
{
// No free nearby driver: fall back to individual assignment
batches.AddRange(batchOrders.Select(o =>
new OrderBatch([o], AssignedDriver: null)));
continue;
}
usedDrivers.Add(bestDriver.DriverId);
batches.Add(new OrderBatch(batchOrders, bestDriver));
}
}
return batches;
}
private List<List<PendingOrder>> ClusterByRestaurantProximity(
List<PendingOrder> orders,
double radiusKm)
{
var clusters = new List<List<PendingOrder>>();
var assigned = new HashSet<Guid>();
foreach (var order in orders)
{
if (assigned.Contains(order.OrderId)) continue;
var cluster = orders
.Where(o => !assigned.Contains(o.OrderId)
&& HaversineKm(
order.PickupLat, order.PickupLon,
o.PickupLat, o.PickupLon) <= radiusKm)
.ToList();
foreach (var o in cluster) assigned.Add(o.OrderId);
clusters.Add(cluster);
}
return clusters;
}
private static (double Latitude, double Longitude) ComputeCentroid(
List<PendingOrder> orders)
{
return (
orders.Average(o => o.PickupLat),
orders.Average(o => o.PickupLon));
}
private static double HaversineKm(double lat1, double lon1, double lat2, double lon2)
{
const double R = 6371.0;
var dLat = (lat2 - lat1) * Math.PI / 180;
var dLon = (lon2 - lon1) * Math.PI / 180;
var a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2)
+ Math.Cos(lat1 * Math.PI / 180) * Math.Cos(lat2 * Math.PI / 180)
* Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
return 2 * R * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
}
}
public record PendingOrder(Guid OrderId, double PickupLat, double PickupLon, DateTimeOffset SlaDeadline);
public record OrderBatch(List<PendingOrder> Orders, NearbyDriver? AssignedDriver);
We deliberately cap batches at 3 orders. The math: a 3-order pickup route adds at most 1.6 km of detour (2 × 0.8 km threshold). At 25 km/h average speed, that is under 4 minutes of extra time. For a 45-minute SLA with a 30-minute in-hand target, this is acceptable. Four or more orders in a batch would exceed SLA on the last delivery too often. (The code above declares MaxBatchDetourKm but enforces only the 0.5 km clustering radius; the detour check itself is not shown.)
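The detour arithmetic can be captured in a small helper so the cap can be re-evaluated when the threshold or the assumed speed changes. A sketch, with BatchDetour as a hypothetical name:

```csharp
using System;

public static class BatchDetour
{
    // Worst case: every order after the first adds the full detour allowance.
    public static double WorstCaseExtraMinutes(
        int ordersInBatch, double maxDetourKmPerOrder, double averageSpeedKmh)
    {
        var extraKm = Math.Max(0, ordersInBatch - 1) * maxDetourKmPerOrder;
        return extraKm / averageSpeedKmh * 60.0;
    }
}
```

For 3 orders at 0.8 km and 25 km/h this gives 1.6 km and about 3.84 minutes; a fourth order raises the worst case to 2.4 km and about 5.76 minutes.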
6. SLA Monitoring Worker
The SLA worker runs as a BackgroundService using PeriodicTimer. Every 30 seconds it loads all in-flight orders and fires escalation events for those approaching or past their deadline:
public class SlaMonitorWorker : BackgroundService
{
private readonly IServiceScopeFactory _scopeFactory;
private readonly IMessageBus _bus;
private readonly ILogger<SlaMonitorWorker> _logger;
public SlaMonitorWorker(
IServiceScopeFactory scopeFactory,
IMessageBus bus,
ILogger<SlaMonitorWorker> logger)
{
_scopeFactory = scopeFactory;
_bus = bus;
_logger = logger;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
using var timer = new PeriodicTimer(TimeSpan.FromSeconds(30));
while (await timer.WaitForNextTickAsync(stoppingToken))
{
try
{
await ScanForAtRiskOrdersAsync(stoppingToken);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
_logger.LogError(ex, "SLA monitor scan failed.");
}
}
}
private async Task ScanForAtRiskOrdersAsync(CancellationToken ct)
{
await using var scope = _scopeFactory.CreateAsyncScope();
var db = scope.ServiceProvider.GetRequiredService<DeliveryDbContext>();
var tenantCache = scope.ServiceProvider.GetRequiredService<ITenantConfigCache>();
var now = DateTimeOffset.UtcNow;
var activeStatuses = new[]
{
OrderStatus.Confirmed,
OrderStatus.PickupAssigned,
OrderStatus.PickedUp,
OrderStatus.InTransit
};
// Note: the global query filter would throw here, because the worker has no
// HTTP context to resolve a tenant from. We opt out and query across all
// tenants intentionally, scoping by status only.
var atRiskOrders = await db.Orders
.AsNoTracking()
.IgnoreQueryFilters() // worker queries across all tenants
.Where(o => activeStatuses.Contains(o.Status)
&& o.SlaDeadline.HasValue)
.Select(o => new
{
o.Id,
o.TenantId,
o.Status,
o.SlaDeadline,
o.AssignedDriverId
})
.ToListAsync(ct);
foreach (var order in atRiskOrders)
{
var tenantConfig = tenantCache.GetConfig(order.TenantId);
var totalMinutes = tenantConfig.Sla.TargetDeliveryMinutes;
var minutesRemaining = (order.SlaDeadline!.Value - now).TotalMinutes;
var percentElapsed = 1.0 - (minutesRemaining / totalMinutes);
var threshold = tenantConfig.Sla.EscalationThresholdPercent / 100.0;
if (minutesRemaining < 0)
{
await _bus.PublishAsync(new SlaBreachedEvent(
order.Id, order.TenantId, order.Status, now), ct);
}
else if (percentElapsed >= threshold)
{
await _bus.PublishAsync(new SlaAtRiskEvent(
order.Id, order.TenantId, minutesRemaining, order.AssignedDriverId), ct);
}
}
}
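}
The critical detail is IgnoreQueryFilters(). The worker has no HTTP context so ICurrentTenantAccessor would throw. We explicitly opt out of the filter and take responsibility for querying all tenants. This is the one place where cross-tenant access is intentional and safe. Because every scan re-evaluates all in-flight orders, the same order raises a new event every 30 seconds until it leaves an active status, so consumers need to deduplicate by order ID.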
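Other background code that should stay tenant-scoped (a message handler working on one tenant's order, for instance) has the same problem: there is no HTTP context to read the tenant from. One possible approach, sketched here under assumptions, is an accessor backed by AsyncLocal. AmbientTenantAccessor and Begin are hypothetical names; only the ICurrentTenantAccessor interface comes from this lesson.

```csharp
using System;
using System.Threading;

public interface ICurrentTenantAccessor
{
    Guid TenantId { get; }
}

// Hypothetical accessor for code that runs outside an HTTP request.
public sealed class AmbientTenantAccessor : ICurrentTenantAccessor
{
    // AsyncLocal flows with the async call chain, so awaits keep the tenant.
    private static readonly AsyncLocal<Guid?> Current = new();

    public Guid TenantId =>
        Current.Value ?? throw new InvalidOperationException("No ambient tenant set.");

    // Sets the tenant for the current async flow; dispose to restore the previous one.
    public static IDisposable Begin(Guid tenantId)
    {
        var previous = Current.Value;
        Current.Value = tenantId;
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly Guid? _previous;
        public Scope(Guid? previous) => _previous = previous;
        public void Dispose() => Current.Value = _previous;
    }
}
```

A handler would wrap its work in using (AmbientTenantAccessor.Begin(message.TenantId)) { ... } so the global query filter applies as usual. The SLA scan remains the exception, since it deliberately reads every tenant.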
Challenges and How We Solved Them
Challenge 1: Tenant A's High Volume Starving Tenant B's Drivers
The problem. Tenant A (a large grocery chain) sends 400 orders/hour during peak. Tenant B (a small restaurant) sends 20 orders/hour. Both share the same unpartitioned driver pool. The assignment algorithm, working first-come-first-served, routes all 400 of Tenant A's orders first. Tenant B's 20 orders wait in the queue and breach SLA.
Root cause. No per-tenant driver partitioning. All drivers are in a single Redis GEO key drivers:all and the FIFO assignment queue does not respect tenant priority or fairness.
Solution: per-tenant driver pools with configurable spill-over.
Each tenant's TenantDriverConfig specifies:
- DedicatedPoolSize: how many drivers are reserved exclusively for this tenant
- AllowSpillOver: whether the tenant can borrow from a shared overflow pool
- SpillOverMaxDrivers: cap on borrowed drivers at any moment
We implement a PartitionedDriverAssignmentService that first tries the tenant's dedicated pool, then spills over to shared:
public class PartitionedDriverAssignmentService
{
private readonly TenantDriverPool _driverPool;
private readonly IConnectionMultiplexer _redis;
private readonly ITenantConfigCache _tenantConfig;
// Shared overflow pool key — drivers who opt in to covering any tenant
private const string SharedOverflowGeoKey = "drivers:shared-overflow";
private const string SharedOverflowActiveKey = "drivers:shared-overflow:active";
public PartitionedDriverAssignmentService(
TenantDriverPool driverPool,
IConnectionMultiplexer redis,
ITenantConfigCache tenantConfig)
{
_driverPool = driverPool;
_redis = redis;
_tenantConfig = tenantConfig;
}
public async Task<NearbyDriver?> AssignDriverAsync(
Guid tenantId,
double pickupLat,
double pickupLon,
CancellationToken ct = default)
{
var config = _tenantConfig.GetConfig(tenantId);
// 1. Try dedicated tenant pool first
var dedicated = await _driverPool.FindNearbyDriversAsync(
tenantId, pickupLat, pickupLon, radiusKm: 5.0, maxResults: 3);
if (dedicated.Count > 0)
return dedicated.First();
// 2. No dedicated driver available — try shared overflow if allowed
if (!config.DriverConfig.AllowSpillOver)
return null;
var currentSpillOver = await GetCurrentSpillOverCountAsync(tenantId);
if (currentSpillOver >= config.DriverConfig.SpillOverMaxDrivers)
{
// At cap — cannot borrow more shared drivers
return null;
}
var shared = await FindInSharedOverflowAsync(pickupLat, pickupLon, radiusKm: 5.0);
if (shared is null) return null;
// Track this spill-over assignment
await RecordSpillOverAsync(tenantId, shared.DriverId);
return shared;
}
private async Task<int> GetCurrentSpillOverCountAsync(Guid tenantId)
{
var db = _redis.GetDatabase();
var key = $"spill-over:count:{tenantId}";
var val = await db.StringGetAsync(key);
return val.TryParse(out int count) ? count : 0;
}
private async Task RecordSpillOverAsync(Guid tenantId, Guid driverId)
{
var db = _redis.GetDatabase();
var countKey = $"spill-over:count:{tenantId}";
var memberKey = $"spill-over:drivers:{tenantId}";
await db.StringIncrementAsync(countKey);
await db.SetAddAsync(memberKey, driverId.ToString());
// Auto-expire spill-over tracking after 2 hours
await db.KeyExpireAsync(countKey, TimeSpan.FromHours(2));
await db.KeyExpireAsync(memberKey, TimeSpan.FromHours(2));
}
private async Task<NearbyDriver?> FindInSharedOverflowAsync(
double lat, double lon, double radiusKm)
{
var db = _redis.GetDatabase();
var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
var activeOverflow = await db.SortedSetRangeByScoreAsync(
SharedOverflowActiveKey, start: now, stop: double.MaxValue);
if (activeOverflow.Length == 0) return null;
var results = await db.GeoRadiusAsync(
SharedOverflowGeoKey, lon, lat, radiusKm,
GeoUnit.Kilometers, 3, StackExchange.Redis.Order.Ascending,
GeoRadiusOptions.WithCoordinates | GeoRadiusOptions.WithDistance);
var activeSet = activeOverflow.Select(x => x.ToString()).ToHashSet();
return results
.Where(r => activeSet.Contains(r.Member.ToString()))
.Select(r => new NearbyDriver(
Guid.Parse(r.Member.ToString()),
r.Distance ?? 0,
// Report the driver's own position, not the pickup point
r.Position?.Latitude ?? 0,
r.Position?.Longitude ?? 0))
.FirstOrDefault();
}
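}
The result: Tenant B always gets first pick of its dedicated pool. Tenant A's volume never touches Tenant B's reserved drivers. When Tenant A's own dedicated pool has no driver in range, it can borrow from the shared overflow pool up to its configured cap. Two caveats in the code above: the cap check and the increment are separate Redis calls, so concurrent assignments can overshoot the cap, and the counter is only cleared by its 2-hour expiry rather than when a borrowed driver finishes a delivery.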
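Reading the spill-over count and incrementing it in separate steps leaves a window in which two concurrent assignments both pass the cap check. A common remedy is to increment first and roll back when the result is over the cap. The sketch below shows that ordering with an in-memory counter; SpillOverQuota is a hypothetical name, and the in-memory field stands in for the Redis counter.

```csharp
using System.Threading;

// In-memory stand-in for the per-tenant spill-over counter.
public sealed class SpillOverQuota
{
    private readonly int _max;
    private int _inUse;

    public SpillOverQuota(int maxDrivers) => _max = maxDrivers;

    public int InUse => Volatile.Read(ref _inUse);

    // Increment first, then check: two concurrent callers can never both
    // take the last slot, because each sees a distinct incremented value.
    public bool TryReserve()
    {
        if (Interlocked.Increment(ref _inUse) <= _max)
            return true;

        Interlocked.Decrement(ref _inUse); // over the cap: give the slot back
        return false;
    }

    // Call when the borrowed driver finishes the delivery.
    public void Release() => Interlocked.Decrement(ref _inUse);
}
```

With Redis the same ordering maps to INCR followed by DECR when the returned value exceeds the cap (StringIncrementAsync and StringDecrementAsync in StackExchange.Redis). Releasing the slot on delivery keeps the count accurate without relying on key expiry.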
Challenge 2: Driver Going Offline Mid-Delivery — Reassignment Saga with Timeout
The problem. A driver picks up an order and their phone dies (or they abandon). The order is in PickedUp or InTransit state with no status update for 10+ minutes. Without a timeout, the order just sits and the customer is never served.
Solution: OrderReassignmentSaga with a timeout trigger.
When a driver is assigned, the saga requests a ReassignmentTimeoutExpired timeout 10 minutes out. NServiceBus timeouts cannot be cancelled once requested, so heartbeats only record when the driver was last seen. When the timeout fires, the saga checks that timestamp and either re-arms the watchdog or starts a reassignment:
public class OrderReassignmentSagaData : ContainSagaData
{
public Guid OrderId { get; set; }
public Guid TenantId { get; set; }
public Guid CurrentDriverId { get; set; }
public int ReassignmentAttempts { get; set; }
public DateTimeOffset LastSeenAt { get; set; }
}
public class OrderReassignmentSaga : Saga<OrderReassignmentSagaData>,
IAmStartedByMessages<DriverAssignedEvent>,
IHandleMessages<DriverHeartbeatReceived>,
IHandleMessages<OrderDeliveredEvent>,
IHandleMessages<OrderCancelledEvent>,
IHandleTimeouts<ReassignmentTimeoutExpired>
{
private const int MaxReassignmentAttempts = 3;
private static readonly TimeSpan Watchdog = TimeSpan.FromMinutes(10);
// Assumption: every correlated message exposes an OrderId property
// (MapSaga syntax from recent NServiceBus versions)
protected override void ConfigureHowToFindSaga(
SagaPropertyMapper<OrderReassignmentSagaData> mapper)
{
mapper.MapSaga(saga => saga.OrderId)
.ToMessage<DriverAssignedEvent>(m => m.OrderId)
.ToMessage<DriverHeartbeatReceived>(m => m.OrderId)
.ToMessage<OrderDeliveredEvent>(m => m.OrderId)
.ToMessage<OrderCancelledEvent>(m => m.OrderId);
}
public async Task Handle(DriverAssignedEvent message, IMessageHandlerContext context)
{
// Assumption: this event is also published after a reassignment, so the
// attempt counter is deliberately not reset here
Data.OrderId = message.OrderId;
Data.TenantId = message.TenantId;
Data.CurrentDriverId = message.DriverId;
Data.LastSeenAt = DateTimeOffset.UtcNow;
// Start a 10-minute watchdog for this driver
await RequestTimeout(context, Watchdog,
new ReassignmentTimeoutExpired(Data.OrderId, Data.CurrentDriverId));
}
public Task Handle(DriverHeartbeatReceived message, IMessageHandlerContext context)
{
// Driver is alive: record it, but do not request a new timeout per heartbeat
if (message.DriverId == Data.CurrentDriverId)
Data.LastSeenAt = DateTimeOffset.UtcNow;
return Task.CompletedTask;
}
public async Task Timeout(ReassignmentTimeoutExpired state, IMessageHandlerContext context)
{
if (state.DriverId != Data.CurrentDriverId)
return; // Stale timeout armed for a previous driver: ignore
var silentFor = DateTimeOffset.UtcNow - Data.LastSeenAt;
if (silentFor < Watchdog)
{
// Driver checked in since this timeout was armed: wait out the rest of the window
await RequestTimeout(context, Watchdog - silentFor, state);
return;
}
if (Data.ReassignmentAttempts >= MaxReassignmentAttempts)
{
await context.Publish(new OrderEscalatedToOpsEvent(Data.OrderId, Data.TenantId,
$"Driver {Data.CurrentDriverId} unresponsive after {MaxReassignmentAttempts} reassignment attempts."));
MarkAsComplete();
return;
}
Data.ReassignmentAttempts++;
await context.Publish(new ReassignDriverCommand(
Data.OrderId, Data.TenantId, Data.CurrentDriverId, Data.ReassignmentAttempts));
// Restart the window. If a new driver is assigned, this timeout becomes stale;
// if nobody is assigned within 10 minutes, it fires again and counts another attempt
Data.LastSeenAt = DateTimeOffset.UtcNow;
await RequestTimeout(context, Watchdog, state);
}
public Task Handle(OrderDeliveredEvent message, IMessageHandlerContext context)
{
MarkAsComplete();
return Task.CompletedTask;
}
public Task Handle(OrderCancelledEvent message, IMessageHandlerContext context)
{
MarkAsComplete();
return Task.CompletedTask;
}
}
public record ReassignmentTimeoutExpired(Guid OrderId, Guid? DriverId);
public record ReassignDriverCommand(Guid OrderId, Guid TenantId, Guid PreviousDriverId, int Attempt);
public record OrderEscalatedToOpsEvent(Guid OrderId, Guid TenantId, string Reason);
After 3 failed reassignment attempts the saga escalates to operations, where a human looks at it. This handles the edge case where no driver in the area is available (e.g., severe weather, area outage). The saga uses NServiceBus timeout infrastructure, which persists the timeout to a durable store, surviving service restarts.
.NET Implementation Patterns
Registering Everything
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddDbContext<DeliveryDbContext>((sp, options) =>
{
options.UseSqlServer(builder.Configuration.GetConnectionString("Delivery"));
});
builder.Services.AddHttpContextAccessor();
builder.Services.AddScoped<ICurrentTenantAccessor, HttpContextTenantAccessor>();
builder.Services.AddSingleton<ITenantConfigCache, TenantConfigCache>();
// Redis
var redisConn = await ConnectionMultiplexer.ConnectAsync(
builder.Configuration.GetConnectionString("Redis")!);
builder.Services.AddSingleton<IConnectionMultiplexer>(redisConn);
builder.Services.AddSingleton<TenantDriverPool>();
builder.Services.AddSingleton<PartitionedDriverAssignmentService>();
builder.Services.AddScoped<RouteBatchingEngine>();
// Background workers
builder.Services.AddHostedService<SlaMonitorWorker>();
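builder.Services.AddRateLimiter(/* ... per-tenant config above ... */);
var app = builder.Build();
// Authentication must run before the rate limiter: the per-tenant policy reads
// the tenant_id claim from HttpContext.User (authentication and authorization
// service registration is omitted here)
app.UseAuthentication();
app.UseAuthorization();
app.UseRateLimiter();
app.MapControllers().RequireRateLimiting("per-tenant");
Minimal API: Placing an Order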
app.MapPost("/orders", async (
CreateOrderRequest request,
ICurrentTenantAccessor tenantAccessor,
ITenantConfigCache tenantConfig,
DeliveryDbContext db,
IMessageBus bus,
CancellationToken ct) =>
{
var tenantId = tenantAccessor.TenantId;
var slaConfig = tenantConfig.GetConfig(tenantId).Sla;
var order = Order.Create(
tenantId,
request.CustomerId,
request.Items.Select(i => new OrderItem(i.MenuItemId, i.Quantity, i.UnitPrice)).ToList(),
new PickupAddress(request.RestaurantId, request.PickupStreet,
request.PickupLat, request.PickupLon),
new DeliveryAddress(request.DropoffStreet, request.DropoffCity,
request.DropoffLat, request.DropoffLon),
slaConfig);
db.Orders.Add(order);
await db.SaveChangesAsync(ct);
foreach (var evt in order.Events)
await bus.PublishAsync(evt, ct);
return Results.Created($"/orders/{order.Id}", new { order.Id, order.SlaDeadline });
})
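.RequireAuthorization()
.RequireRateLimiting("per-tenant"); // the policy on MapControllers does not cover minimal API endpoints
What We'd Do Differently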
1. Separate write and read models from day one. The Order entity in EF Core is the write model. But order status queries (dashboards, driver apps) hit it constantly with joins and projections. We should have introduced a dedicated read-side OrderSummary table (denormalized, updated via domain events) on day one instead of retrofitting it at month 6 when query times degraded.
2. Use a dedicated time-series store for driver GPS history. We use Redis GEO for current position, which is correct. But we log every GPS update to SQL for compliance/audit. At 1,000 writes/second over 30 days that is 2.6 billion rows. A time-series database like TimescaleDB or InfluxDB would be dramatically cheaper to query and maintain.
3. Make SlaDeadline a domain concept, not a calculated field. Currently SlaDeadline is set at creation as CreatedAt + SLA minutes. In production, SLA commitments vary by time of day (peak hours extend the window), by weather, and by order complexity. The deadline should be calculated by a dedicated SlaCalculationService with pluggable rules, not hardcoded in the aggregate factory.
4. Better observability on tenant fairness. We added per-tenant spill-over tracking in Redis but did not expose it to Grafana until a P1 incident. Every tenant should have a live dashboard showing: dedicated pool utilization, current spill-over count, orders at risk, and p95 delivery time. Instrument first, debug later.
5. Idempotency on driver assignment. If the assignment service crashes after writing to the DB but before publishing the DriverAssignedEvent, the saga never starts and the driver is never notified. We should use the transactional outbox pattern here: write the order and the event to the DB in the same transaction, and let a poller publish the event. We added this at month 3 after seeing exactly this failure in production.
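The outbox idea from the last point can be shown without a database or a broker. The following is an in-memory sketch under assumptions: InMemoryStore, OutboxMessage, and OutboxPoller are hypothetical names, and the lock stands in for a database transaction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record OutboxMessage(Guid Id, string Type, string Payload)
{
    public bool Published { get; set; }
}

// In-memory stand-in for a database: both lists change together or not at all.
public sealed class InMemoryStore
{
    private readonly object _gate = new();
    private readonly List<string> _orders = new();
    private readonly List<OutboxMessage> _outbox = new();

    // "Transaction": the order row and its event are written atomically.
    public void SaveOrderWithEvent(string orderId, string eventType)
    {
        lock (_gate)
        {
            _orders.Add(orderId);
            _outbox.Add(new OutboxMessage(Guid.NewGuid(), eventType, orderId));
        }
    }

    public List<OutboxMessage> Unpublished()
    {
        lock (_gate) return _outbox.Where(m => !m.Published).ToList();
    }
}

public static class OutboxPoller
{
    // Publishes pending messages. A message is marked only after publish
    // succeeds, so a crash in between leads to a re-publish (at-least-once).
    public static int PublishPending(InMemoryStore store, Action<OutboxMessage> publish)
    {
        var sent = 0;
        foreach (var message in store.Unpublished())
        {
            publish(message);
            message.Published = true;
            sent++;
        }
        return sent;
    }
}
```

Because delivery is at-least-once, consumers such as the reassignment saga have to tolerate seeing the same event twice, which is where idempotency comes in.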