.NET & C# Development · Lesson 222 of 229
Distributed Consensus — Raft, Leader Election, and Coordination in .NET
Why Consensus Is Fundamentally Hard
Before studying algorithms, understand why the problem exists. Two scenarios illustrate the core difficulty.
The Two Generals Problem
Two armies must coordinate an attack. They communicate only by messengers who may be captured. General A sends a messenger: "Attack at dawn?" If the messenger arrives and General B replies "Yes", that reply messenger may also be captured. General A cannot know whether B received the original message. B cannot know whether A received the reply.
The result: neither general can ever be certain the other will act. Even an infinite number of message exchanges does not solve this. In networking terms, there is no way to achieve absolute certainty over an unreliable channel.
This is not a theoretical problem. It is why TCP handshakes require three steps (not two), and why distributed transactions are expensive: you cannot guarantee that every participant received the commit signal.
Split-Brain
Imagine a primary database and two replicas behind a load balancer. A network partition isolates the primary from the replicas. The replicas elect a new primary. Now you have two primaries — each accepting writes — with no way to merge the results cleanly.
This is split-brain, and it corrupts data. The solutions all involve some form of majority quorum: require acknowledgement from more than half the nodes before proceeding. If you cannot reach a majority, you stop.
The Raft Algorithm
Raft was designed specifically to be understandable. It achieves consensus through three well-defined sub-problems: leader election, log replication, and safety.
Nodes and Terms
Every Raft node is in one of three states: follower, candidate, or leader. Time is divided into terms — monotonically increasing integers. A new term begins whenever a leader election starts.
Term 1 Term 2 Term 3
[Leader: Node1] [Leader: Node2] [Leader: Node1]
followers election followers
(Node2,3) conflict (Node2,3)A node that does not hear from a leader within an election timeout (typically 150–300ms, randomised) promotes itself to candidate and starts a new term.
Leader Election
When a follower times out, it:
- Increments its current term.
- Transitions to candidate state.
- Votes for itself.
- Sends
RequestVoteRPCs to all other nodes.
A node grants its vote if: it hasn't voted in this term yet, and the candidate's log is at least as up-to-date as its own.
If the candidate receives votes from a majority (floor(N/2) + 1 nodes), it becomes leader. With 5 nodes, that means 3 votes. This majority requirement is what prevents split-brain: only one candidate can achieve majority in a given term.
The randomised election timeout is the practical trick that prevents all followers from becoming candidates simultaneously. With jitter, one node almost always fires first and wins before others wake up.
Log Replication
Once elected, the leader accepts client requests. Each request becomes a log entry — an ordered, append-only record with a term number and the command payload.
The leader sends AppendEntries RPCs to followers. When a majority acknowledges the entry, the leader commits it, applies it to the state machine, and responds to the client. Uncommitted entries that were sent to followers but not yet acknowledged by a majority are safe to discard — they have not been applied anywhere.
Leader log: [1:SetX=1] [2:SetY=2] [3:SetX=3] ← committed through index 3
Follower A: [1:SetX=1] [2:SetY=2] [3:SetX=3] ← in sync
Follower B: [1:SetX=1] [2:SetY=2] ← one entry behind, will be sent
Follower C: [1:SetX=1] ← two entries behind, will be caught upIf the leader crashes after committing entry 3 but before followers receive it, the next elected leader will re-send that entry. Raft guarantees: an entry committed by a majority is never lost.
Safety Guarantee
A candidate cannot win an election if its log is less up-to-date than the voter's log. "Up-to-date" is defined as: higher term wins; if same term, longer log wins. This prevents a node with stale data from becoming leader and overwriting committed entries.
Where .NET Developers Encounter Consensus
You do not typically implement Raft yourself, but you interact with systems that do.
etcd (Kubernetes)
Every Kubernetes deployment uses etcd as its store for cluster state. etcd implements Raft internally. When you create a Deployment, the API server writes to etcd, which replicates the write to a majority of etcd nodes before confirming. Controllers (like the Deployment controller) watch etcd for changes and reconcile actual state toward desired state.
If the etcd leader is unavailable, Kubernetes control-plane writes block. Worker nodes keep running their current workloads, but no new scheduling decisions are made. This is CP behaviour: consistency over availability.
Redis Sentinel
Redis Sentinel runs consensus among sentinel processes (not Redis data nodes) to decide when to trigger failover. Sentinels vote on whether the primary is genuinely down. A majority of sentinels must agree before promoting a replica. With 3 sentinels, 2 must agree — this prevents a single sentinel with a flawed network view from triggering unnecessary failover.
SQL Server Availability Groups
WSFC (Windows Server Failover Cluster) and SQL Server AGs use their own quorum model. An AG with 3 replicas and 1 file-share witness requires majority quorum. If a network partition separates 2 replicas from the primary and witness, those 2 replicas cannot form a majority and will not elect a new primary — preventing split-brain at the cost of availability.
Practical Pattern: Distributed Locks with Redlock
The most common consensus-adjacent pattern in .NET is the distributed lock. The naive implementation — SET lock:resource NX EX 30 to a single Redis — has a failure mode: if the Redis primary dies after granting the lock but before the replica replicates it, the replica promoted to primary will grant the same lock to another client.
The Redlock algorithm addresses this by requiring the lock to be acquired on a majority of independent Redis instances.
Algorithm Steps
Given N independent Redis instances (N=5 is recommended):
- Get the current time in milliseconds:
t1. - Attempt to acquire the lock on all N instances with a short per-instance timeout (e.g., 50ms). Use the same random token as the lock value.
- If acquired on a majority (3 of 5) and the total elapsed time is less than the lock TTL, the lock is valid.
- The effective lock validity is
TTL - (t2 - t1)— the remaining time after acquisition. - To release: send a Lua script to all instances that deletes the key only if its value matches your token.
C# Redlock Implementation
public sealed class RedlockAcquisition : IAsyncDisposable
{
private readonly IReadOnlyList<IDatabase> _databases;
private readonly string _resource;
private readonly string _token;
private bool _disposed;
private RedlockAcquisition(
IReadOnlyList<IDatabase> databases,
string resource,
string token)
{
_databases = databases;
_resource = resource;
_token = token;
}
// Lua script ensures atomic check-and-delete — prevents releasing a lock
// owned by another client if our TTL expired and they grabbed it.
private const string ReleaseLua = """
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
""";
public static async Task<RedlockAcquisition?> TryAcquireAsync(
IReadOnlyList<IDatabase> databases,
string resource,
TimeSpan ttl,
int retryCount = 3,
TimeSpan? retryDelay = null)
{
var quorum = databases.Count / 2 + 1;
var token = Guid.NewGuid().ToString("N");
var delay = retryDelay ?? TimeSpan.FromMilliseconds(200);
for (var attempt = 0; attempt < retryCount; attempt++)
{
var start = Environment.TickCount64;
var acquired = 0;
// Attempt on all instances concurrently with per-instance timeout
var tasks = databases.Select(db =>
TryAcquireOnInstanceAsync(db, resource, token, ttl));
var results = await Task.WhenAll(tasks);
acquired = results.Count(r => r);
var elapsed = TimeSpan.FromMilliseconds(Environment.TickCount64 - start);
var validity = ttl - elapsed;
if (acquired >= quorum && validity > TimeSpan.Zero)
{
return new RedlockAcquisition(databases, resource, token);
}
// Did not achieve quorum — release whatever we acquired
await ReleaseAllAsync(databases, resource, token);
if (attempt < retryCount - 1)
await Task.Delay(delay + TimeSpan.FromMilliseconds(Random.Shared.Next(50)));
}
return null; // Could not acquire lock
}
private static async Task<bool> TryAcquireOnInstanceAsync(
IDatabase db, string resource, string token, TimeSpan ttl)
{
try
{
// NX = only set if Not eXists, expiry = absolute TTL
return await db.StringSetAsync(
$"lock:{resource}",
token,
ttl,
When.NotExists);
}
catch
{
return false; // Instance unavailable — counts as failure
}
}
private static async Task ReleaseAllAsync(
IReadOnlyList<IDatabase> databases, string resource, string token)
{
var releaseKey = new RedisKey[] { $"lock:{resource}" };
var releaseArg = new RedisValue[] { token };
var tasks = databases.Select(db =>
db.ScriptEvaluateAsync(ReleaseLua, releaseKey, releaseArg).AsTask());
await Task.WhenAll(tasks);
}
public async ValueTask DisposeAsync()
{
if (_disposed) return;
_disposed = true;
await ReleaseAllAsync(_databases, _resource, _token);
}
}Usage in a .NET service:
public class PaymentProcessor(IReadOnlyList<IDatabase> redisInstances)
{
public async Task ProcessAsync(Guid orderId)
{
await using var @lock = await RedlockAcquisition.TryAcquireAsync(
redisInstances,
resource: $"order:{orderId}",
ttl: TimeSpan.FromSeconds(30));
if (@lock is null)
throw new InvalidOperationException("Could not acquire lock — another node is processing this order.");
// Safe to proceed — we hold the lock
await DoPaymentWorkAsync(orderId);
}
private Task DoPaymentWorkAsync(Guid orderId) => Task.CompletedTask;
}Fencing Tokens — Defending Against Stale Lock Holders
Redlock has a subtle failure mode: if a process acquires a lock, experiences a GC pause or thread starvation longer than the lock TTL, the lock expires, another process acquires it, and now both processes believe they hold the lock.
The solution is a fencing token: a monotonically increasing integer issued with the lock, sent to the storage backend with every write. The backend rejects writes with tokens lower than the last seen.
// Fencing token from a central counter (etcd, SQL SEQUENCE, or Redis INCR)
public class FencedInventoryUpdater(NpgsqlDataSource db, IDatabase redis)
{
public async Task UpdateStockAsync(Guid productId, int newQty)
{
// Acquire a monotonically increasing token
var token = await redis.StringIncrementAsync("fencing:inventory");
// The UPDATE only succeeds if the token column value is < our token.
// Any stale holder with a lower token will find its UPDATE rejected.
await using var conn = await db.OpenConnectionAsync();
await using var cmd = conn.CreateCommand();
cmd.CommandText = """
UPDATE stock_levels
SET quantity = @qty,
last_fence_token = @token
WHERE product_id = @productId
AND last_fence_token < @token
""";
cmd.Parameters.AddWithValue("productId", productId);
cmd.Parameters.AddWithValue("qty", newQty);
cmd.Parameters.AddWithValue("token", (long)token);
var rows = await cmd.ExecuteNonQueryAsync();
if (rows == 0)
throw new InvalidOperationException("Stale lock holder detected — write rejected by fencing token.");
}
}The database column last_fence_token stores the highest token that successfully wrote. Any process with a lower token is provably stale and its write is rejected, even if it believes it holds a valid lock.
Leader Election with IDistributedLock
The .NET ecosystem offers higher-level abstractions. DistributedLock (the NuGet package by @madelson) provides IDistributedLock with support for SQL Server, Redis, and file-system backends.
// Leader election for a background processor — only one instance runs at a time
public class SingletonBackgroundProcessor(IDistributedLockProvider lockProvider) : BackgroundService
{
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
// TryAcquireAsync returns null immediately if another instance holds it
await using var handle = await lockProvider
.TryAcquireLockAsync("processor:leader", TimeSpan.FromMinutes(5), stoppingToken);
if (handle is not null)
{
// We are the leader — do work
await RunLeaderLoopAsync(stoppingToken);
}
else
{
// Another instance is leader — wait and retry
await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
}
}
}
private async Task RunLeaderLoopAsync(CancellationToken ct)
{
while (!ct.IsCancellationRequested)
{
await ProcessNextBatchAsync(ct);
await Task.Delay(TimeSpan.FromSeconds(5), ct);
}
}
private Task ProcessNextBatchAsync(CancellationToken ct) => Task.CompletedTask;
}This pattern is common for cron-like jobs in a scaled-out application where you cannot run multiple instances simultaneously (e.g., a scheduled report generator, a database compaction job).
When NOT to Use Consensus
Consensus has a cost: at minimum one round-trip to a quorum of nodes, plus retry logic. For high-throughput operations this adds 5–50ms per operation. Avoid consensus when these alternatives apply.
Idempotent Operations + Deduplication
If an operation is safe to execute multiple times with the same result, you do not need a lock. Instead, assign each operation a unique client-generated ID and deduplicate at the receiver.
// No distributed lock needed — idempotent payment processing
public class IdempotentPaymentHandler(AppDbContext db)
{
public async Task<PaymentResult> HandleAsync(ProcessPaymentCommand cmd)
{
// Check if we've already processed this idempotency key
var existing = await db.PaymentResults
.FirstOrDefaultAsync(r => r.IdempotencyKey == cmd.IdempotencyKey);
if (existing is not null)
return existing; // Return previous result — safe to call again
// Process and store atomically
var result = await ChargeAsync(cmd);
db.PaymentResults.Add(result);
await db.SaveChangesAsync();
return result;
}
private Task<PaymentResult> ChargeAsync(ProcessPaymentCommand cmd)
=> Task.FromResult(new PaymentResult { IdempotencyKey = cmd.IdempotencyKey });
}The database unique constraint on IdempotencyKey is the concurrency guard. If two requests race, one will get a constraint violation — handle it by returning the existing result.
Compare-And-Swap (Optimistic Concurrency)
For resource updates, optimistic locking via a version column eliminates the need for a distributed lock:
// Optimistic concurrency — no distributed lock, retry on conflict
public async Task<bool> TryUpdateAsync(Guid id, decimal newValue, int expectedVersion)
{
await using var conn = await _db.OpenConnectionAsync();
await using var cmd = conn.CreateCommand();
cmd.CommandText = """
UPDATE accounts
SET balance = @value, version = version + 1
WHERE id = @id AND version = @expectedVersion
""";
cmd.Parameters.AddWithValue("id", id);
cmd.Parameters.AddWithValue("value", newValue);
cmd.Parameters.AddWithValue("expectedVersion", expectedVersion);
return await cmd.ExecuteNonQueryAsync() == 1;
}Only reach for distributed locks when:
- The operation involves external systems (payment gateways, email sends) that cannot be rolled back.
- You need to prevent thundering herd against a slow external resource.
- The critical section spans multiple storage systems with no atomic cross-system transaction.
Key Takeaways
- Consensus guarantees that at most one leader makes decisions at a time, preventing split-brain.
- Raft achieves this through majority quorum: a candidate needs floor(N/2)+1 votes; a log entry needs acknowledgement from floor(N/2)+1 nodes before commit.
- You interact with Raft through etcd (Kubernetes), Redis Sentinel, and SQL Server AGs — understanding the quorum model helps you predict failure behaviour.
- Redlock provides practical distributed locking across multiple Redis instances. Always pair it with fencing tokens when the downstream storage supports monotonic version checks.
- Consensus is expensive — prefer idempotent operations with deduplication, or optimistic concurrency with compare-and-swap, whenever the domain allows it.