SLOs, SLIs, and Error Budgets — SRE Principles for .NET Services
Why SRE Concepts Matter for .NET Teams
Most .NET teams measure uptime with a dashboard and call it "monitoring." SRE goes further: you define precise numeric targets for reliability, measure them continuously, and use the gap between target and reality — the error budget — as the input to every deployment and feature decision.
When your error budget is healthy, ship fast. When it is burning, freeze deployments and fix reliability. This is not just philosophy; it is a feedback loop with teeth.
This article walks through the full lifecycle:
- Define SLIs that are worth measuring for .NET APIs
- Set SLOs that reflect real user expectations
- Calculate error budgets from first principles
- Implement the measurement layer with OpenTelemetry in C#
- Wire burn-rate alerts so you page before the budget expires
The Vocabulary: SLI, SLO, SLA, Error Budget
| Term | Definition | Example |
|---|---|---|
| SLI | A quantitative measure of a service dimension | 99th-percentile latency of POST /orders |
| SLO | The target value for an SLI over a rolling window | p99 latency ≤ 500 ms, measured over 30 days |
| SLA | A contractual promise (often external, with penalties) | "99.9% availability or we credit your account" |
| Error Budget | 100% − SLO — how much failure you are allowed | SLO 99.9% → budget = 0.1% of all requests |
The SLI is the ruler. The SLO is the line on the ruler. The error budget is the distance remaining to the line.
Choosing Good SLIs for .NET APIs
Not all metrics make good SLIs. A good SLI is:
- User-visible — something users actually feel
- Measurable at high fidelity — not sampled, not estimated
- Actionable — when it degrades, your team knows what to do
For a typical ASP.NET Core REST API, the canonical SLI set is:
1. Request Latency (p50 / p95 / p99)
Measures the distribution of response times. Users feel p99, not averages. Define one SLO per critical endpoint or endpoint group.
SLI: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
SLO: ≤ 500 ms over a 30-day rolling window
2. Availability (Success Rate)
The fraction of requests that succeed (HTTP 2xx or 3xx, excluding health checks and expected 4xx client errors).
SLI: sum(rate(http_requests_total{status=~"2..|3.."}[5m])) /
sum(rate(http_requests_total{status!~"4.."}[5m]))
SLO: ≥ 99.9% over a 30-day rolling window
3. Throughput
Requests per second sustained without latency degradation. Useful when you have contractual throughput guarantees.
4. Error Rate
The fraction of requests returning 5xx. Related to availability but focuses only on server-side faults, not redirects.
Do not use: CPU percent, memory percent, or queue depth as SLIs. These are causes of degradation, not measures of the user experience. They belong in runbooks, not SLOs.
Setting Realistic SLOs — The Downtime Math
The choice between 99.9% and 99.99% is not just a digit. It has operational implications:
| SLO | Allowed downtime per month | Allowed downtime per year |
|---|---|---|
| 99.0% | 7 h 18 min | 3 d 15 h |
| 99.5% | 3 h 39 min | 1 d 20 h |
| 99.9% | 43 min 49 sec | 8 h 46 min |
| 99.95% | 21 min 54 sec | 4 h 22 min |
| 99.99% | 4 min 22 sec | 52 min |
A 99.99% SLO means you have 4 minutes and 22 seconds of allowed failure per month. A single slow deployment that takes 6 minutes to drain connections will blow that budget.
Practical guidance:
- Start at 99.5% for internal services. Tighten when you have actual data.
- 99.9% is achievable with a single-region deployment and good health checks.
- 99.99% requires active-active multi-region, canary deployments, and significant operational investment.
- Never set an SLO tighter than your historical performance without infrastructure changes. You will instantly be in violation.
Error Budget Math
Given an SLO of 99.9% over a 30-day window:
Total minutes in 30 days = 30 × 24 × 60 = 43,200 minutes
Error budget = (1 − 0.999) × 43,200 = 43.2 minutes
For a request-based SLO (preferred because it weights by traffic, not calendar time):
Assume 10,000,000 requests per 30 days
Error budget = (1 − 0.999) × 10,000,000 = 10,000 bad requests allowed
Every 5xx response and every timeout draws from this pool, as does every request slower than the latency threshold if you count slow requests as bad. When the pool is empty, you are in SLO violation.
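The two calculations above can be sketched in C# as follows. This is a minimal illustration under the same assumptions (a 30-day window and 10,000,000 requests); the `ErrorBudgetMath` class and its method names are hypothetical and not part of any library.

```csharp
using System;

// Hypothetical helper for illustration; not part of OpenTelemetry or ASP.NET Core.
public static class ErrorBudgetMath
{
    // Time-based budget: the share of the window during which the service may be failing.
    public static TimeSpan AllowedDowntime(double sloTarget, TimeSpan window) =>
        TimeSpan.FromMinutes((1.0 - sloTarget) * window.TotalMinutes);

    // Request-based budget: how many requests in the window may be bad.
    // Rounded to the nearest whole request to absorb floating-point noise.
    public static long AllowedBadRequests(double sloTarget, long totalRequests) =>
        (long)Math.Round((1.0 - sloTarget) * totalRequests);
}
```

With these definitions, `ErrorBudgetMath.AllowedDowntime(0.999, TimeSpan.FromDays(30))` comes out at about 43.2 minutes, and `ErrorBudgetMath.AllowedBadRequests(0.999, 10_000_000)` at 10,000 requests, matching the figures above.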
Implementing SLIs in .NET with OpenTelemetry
Install the packages:
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Exporter.Prometheus.AspNetCore
dotnet add package System.Diagnostics.DiagnosticSource
The LatencySliRecorder Middleware
This recorder writes every HTTP request into a histogram that Prometheus can scrape; the middleware in the next section calls it once per request. It labels by route template (not raw path, which would cause a cardinality explosion with IDs in URLs), by status code, and by status code class.
using System.Diagnostics;
using System.Diagnostics.Metrics;
namespace YourApp.Observability;
/// <summary>
/// Records HTTP request latency into an OpenTelemetry histogram.
/// Designed to feed SLI/SLO dashboards in Prometheus + Grafana.
/// </summary>
public sealed class LatencySliRecorder : IDisposable
{
private readonly Meter _meter;
private readonly Histogram<double> _latencyHistogram;
private readonly Counter<long> _requestCounter;
private readonly Counter<long> _errorCounter;
public LatencySliRecorder(string serviceName)
{
_meter = new Meter(serviceName, "1.0.0");
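// Bucket boundaries are not set here. Configure them on the OpenTelemetry MeterProvider
// (for example with AddView); the SDK's default boundaries may not suit values recorded in seconds.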
_latencyHistogram = _meter.CreateHistogram<double>(
name: "http_request_duration_seconds",
unit: "s",
description: "HTTP request latency — feeds p99 SLI");
_requestCounter = _meter.CreateCounter<long>(
name: "http_requests_total",
description: "Total HTTP requests — feeds availability SLI");
_errorCounter = _meter.CreateCounter<long>(
name: "http_requests_errors_total",
description: "HTTP requests returning 5xx — feeds error rate SLI");
}
public void Record(string method, string routeTemplate, int statusCode, double durationSeconds)
{
var tags = new TagList
{
{ "method", method },
{ "route", routeTemplate }, // e.g. "/orders/{id}" not "/orders/42"
{ "status", statusCode.ToString() },
{ "status_class", StatusClass(statusCode) } // "2xx", "4xx", "5xx"
};
_latencyHistogram.Record(durationSeconds, tags);
_requestCounter.Add(1, tags);
if (statusCode >= 500)
_errorCounter.Add(1, tags);
}
private static string StatusClass(int code) => code switch
{
>= 500 => "5xx",
>= 400 => "4xx",
>= 300 => "3xx",
>= 200 => "2xx",
_ => "1xx"
};
public void Dispose() => _meter.Dispose();
}
Wire It in Middleware
using System.Diagnostics;
using Microsoft.AspNetCore.Routing;

namespace YourApp.Observability;

public sealed class SliMiddleware(RequestDelegate next, LatencySliRecorder recorder)
{
    public async Task InvokeAsync(HttpContext ctx)
    {
        var sw = Stopwatch.StartNew();
        var faulted = false;
        try
        {
            await next(ctx);
        }
        catch
        {
            // An unhandled exception becomes a 500 further up the pipeline,
            // but StatusCode may still read 200 at this point.
            faulted = true;
            throw;
        }
        finally
        {
            sw.Stop();
            // RoutePattern.RawText gives "/orders/{id}", which keeps label cardinality bounded.
            // Unmatched requests share one label value instead of falling back to the raw path.
            var route = (ctx.GetEndpoint() as RouteEndpoint)?.RoutePattern.RawText ?? "unmatched";
            recorder.Record(
                method: ctx.Request.Method,
                routeTemplate: route,
                statusCode: faulted ? 500 : ctx.Response.StatusCode,
                durationSeconds: sw.Elapsed.TotalSeconds);
        }
    }
}
Register in Program.cs
// Program.cs
builder.Services.AddSingleton(new LatencySliRecorder(builder.Environment.ApplicationName));
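builder.Services.AddOpenTelemetry()
.WithMetrics(metrics => metrics
.AddMeter(builder.Environment.ApplicationName)
// Explicit bucket boundaries in seconds, so histogram_quantile can resolve the 500 ms threshold.
// Requires `using OpenTelemetry.Metrics;`. The boundary values are a suggested starting point.
.AddView("http_request_duration_seconds", new ExplicitBucketHistogramConfiguration
{
Boundaries = new double[] { 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 }
})
.AddPrometheusExporter());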
var app = builder.Build();
app.UseMiddleware<SliMiddleware>();
// Prometheus scrape endpoint
app.MapPrometheusScrapingEndpoint("/metrics");
Error Budget Monitor: Querying Remaining Budget
In practice you query Prometheus (or Azure Monitor) for the current SLO compliance, then compute remaining budget. This class shows the pattern.
namespace YourApp.Observability;
/// <summary>
/// Queries Prometheus to compute remaining error budget for a given SLO.
/// Call this from a background service or health check endpoint.
/// </summary>
public sealed class ErrorBudgetMonitor(HttpClient prometheusClient, ILogger<ErrorBudgetMonitor> logger)
{
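// Good = 2xx and 3xx. Client errors (4xx) are left out of the total so they neither help nor hurt the SLI.
private const string GoodRequestsQuery =
@"sum(increase(http_requests_total{status_class=~""2xx|3xx""}[30d]))";
private const string TotalRequestsQuery =
@"sum(increase(http_requests_total{status_class!=""4xx""}[30d]))";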
public async Task<ErrorBudgetStatus> GetStatusAsync(double sloTarget, CancellationToken ct = default)
{
var good = await QueryScalarAsync(GoodRequestsQuery, ct);
var total = await QueryScalarAsync(TotalRequestsQuery, ct);
if (total == 0)
return ErrorBudgetStatus.Unavailable;
var currentSli = good / total;
var badRequests = total - good;
var budgetAllowed = (1.0 - sloTarget) * total;
var budgetRemaining = budgetAllowed - badRequests;
logger.LogInformation(
"Error budget: SLO={Slo:P2}, SLI={Sli:P4}, budget remaining={Remaining:N0} requests",
sloTarget, currentSli, budgetRemaining);
return new ErrorBudgetStatus(
SloTarget: sloTarget,
CurrentSli: currentSli,
TotalRequests: (long)total,
BadRequests: (long)badRequests,
BudgetAllowed: (long)budgetAllowed,
BudgetRemaining: (long)budgetRemaining,
BudgetPercentLeft: budgetRemaining / budgetAllowed * 100);
}
private async Task<double> QueryScalarAsync(string promql, CancellationToken ct)
{
var url = $"/api/v1/query?query={Uri.EscapeDataString(promql)}";
var response = await prometheusClient.GetFromJsonAsync<PrometheusResponse>(url, ct);
var value = response?.Data?.Result?.FirstOrDefault()?.Value?[1]?.ToString();
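// Prometheus returns sample values as invariant-culture strings, so parse them that way
return double.TryParse(value, System.Globalization.NumberStyles.Float,
System.Globalization.CultureInfo.InvariantCulture, out var d) ? d : 0;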
}
}
public record ErrorBudgetStatus(
double SloTarget,
double CurrentSli,
long TotalRequests,
long BadRequests,
long BudgetAllowed,
long BudgetRemaining,
double BudgetPercentLeft)
{
public static ErrorBudgetStatus Unavailable => new(0, 0, 0, 0, 0, 0, 0);
public bool IsHealthy => BudgetRemaining > 0;
public bool IsCritical => BudgetPercentLeft < 10;
}
// Minimal Prometheus response shape
public record PrometheusResponse(PrometheusData? Data);
public record PrometheusData(List<PrometheusResult>? Result);
public record PrometheusResult(object[]? Value);
Burn Rate Alerts: The Google SRE Formula
A burn rate of 1x means you are consuming the error budget at exactly the rate that will exhaust it at the end of the window. A burn rate of 14x means you will exhaust the entire monthly budget in roughly 2 days (30 days / 14 ≈ 2.1 days).
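The time-to-exhaustion arithmetic can be sketched as a small helper. This is a minimal illustration that assumes a constant burn rate and a full budget at the start of the window; the `BurnRate` class and its method are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BurnRate
{
    // At a constant burn rate, the budget for the whole window lasts window / burnRate.
    public static TimeSpan TimeToExhaustion(double burnRate, TimeSpan window)
    {
        if (burnRate <= 0)
            throw new ArgumentOutOfRangeException(nameof(burnRate), "Burn rate must be positive.");
        return TimeSpan.FromDays(window.TotalDays / burnRate);
    }
}
```

For a 30-day window, `BurnRate.TimeToExhaustion(14, TimeSpan.FromDays(30))` gives roughly 2.1 days and a burn rate of 6 gives 5 days.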
The tiers below follow the multi-window burn-rate approach from Google's Site Reliability Workbook, with thresholds simplified for this article (the Workbook's own recommendation is 14.4x over 1 h, 6x over 6 h, and 1x over 3 d):
| Severity | Burn Rate | Time to exhaust budget at this rate | Action |
|---|---|---|---|
| Page | ≥ 14x over 1 h | About 2 days | Wake on-call immediately |
| Page | ≥ 6x over 6 h | 5 days | Page, not emergency |
| Ticket | ≥ 3x over 3 d | 10 days | Create ticket, fix this week |
| Watch | ≥ 1x over 30 d | 30 days (end of window) | Monitor closely |
The burn rate for a given window is:
burn_rate = (1 - SLI_over_window) / (1 - SLO_target)
If SLO is 99.9% and SLI over the last hour is 99.0%:
burn_rate = (1 - 0.990) / (1 - 0.999) = 0.010 / 0.001 = 10x
Prometheus Alert YAML
# slo-alerts.yaml: load through rule_files in prometheus.yml
# (with the Prometheus Operator, wrap these groups in a PrometheusRule resource)
groups:
  - name: dotnet-api-slo
    interval: 1m
    rules:
      # Fast burn: page immediately
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_class=~"2xx|3xx"}[1h]))
              /
              sum(rate(http_requests_total{status_class!="4xx"}[1h]))
            )
          )
          /
          (1 - 0.999)  # replace 0.999 with your SLO target
          > 14
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget burning at >14x: {{ $value | humanize }}x burn rate"
          description: >
            The 1-hour burn rate is {{ $value | humanize }}x. A sustained 14x burn
            exhausts a 30-day error budget in about 2 days.
            Investigate recent deployments and check the Grafana SLO dashboard.
          runbook: "https://wiki.internal/runbooks/slo-fast-burn"
      # Slow burn: page, not emergency
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_class=~"2xx|3xx"}[6h]))
              /
              sum(rate(http_requests_total{status_class!="4xx"}[6h]))
            )
          )
          /
          (1 - 0.999)
          > 6
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Error budget burning at >6x: {{ $value | humanize }}x burn rate"
          runbook: "https://wiki.internal/runbooks/slo-slow-burn"
      # Ticket: fix this week
      - alert: ErrorBudgetTicket
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_class=~"2xx|3xx"}[3d]))
              /
              sum(rate(http_requests_total{status_class!="4xx"}[3d]))
            )
          )
          /
          (1 - 0.999)
          > 3
        for: 1h
        labels:
          severity: info
          team: platform
        annotations:
          summary: "Error budget burn rate >3x: create reliability ticket"
          runbook: "https://wiki.internal/runbooks/slo-ticket"
Wiring a Health Check that Exposes Budget Status
Expose error budget state through ASP.NET Core health checks so your deployment pipelines and status pages can gate on reliability.
namespace YourApp.Observability;
public sealed class ErrorBudgetHealthCheck(ErrorBudgetMonitor monitor) : IHealthCheck
{
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context,
CancellationToken cancellationToken = default)
{
var status = await monitor.GetStatusAsync(sloTarget: 0.999, ct: cancellationToken);
var data = new Dictionary<string, object>
{
["slo_target"] = $"{status.SloTarget:P2}",
["current_sli"] = $"{status.CurrentSli:P4}",
["budget_remaining"] = status.BudgetRemaining,
["budget_percent_left"] = $"{status.BudgetPercentLeft:F1}%",
["total_requests_30d"] = status.TotalRequests,
["bad_requests_30d"] = status.BadRequests,
};
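// No traffic in the window means there is nothing to judge the SLO against
if (status.TotalRequests == 0)
return HealthCheckResult.Degraded(
"No request data for the SLO window", data: data);
// Check exhaustion first: an exhausted budget also satisfies IsCritical
if (!status.IsHealthy)
return HealthCheckResult.Unhealthy(
"Error budget exhausted: SLO violated", data: data);
if (status.IsCritical)
return HealthCheckResult.Degraded(
"Error budget critically low (< 10% remaining)", data: data);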
return HealthCheckResult.Healthy("Error budget healthy", data);
}
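}
Register:
// ErrorBudgetMonitor needs an HttpClient pointed at Prometheus; the base address below is a placeholder
builder.Services.AddHttpClient<ErrorBudgetMonitor>(client =>
client.BaseAddress = new Uri("http://prometheus:9090"));
builder.Services.AddHealthChecks()
.AddCheck<ErrorBudgetHealthCheck>(
"error-budget",
failureStatus: HealthStatus.Degraded,
// Tagged for pipelines and status pages only. Keep it out of readiness probes:
// an exhausted budget should block deploys, not pull healthy instances out of rotation.
tags: ["slo"]);
Grafana Dashboard Queries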
Four panels worth having on your SLO dashboard:
# Panel 1: Current SLI (30-day success rate, 4xx excluded)
sum(rate(http_requests_total{status_class=~"2xx|3xx"}[30d]))
/
sum(rate(http_requests_total{status_class!="4xx"}[30d]))
# Panel 2: Error budget remaining (as fraction of total budget)
1 - (
(sum(increase(http_requests_total{status_class!="4xx"}[30d])) - sum(increase(http_requests_total{status_class=~"2xx|3xx"}[30d])))
/
(sum(increase(http_requests_total{status_class!="4xx"}[30d])) * (1 - 0.999))
)
# Panel 3: 1-hour burn rate (for fast-burn alert visibility)
(1 - sum(rate(http_requests_total{status_class=~"2xx|3xx"}[1h])) / sum(rate(http_requests_total{status_class!="4xx"}[1h])))
/ (1 - 0.999)
# Panel 4: p99 latency (feeds latency SLO)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Practical Advice for .NET Teams Starting Out
Start with two SLOs: availability (success rate ≥ 99.9%) and latency (p99 ≤ 500 ms). Add more only when you have data and team bandwidth to act on violations.
Use rolling windows, not calendar windows. A 30-day rolling window means you always have 30 days of data; a monthly calendar window resets budgets on the 1st and creates perverse incentives to burn budget early in the month.
Exclude health check requests from SLIs. The /health endpoint fires every few seconds from load balancers. Including it inflates your good-request count and makes your SLI misleadingly optimistic.
Exclude expected 4xx from the availability SLI. A 404 because the user typed a wrong ID is not a reliability failure; it is a user error. Drop such responses from both the numerator and the denominator, and keep counting only those 4xx codes that, in your domain, point to a fault on your side.
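The two exclusion rules above can be sketched as one predicate that decides whether a request counts toward the availability SLI. This is a minimal illustration: the `SliFilter` class is hypothetical, and the `/health` and `/metrics` prefixes are assumptions about where your probe and scrape endpoints live.

```csharp
using System;

// Hypothetical filter for illustration; the excluded prefixes are assumptions.
public static class SliFilter
{
    private static readonly string[] ExcludedPrefixes = { "/health", "/metrics" };

    // True when a request should count toward the availability SLI.
    public static bool IsCounted(string path, int statusCode)
    {
        foreach (var prefix in ExcludedPrefixes)
        {
            // Probe and scrape traffic says nothing about the user experience.
            if (path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                return false;
        }
        // Skip 4xx client errors; count everything else (successes and server faults).
        return statusCode is < 400 or >= 500;
    }
}
```

A plain prefix match also excludes paths such as `/healthcare`. Inside ASP.NET Core middleware, `ctx.Request.Path.StartsWithSegments("/health")` is the segment-aware alternative.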
Make error budget a team ritual. Review budget burn weekly in the engineering standup. When the budget drops below 50%, every new feature deployment requires explicit sign-off from the on-call engineer.
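The sign-off rule can be turned into an explicit deployment gate. The sketch below is a hypothetical policy, not a prescribed one: the 50% threshold comes from the ritual described above, and freezing at an exhausted budget follows the "freeze deployments and fix reliability" rule from the introduction. `DeployDecision` and `ErrorBudgetPolicy` are illustrative names.

```csharp
using System;

// Hypothetical policy sketch; thresholds mirror the rules described in this article.
public enum DeployDecision { Allow, RequireOnCallSignOff, Freeze }

public static class ErrorBudgetPolicy
{
    // budgetPercentLeft: remaining error budget as a percentage of the allowed budget.
    public static DeployDecision Evaluate(double budgetPercentLeft) => budgetPercentLeft switch
    {
        <= 0.0 => DeployDecision.Freeze,                // budget exhausted: reliability work only
        < 50.0 => DeployDecision.RequireOnCallSignOff,  // under half left: explicit sign-off
        _ => DeployDecision.Allow                       // healthy budget: ship normally
    };
}
```

A pipeline step could feed `ErrorBudgetStatus.BudgetPercentLeft` into `Evaluate` and fail the stage on anything other than `Allow` unless a sign-off is recorded.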