Platform Engineering: Observability Platform — Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
Build a production observability stack on Kubernetes — Prometheus Operator with ServiceMonitor/PodMonitor, Grafana dashboards and alerting, Loki log aggregation, Tempo distributed tracing, OpenTelemetry Collector, and SLO-based alerting strategy.
The Three Pillars — and the Fourth
Metrics (Prometheus): numeric aggregates over time — request rate, error rate, latency percentiles, CPU, memory.
Logs (Loki): structured event records — what happened, when, and with what context.
Traces (Tempo): request flows across services — which calls were made, how long each took, where failures originated.
Events (Kubernetes Events): cluster-level state changes — pod evictions, scheduling failures, OOMKills. Often ignored until something breaks.
The observability platform's job is to collect all four, correlate them (a trace links to its logs, logs link to their metrics), and make them queryable without each team having to run or configure its own backends.
kube-prometheus-stack: The Foundation
The kube-prometheus-stack Helm chart installs Prometheus Operator, Alertmanager, Grafana, and a full set of Kubernetes dashboards in a single Helm release:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values observability-values.yaml
observability-values.yaml:
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
    # Discover ServiceMonitors, PodMonitors, and PrometheusRules in ALL namespaces,
    # not only the ones labelled with this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 8Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi
grafana:
  adminPassword: "change-me-via-secret"
  persistence:
    enabled: true
    size: 10Gi
  plugins:
    - grafana-piechart-panel
    - grafana-clock-panel
# Include all Kubernetes component metrics
kubeEtcd:
  enabled: true
kubeScheduler:
  enabled: true
kubeControllerManager:
  enabled: true
After install, Grafana ships with 20+ pre-built dashboards for Kubernetes nodes, pods, namespaces, API server, etcd, and persistent volumes.
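The retention and storage values in the file above have to be sized against ingestion volume. The Prometheus storage documentation gives a rule of thumb: retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. A minimal sketch of that arithmetic, where the 20,000 samples/s ingestion rate is an illustrative assumption rather than a figure from this article:

```python
def estimate_tsdb_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rule-of-thumb Prometheus disk estimate: retention * ingest rate * sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster ingesting 20,000 samples/s, with the 15d retention used above.
needed = estimate_tsdb_bytes(retention_days=15, samples_per_second=20_000)
print(f"approx. {needed / 1e9:.1f} GB needed for 15d")
```

If the estimate comes out above retentionSize, Prometheus deletes the oldest blocks first, so effective retention ends up shorter than the configured 15d.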
ServiceMonitor: Auto-Discover Application Metrics
Teams expose /metrics (Prometheus format) from their app. A ServiceMonitor tells Prometheus to scrape it — without editing any Prometheus configuration.
Team creates a ServiceMonitor
# Applied by the team alongside their app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: payments
  labels:
    app: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics  # matches Service port named "metrics"
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      metricRelabelings:
        # Drop metrics nobody queries to save storage (Go GC internals, as an example)
        - sourceLabels: [__name__]
          regex: "go_gc_.*"
          action: drop
Prometheus discovers this ServiceMonitor through the operator's ServiceMonitor selectors (left empty in the values above, so every namespace is included) and starts scraping automatically. No Prometheus configuration files to edit.
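For teams that have never exposed metrics, it helps to see what Prometheus actually receives when it scrapes /metrics. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. The in-memory REQUESTS dict and the port are illustrative assumptions; a real service would normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter keyed by (method, status).
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(requests):
    """Render request counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocking helper: serve /metrics on the port behind the Service's "metrics" port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() in the app's startup path is enough for the ServiceMonitor above to find a scrape target, provided the Service exposes that port under the name "metrics".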
PodMonitor: scrape pods without a Service
For cases where there's no Kubernetes Service in front of the pod (e.g., daemonset agents):
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: fluentd-pods
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
PrometheusRule: define alerts as code
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-alerts
  namespace: payments
spec:
  groups:
    - name: payment-service
      interval: 30s
      rules:
        # Alert when error rate > 1% for 5 minutes
        - alert: PaymentServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{
              job="payment-service",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{job="payment-service"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service error rate > 1%"
            description: "Error rate is {{ $value | humanizePercentage }}"
            runbook_url: "https://runbooks.example.com/payment-high-error-rate"
        # Alert when p99 latency > 500ms
        - alert: PaymentServiceSlowP99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                job="payment-service"
              }[5m])) by (le)
            ) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Payment service p99 latency > 500ms"
Grafana: Dashboards as Code
Grafana dashboards managed manually drift and diverge across teams. Dashboard-as-code via ConfigMaps ensures dashboards are reproducible and version-controlled:
# ConfigMap with dashboard JSON; the Grafana sidecar auto-imports it
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # Grafana sidecar watches this label
data:
  payment-service.json: |
    {
      "title": "Payment Service",
      "uid": "payment-service",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{job='payment-service'}[5m])) by (status)"
            }
          ]
        },
        {
          "title": "P99 Latency",
          "type": "timeseries",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job='payment-service'}[5m])) by (le))"
            }
          ]
        }
      ]
    }
Enable the Grafana sidecar in Helm values:
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL  # find dashboards in any namespace
Every team ships their dashboard alongside their app. No centralized Grafana admin required.
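Hand-writing dashboard JSON inside a ConfigMap gets tedious once several services need the same panels. One option is to generate the manifest. The sketch below rebuilds the two panels shown earlier for any service name. The helper names are illustrative, and the panel type "timeseries" assumes a Grafana version that ships the Time series panel.

```python
import json


def panel(title, expr, panel_type="timeseries"):
    """One Grafana panel with a single PromQL target."""
    return {"title": title, "type": panel_type, "targets": [{"expr": expr}]}


def dashboard_configmap(service, namespace="monitoring"):
    """Wrap a generated dashboard in a ConfigMap the Grafana sidecar will import."""
    selector = f'{{job="{service}"}}'
    dashboard = {
        "title": service.replace("-", " ").title(),
        "uid": service,
        "panels": [
            panel(
                "Request Rate",
                f"sum(rate(http_requests_total{selector}[5m])) by (status)",
            ),
            panel(
                "P99 Latency",
                "histogram_quantile(0.99, sum(rate("
                f"http_request_duration_seconds_bucket{selector}[5m])) by (le))",
            ),
        ],
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{service}-dashboard",
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},  # label the sidecar watches
        },
        "data": {f"{service}.json": json.dumps(dashboard, indent=2)},
    }


print(json.dumps(dashboard_configmap("payment-service"), indent=2))
```

The output is JSON, which kubectl apply -f - accepts directly, so the generator can run in CI next to the service's other manifests.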
Loki: Log Aggregation
Loki is Prometheus-like log aggregation — it indexes only log labels (not full-text), keeping storage costs low while making logs queryable by namespace, pod, container, and custom labels.
Install
helm repo add grafana https://grafana.github.io/helm-charts
# grafana.enabled=false reuses the existing Grafana; promtail.enabled=true deploys Promtail as a DaemonSet
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi
Promtail (DaemonSet on every node) tails /var/log/pods/ and ships logs to Loki with Kubernetes labels automatically attached: namespace, pod, container, node.
LogQL: querying logs
# All payment-service log lines containing "error" (the time range comes from the Grafana time picker)
{namespace="payments", app="payment-service"} |= "error" | json
# Error rate over time (metric query from logs)
sum(rate({namespace="payments"} |= "ERROR" [5m])) by (app)
# Parse JSON logs and filter by field
{namespace="payments"}
  | json
  | level="ERROR"
  | traceId="abc123"
# Show slow requests (parsed from structured logs; durationMs is in milliseconds)
{namespace="payments"}
  | json
  | durationMs > 1000
  | line_format "{{.method}} {{.path}} {{.durationMs}}ms"
Structured logging standard
Platform teams should mandate structured JSON logs:
{
  "timestamp": "2026-06-11T10:23:45Z",
  "level": "ERROR",
  "message": "Payment processing failed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "userId": "usr_abc123",
  "paymentId": "pay_xyz789",
  "error": "insufficient_funds",
  "durationMs": 45
}
Include traceId and spanId in every log line. This enables log-to-trace correlation in Grafana: click a log line → jump to the Tempo trace for that request.
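As a rough illustration of that standard, the sketch below emits JSON log lines with traceId and spanId using only the standard library. It assumes the IDs arrive in a W3C traceparent header. The helper names are hypothetical, and an auto-instrumented app would normally read the IDs from the active span instead of re-parsing the header.

```python
import json
import logging
import re
from datetime import datetime, timezone

# W3C traceparent layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, span_id) from a traceparent header, or (None, None)."""
    match = TRACEPARENT.match(header or "")
    if not match:
        return None, None
    return match.group(2), match.group(3)


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object using the field names shown above."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        return json.dumps({
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        })


def log_with_trace(logger, level, message, traceparent):
    """Log a message stamped with the IDs carried by the incoming request."""
    trace_id, span_id = parse_traceparent(traceparent)
    logger.log(level, message, extra={"traceId": trace_id, "spanId": span_id})


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_with_trace(
    logger,
    logging.ERROR,
    "Payment processing failed",
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
```

Because the field names match the standard, the LogQL filter `| json | traceId="..."` works on these lines without extra parsing rules.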
Tempo: Distributed Tracing
Tempo stores traces at low cost using object storage (S3/GCS). It integrates with Grafana for trace search and visualization.
Install
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=trace-storage \
  --set tempo.storage.trace.s3.region=eu-west-1
OpenTelemetry Collector: the pipeline
The OpenTelemetry Collector acts as a central telemetry pipeline: it receives traces, metrics, and logs from apps, processes them, and routes them to backends:
# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel  # the operator appends "-collector", so the Service is named otel-collector
  namespace: monitoring
spec:
  mode: daemonset  # one collector per node (the operator expects lowercase mode names)
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      # Also receive Jaeger and Zipkin formats
      jaeger:
        protocols:
          grpc:
      zipkin:
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      # Add k8s metadata to every span
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.node.name
    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889  # for span metrics; only takes effect once a metrics pipeline uses it
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger, zipkin]
          # k8sattributes runs before batch so it can still identify each span's source pod
          processors: [k8sattributes, batch]
          exporters: [otlp/tempo]
Apps send traces to http://otel-collector.monitoring:4318 (HTTP) or otel-collector.monitoring:4317 (gRPC). The collector adds Kubernetes metadata automatically, so no app needs to know its own pod name or namespace.
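To make "apps send traces to the collector" concrete, here is a hedged sketch of a single hand-built span posted over OTLP/HTTP in its JSON encoding. It uses only the standard library. The scope name "manual-example" and the function names are illustrative assumptions; production services would use an OpenTelemetry SDK or auto-instrumentation rather than building payloads by hand.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_trace(service_name, span_name, duration_ms):
    """Build a one-span OTLP JSON payload (hex IDs, nanosecond times as strings)."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 random bytes, 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 random bytes, 16 hex chars
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_trace(payload, endpoint="http://otel-collector.monitoring:4318"):
    """POST the payload to the collector's OTLP/HTTP traces path and return the status."""
    request = urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_otlp_trace("payment-service", "POST /charge", duration_ms=45)
print(json.dumps(payload, indent=2))
```

send_trace(payload) only works from inside the cluster, where the collector Service name resolves. It is a quick way to smoke-test the pipeline end to end before any app is instrumented.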
Instrumenting apps
Python (FastAPI) with automatic instrumentation:
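pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
# Start the app through the wrapper, e.g. "opentelemetry-instrument uvicorn main:app" (main:app is a placeholder)
# No code changes needed: auto-instrumentation is configured via env vars
# In Kubernetes Deployment:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"  # port 4318 speaks OTLP/HTTP; the Python distro defaults to gRPC
  - name: OTEL_SERVICE_NAME
    value: "payment-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"
.NET auto-instrumentation via OpenTelemetry Operator: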
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation
  namespace: payments
spec:
  exporter:
    endpoint: http://otel-collector.monitoring:4318
  propagators:
    - tracecontext
    - baggage
  dotnet:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['app']
Add one annotation to the Deployment's pod template (not the Deployment's own metadata) and the operator injects the OpenTelemetry auto-instrumentation:
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-dotnet: "true"
Exemplars: Linking Metrics to Traces
Exemplars are sample trace IDs attached to Prometheus metric observations. When you see a spike in a Grafana metric graph, click a data point → Grafana shows the trace IDs that were active at that moment → one click to the Tempo trace.
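On the wire, an exemplar is a suffix on a sample line in the OpenMetrics exposition format: a `#`, a label set, and the observed value. A minimal sketch of what one histogram bucket line looks like with an exemplar attached follows. The function name is illustrative, and client libraries normally generate this line for you.

```python
def bucket_line_with_exemplar(metric, le, cumulative_count, trace_id, observed_value):
    """Render one OpenMetrics histogram bucket sample carrying a trace-ID exemplar."""
    return (
        f'{metric}_bucket{{le="{le}"}} {cumulative_count} '
        f'# {{traceID="{trace_id}"}} {observed_value}'
    )


print(bucket_line_with_exemplar(
    "http_request_duration_seconds", "0.5", 95,
    "4bf92f3577b34da6a3ce929d0e0e4736", 0.43,
))
```

The label name (traceID here) is what the Grafana data source setting exemplarTraceIdDestinations has to match.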
// Go: record a histogram observation with an exemplar.
// With() returns a prometheus.Observer, so assert to ExemplarObserver first.
// Exemplars are only scraped when the handler serves OpenMetrics
// (promhttp.HandlerOpts{EnableOpenMetrics: true}).
histogram.With(prometheus.Labels{}).(prometheus.ExemplarObserver).ObserveWithExemplar(
	duration.Seconds(),
	prometheus.Labels{"traceID": traceID},
)
Enable exemplar storage in Prometheus:
prometheusSpec:
  enableFeatures:
    - exemplar-storage
Enable in Grafana data source:
datasources:
  - name: Prometheus
    type: prometheus
    url: http://kube-prometheus-stack-prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo  # link to Tempo datasource
Alerting Strategy: Symptom-Based Alerts
Anti-pattern: alert on causes — "CPU > 80%", "disk > 70%", "pod restarted".
These generate noise. A pod restarting once is not an incident. CPU at 80% may be normal.
Pattern: alert on user-visible symptoms — error rate, latency, availability.
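The latency symptom in this article is always expressed through histogram_quantile, so it is worth knowing how that estimate is produced. The sketch below mirrors the usual approach for classic bucketed histograms: find the bucket that contains the target rank, then interpolate linearly inside it. The bucket counts are made-up illustrative data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least two entries,
    and end with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
BUCKETS = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.9, BUCKETS))
```

The result can never be more precise than the bucket layout. An alert on p99 > 0.5 is only trustworthy if the histogram has a bucket boundary at or near 0.5.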
# Golden signals alerting: the four metrics that matter
rules:
  # 1. Error rate (anything that returns 5xx to users)
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
      sum(rate(http_requests_total[5m])) by (job)
      > 0.01
  # 2. Latency (p99 over SLO)
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      > 1.0
  # 3. Saturation (queue growing, which indicates a capacity issue)
  - alert: RequestQueueDepthHigh
    expr: http_request_queue_depth > 100
  # 4. Traffic drop (a sudden drop may indicate failure, not just low traffic)
  - alert: TrafficDropped
    expr: |
      sum(rate(http_requests_total[5m])) by (job)
      < sum(rate(http_requests_total[5m] offset 1h)) by (job) * 0.5
Multi-burn-rate alerts for SLOs
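The burn-rate multipliers used in this section (14.4 and 6) come from simple arithmetic. The sketch below assumes a 30-day SLO window and the 99% target implied by the 0.01 factor in the rules. Both are assumptions to adjust for your own SLO.

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """Hours until the whole error budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate, alert_window_hours, window_days=30):
    """Fraction of the error budget spent during one alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


def error_ratio_threshold(burn_rate, slo=0.99):
    """Error ratio that corresponds to a burn rate for a given SLO target."""
    return burn_rate * (1 - slo)


for rate, window_hours in [(14.4, 1), (6, 6)]:
    print(
        f"burn rate {rate}: threshold {error_ratio_threshold(rate):.3f}, "
        f"budget gone in {hours_to_exhaustion(rate):.0f}h, "
        f"{budget_consumed(rate, window_hours):.0%} spent per {window_hours}h window"
    )
```

A burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget and would exhaust it in about two days, which is why it pages. A burn rate of 6 over six hours spends 5% and would take five days, so a ticket is enough.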
# Alert when error budget is burning fast (page now) or slowly (ticket)
- alert: SLOErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payment-service",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="payment-service"}[1h]))
    ) > (14.4 * 0.01)  # 14.4x the SLO error rate = a 30-day budget gone in about 2 days
  labels:
    severity: critical  # page the on-call
    slo: payment-service
- alert: SLOErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payment-service",status=~"5.."}[6h]))
      / sum(rate(http_requests_total{job="payment-service"}[6h]))
    ) > (6 * 0.01)  # 6x the SLO error rate = a 30-day budget gone in 5 days
  labels:
    severity: warning  # create a ticket
Alertmanager Routing
# alertmanager-config.yaml
route:
  group_by: [alertname, team, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-receiver
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
      continue: false
    - matchers:
        - team="payments"
      receiver: payments-slack
    - matchers:
        - severity="warning"
      receiver: slack-warnings
receivers:
  # Every receiver named in a route must be defined or Alertmanager rejects the config.
  # These two are empty placeholders; add real integrations before relying on them.
  - name: default-receiver
  - name: slack-warnings
  # Alertmanager does not expand ${...}; substitute these values before loading the file.
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
  - name: payments-slack
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_PAYMENTS}"
        channel: "#payments-alerts"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\nRunbook: {{ .Annotations.runbook_url }}{{ end }}"
Route by team label: every team's PrometheusRule sets a team label, and Alertmanager routes alerts to the right Slack channel automatically. The platform team manages the routing config; teams own their alert rules.
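Route order matters more than it first appears. The sketch below is a simplified model of how a routing tree picks receivers: children are tried in order, the first match wins unless it sets continue, and the parent's receiver is the fallback. It supports equality matchers only and is not Alertmanager's actual code.

```python
def matches(matchers, labels):
    """True when every equality matcher is satisfied by the alert's labels."""
    return all(labels.get(key) == value for key, value in matchers.items())


def route_alert(route, labels):
    """Return the receivers for an alert, walking child routes in order."""
    receivers = []
    for child in route.get("routes", []):
        if matches(child.get("matchers", {}), labels):
            receivers.extend(route_alert(child, labels))
            if not child.get("continue", False):
                break  # first match wins unless the route opts into continue
    # No child matched: the alert is handled by this node's own receiver.
    return receivers or [route["receiver"]]


# The routing tree from the config above, reduced to the fields this model needs.
ROUTES = {
    "receiver": "default-receiver",
    "routes": [
        {"matchers": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"matchers": {"team": "payments"}, "receiver": "payments-slack"},
        {"matchers": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

print(route_alert(ROUTES, {"severity": "critical", "team": "payments"}))
print(route_alert(ROUTES, {"severity": "warning", "team": "payments"}))
```

Under this model a critical payments alert pages the on-call but never reaches #payments-alerts, because the first route stops evaluation. Setting continue: true on the critical route would deliver it to both.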
Platform Observability Self-Check
Monitor the observability platform itself:
# Alert if Prometheus has scrape failures
- alert: PrometheusScrapeFailed
  expr: up == 0
  for: 5m
  annotations:
    summary: "Prometheus cannot scrape {{ $labels.job }} in {{ $labels.namespace }}"
# Alert if Loki is discarding log lines (rate limits, validation errors)
- alert: LokiDroppedLogs
  expr: sum(rate(loki_discarded_samples_total[5m])) by (reason) > 0
# Alert if OTel Collector has export failures
- alert: OTelCollectorExportFailed
  expr: sum(rate(otelcol_exporter_send_failed_spans_total[5m])) > 0
Grafana Explore: Correlated Debugging
The full correlation flow in Grafana Explore:
1. Alert fires: PaymentServiceHighErrorRate
2. Open Grafana Explore → Prometheus → query: error rate spike at 14:32
3. Click data point → Exemplar → jump to Tempo trace (TraceID: abc123)
4. Trace shows: payment-service → inventory-service (timeout at 2100ms)
5. Switch to Loki → filter: {namespace="inventory"} | json | traceId="abc123"
6. Log shows: "database connection pool exhausted (connections: 10/10)"
7. Root cause found in < 5 minutes
This is the end state the observability platform should make possible: from alert to root cause without switching tools, without grep-ing raw logs, without asking another team.
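Step 5 of that flow can also be scripted, for example from a runbook. The sketch below builds a request URL for Loki's query_range HTTP endpoint that selects every JSON log line carrying a given trace ID. The base URL is a hypothetical in-cluster address and the function name is illustrative.

```python
from urllib.parse import urlencode


def trace_logs_url(base_url, namespace, trace_id, start_ns, end_ns, limit=100):
    """Build a Loki query_range URL for all JSON log lines carrying one trace ID."""
    logql = f'{{namespace="{namespace}"}} | json | traceId="{trace_id}"'
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical Loki address; start and end are Unix timestamps in nanoseconds.
print(trace_logs_url(
    "http://loki.monitoring:3100",
    namespace="inventory",
    trace_id="abc123",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
))
```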