Platform Engineering: Observability Platform — Prometheus, Grafana, Loki, Tempo, and OpenTelemetry

The Three Pillars — and the Fourth

Metrics (Prometheus): numeric aggregates over time — request rate, error rate, latency percentiles, CPU, memory.

Logs (Loki): structured event records — what happened, when, and with what context.

Traces (Tempo): request flows across services — which calls were made, how long each took, where failures originated.

Events (Kubernetes Events): cluster-level state changes — pod evictions, scheduling failures, OOMKills. Often ignored until something breaks.

The observability platform's job is to collect all four, correlate them (a trace links to its logs, logs link to their metrics), and make them queryable without each team having to build or operate its own telemetry stack.

kube-prometheus-stack: The Foundation

The kube-prometheus-stack Helm chart installs Prometheus Operator, Alertmanager, Grafana, and a full set of Kubernetes dashboards in a single command:

Bash

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values observability-values.yaml

observability-values.yaml:

YAML

prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
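    # Discover ServiceMonitors, PodMonitors, and PrometheusRules in ALL namespaces.
    # Without the *NilUsesHelmValues: false flags, the chart only selects
    # objects labelled with the Helm release name.
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}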
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 8Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi

grafana:
  # Placeholder only: in production, reference a Secret via admin.existingSecret
  adminPassword: "change-me-via-secret"
  persistence:
    enabled: true
    size: 10Gi
  plugins:
    - grafana-piechart-panel
    - grafana-clock-panel

# Include all Kubernetes component metrics
kubeEtcd:
  enabled: true
kubeScheduler:
  enabled: true
kubeControllerManager:
  enabled: true

After install, Grafana ships with 20+ pre-built dashboards for Kubernetes nodes, pods, namespaces, API server, etcd, and persistent volumes.

ServiceMonitor: Auto-Discover Application Metrics

Teams expose /metrics (Prometheus format) from their app. A ServiceMonitor tells Prometheus to scrape it — without editing any Prometheus configuration.

Team creates a ServiceMonitor

YAML

# Applied by the team alongside their app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: payments
  labels:
    app: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics        # matches Service port named "metrics"
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      metricRelabelings:
        # Drop low-value Go runtime metrics to reduce the stored series count
        - sourceLabels: [__name__]
          regex: "go_gc_.*"
          action: drop

Prometheus discovers this ServiceMonitor via its label selector and starts scraping automatically. No Prometheus configuration files to edit.
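
For orientation, the text that a /metrics endpoint returns looks roughly like what the sketch below renders. This is a standard-library illustration only: the function name render_counter and the sample values are hypothetical, and a real service would normally rely on a Prometheus client library rather than formatting the text by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Hypothetical helper,
    not part of any Prometheus library.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"status": "200"}, 1027), ({"status": "500"}, 3)],  # made-up values
)
print(body)
```

Each non-comment line is one time series sample: metric name, label set, and current value. Prometheus adds the job and instance labels itself at scrape time.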

PodMonitor: scrape pods without a Service

For cases where there's no Kubernetes Service in front of the pod (e.g., daemonset agents):

YAML

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: fluentd-pods
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  podMetricsEndpoints:
    - port: metrics
      path: /metrics

PrometheusRule: define alerts as code

YAML

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-alerts
  namespace: payments
spec:
  groups:
    - name: payment-service
      interval: 30s
      rules:
        # Alert when error rate > 1% for 5 minutes
        - alert: PaymentServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{
              job="payment-service",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{job="payment-service"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service error rate > 1%"
            description: "Error rate is {{ $value | humanizePercentage }}"
            runbook_url: "https://runbooks.example.com/payment-high-error-rate"

        # Alert when p99 latency > 500ms
        - alert: PaymentServiceSlowP99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                job="payment-service"
              }[5m])) by (le)
            ) > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency > 500ms"
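
To make the p99 rule less of a black box, here is a simplified Python sketch of the kind of estimate histogram_quantile produces: it finds the bucket containing the target rank and interpolates linearly inside it. The bucket boundaries and counts are invented for illustration, and several edge cases that Prometheus handles are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: linear interpolation inside the bucket that contains
    the target rank. Not a full reimplementation of the PromQL function.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]            # the +Inf bucket carries the total count
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound     # cannot interpolate here
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical cumulative bucket counts, upper bounds in seconds
buckets = [(0.1, 900), (0.25, 960), (0.5, 985), (1.0, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(f"estimated p99: {p99:.3f}s, alert condition met: {p99 > 0.5}")
```

With these numbers the 99th percentile falls in the 0.5s to 1.0s bucket, so the estimate lands at about 0.69s and the rule's > 0.5 condition holds. The accuracy of the estimate depends on how finely the buckets are spaced around the threshold.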

Grafana: Dashboards as Code

Grafana dashboards managed manually drift and diverge across teams. Dashboard-as-code via ConfigMaps ensures dashboards are reproducible and version-controlled:

YAML

# ConfigMap with dashboard JSON — Grafana sidecar auto-imports it
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # Grafana sidecar watches this label
data:
  payment-service.json: |
    {
      "title": "Payment Service",
      "uid": "payment-service",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{job='payment-service'}[5m])) by (status)"
            }
          ]
        },
        {
          "title": "P99 Latency",
          "type": "timeseries",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job='payment-service'}[5m])) by (le))"
            }
          ]
        }
      ]
    }

Enable the Grafana sidecar in Helm values:

YAML

grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL    # find dashboards in any namespace

Every team ships their dashboard alongside their app. No centralized Grafana admin required.
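
Teams that keep dashboard JSON in their repository may prefer to generate the ConfigMap rather than hand-write it. The following is a minimal sketch using only the standard library; the helper name dashboard_configmap is hypothetical, and it assumes the sidecar label shown above (grafana_dashboard).

```python
import json


def dashboard_configmap(name, namespace, dashboard):
    """Wrap a dashboard dict in a ConfigMap the Grafana sidecar can discover."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{name}-dashboard",
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},  # label the sidecar watches
        },
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


dashboard = {
    "title": "Payment Service",
    "uid": "payment-service",
    "panels": [],  # panel definitions omitted in this sketch
}
manifest = dashboard_configmap("payment-service", "monitoring", dashboard)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped to kubectl apply -f - or committed next to the application's other manifests.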

Loki: Log Aggregation

Loki is Prometheus-like log aggregation — it indexes only log labels (not full-text), keeping storage costs low while making logs queryable by namespace, pod, container, and custom labels.

Install

Bash

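helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# grafana.enabled=false reuses the existing Grafana;
# promtail.enabled=true deploys Promtail as a DaemonSet.
# (Comments cannot follow a trailing backslash, so they live up here.)
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi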

Promtail (DaemonSet on every node) tails /var/log/pods/ and ships logs to Loki with Kubernetes labels automatically attached: namespace, pod, container, node.

LogQL: querying logs

LOGQL

# Error logs from the payment-service (the time range comes from the Grafana time picker, not the query)
{namespace="payments", app="payment-service"} |= "error" | json

# Error rate over time (metric query from logs)
sum(rate({namespace="payments"} |= "ERROR" [5m])) by (app)

# Parse JSON logs and filter by field
{namespace="payments"} 
  | json 
  | level="ERROR" 
  | traceId="abc123"

# Show slow requests (parsed from structured logs)
{namespace="payments"} 
  | json 
  | duration > 1000
  | line_format "{{.method}} {{.path}} {{.duration}}ms"
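
For readers new to LogQL pipelines, the last query above can be read as parse, filter, format. The Python sketch below mimics those three stages over a handful of invented log lines; it illustrates the semantics only and is not how Loki evaluates queries.

```python
import json


def slow_requests(lines, threshold_ms=1000):
    """Mimic the pipeline: | json | duration > 1000 | line_format ..."""
    formatted = []
    for raw in lines:
        try:
            fields = json.loads(raw)              # stage 1: | json
        except json.JSONDecodeError:
            continue                              # non-JSON lines never match the field filter
        if not isinstance(fields, dict):
            continue
        duration = fields.get("duration")
        if isinstance(duration, (int, float)) and duration > threshold_ms:   # stage 2: filter
            method = fields.get("method", "")
            path = fields.get("path", "")
            formatted.append(f"{method} {path} {duration}ms")                # stage 3: line_format
    return formatted


sample = [  # hypothetical log lines
    '{"method": "POST", "path": "/charge", "duration": 2100}',
    '{"method": "GET", "path": "/health", "duration": 3}',
    "plain text line that is not JSON",
]
print(slow_requests(sample))
```

Only the first sample line survives the filter, which is why consistent field names across services matter: a query like this silently skips lines that lack the field.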

Structured logging standard

Platform teams should mandate structured JSON logs:

JSON

{
  "timestamp": "2026-06-11T10:23:45Z",
  "level": "ERROR",
  "message": "Payment processing failed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "userId": "usr_abc123",
  "paymentId": "pay_xyz789",
  "error": "insufficient_funds",
  "durationMs": 45
}

Include traceId and spanId in every log line. This enables log-to-trace correlation in Grafana: click a log line → jump to the Tempo trace for that request.
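
One way to produce log lines in this shape from Python is a custom logging formatter. The sketch below uses only the standard library; the JsonFormatter class is illustrative, and it assumes the caller passes traceId and spanId via the extra argument, whereas a real service would more likely read them from the active OpenTelemetry span context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            # Populated from logger.error(..., extra={...}); None if absent
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"traceId": "4bf92f3577b34da6a3ce929d0e0e4736", "spanId": "00f067aa0ba902b7"},
)
```

Writing one JSON object per line to stdout or stderr is enough: the node-level log agent picks it up from the container log files, so the application needs no Loki-specific code.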

Tempo: Distributed Tracing

Tempo stores traces at low cost using object storage (S3/GCS). It integrates with Grafana for trace search and visualization.

Install

Bash

helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=trace-storage \
  --set tempo.storage.trace.s3.region=eu-west-1

OpenTelemetry Collector: the pipeline

The OpenTelemetry Collector acts as a central telemetry pipeline that receives traces, metrics, and logs from apps, processes them, and routes them to backends:

YAML

# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  mode: daemonset   # one collector per node (the Operator expects the lowercase value)
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      # Also receive Jaeger and Zipkin formats
      jaeger:
        protocols:
          grpc:
      zipkin:

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      # Add k8s metadata to every span
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.node.name

    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889   # scrape endpoint; only active once a metrics pipeline lists this exporter

    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger, zipkin]
          processors: [k8sattributes, batch]   # enrich spans first, then batch for export
          exporters: [otlp/tempo]

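Apps send traces to http://otel-collector.monitoring:4318 (HTTP) or otel-collector.monitoring:4317 (gRPC). These hostnames assume a Service named otel-collector; the OpenTelemetry Operator names the Service it generates <name>-collector, so check the actual Service name in your cluster and adjust the endpoints in the examples accordingly. The collector adds Kubernetes metadata automatically, so apps do not need to know their own pod name or namespace.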

Instrumenting apps

Python (FastAPI) with automatic instrumentation:

Bash

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install

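YAML

# No code changes needed: start the app under the opentelemetry-instrument
# wrapper and configure it through environment variables.
# In the Kubernetes Deployment:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"   # port 4318 speaks OTLP/HTTP; the Python default is gRPC
  - name: OTEL_SERVICE_NAME
    value: "payment-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"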

.NET auto-instrumentation via OpenTelemetry Operator:

YAML

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation
  namespace: payments
spec:
  exporter:
    endpoint: http://otel-collector.monitoring:4318
  propagators:
    - tracecontext
    - baggage
  dotnet:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['app']

Add one annotation to a Deployment and auto-instrumentation injects the OpenTelemetry SDK:

YAML

metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-dotnet: "true"
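
The tracecontext propagator configured above carries trace identity between services in the W3C traceparent HTTP header. As a rough illustration of what the injected SDK reads and writes, the sketch below parses such a header with the standard library; the function name parse_traceparent is hypothetical, and real services should leave this to the SDK.

```python
import re

# version - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace context fields from a traceparent header, or None if invalid."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # reserved version or all-zero IDs are invalid
    return {
        "traceId": trace_id,
        "spanId": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

The traceId and spanId recovered here are the same values that should appear in the service's structured log lines, which is what makes log-to-trace navigation possible.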

Exemplars: Linking Metrics to Traces

Exemplars are sample trace IDs attached to Prometheus metric observations. When you see a spike in a Grafana metric graph, click a data point → Grafana shows the trace IDs that were active at that moment → one click to the Tempo trace.
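
On the wire, an exemplar is extra data appended to a sample line in the OpenMetrics exposition format: a label set, the observed value, and optionally a timestamp. The sketch below renders one such line by hand to show the shape; the helper name exemplar_line and the numbers are made up, and client libraries generate this for you.

```python
def exemplar_line(metric, labels, value, trace_id, observed):
    """Render a sample line with an OpenMetrics exemplar attached.

    Hypothetical helper for illustration; timestamps are omitted.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f'{metric}{{{label_str}}} {value} # {{traceID="{trace_id}"}} {observed}'


line = exemplar_line(
    "http_request_duration_seconds_bucket",
    {"le": "0.5"},
    985,                                   # cumulative bucket count (made up)
    "4bf92f3577b34da6a3ce929d0e0e4736",    # trace that produced one observation
    0.43,                                  # that observation's value in seconds
)
print(line)
```

Everything after the # is the exemplar. The label name used there (traceID in this sketch) has to match the name configured in the Grafana data source for the link to Tempo to appear.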

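Go

// Go: record a histogram observation with an exemplar.
// histogram is assumed to be a *prometheus.HistogramVec; the Observer it
// returns must be asserted to prometheus.ExemplarObserver first.
// Exemplars are only exposed when the scrape uses the OpenMetrics format
// (promhttp.HandlerOpts{EnableOpenMetrics: true}).
histogram.With(prometheus.Labels{}).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration.Seconds(),
    prometheus.Labels{"traceID": traceID},
)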

Enable exemplar storage in Prometheus:

YAML

prometheusSpec:
  enableFeatures:
    - exemplar-storage

Enable in Grafana data source:

YAML

datasources:
  - name: Prometheus
    type: prometheus
    url: http://kube-prometheus-stack-prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo   # link to Tempo datasource

Alerting Strategy: Symptom-Based Alerts

Anti-pattern: alert on causes — "CPU > 80%", "disk > 70%", "pod restarted".

These generate noise. A pod restarting once is not an incident. CPU at 80% may be normal.

Pattern: alert on user-visible symptoms — error rate, latency, availability.

YAML

# Golden signals alerting — the four metrics that matter
rules:
  # 1. Error rate (anything that returns 5xx to users)
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
      sum(rate(http_requests_total[5m])) by (job)
      > 0.01

  # 2. Latency (p99 over SLO)
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      > 1.0

  # 3. Saturation (queue growing — indicates capacity issue)
  - alert: RequestQueueDepthHigh
    expr: http_request_queue_depth > 100

  # 4. Traffic drop (sudden drop may indicate failure, not just low traffic)
  - alert: TrafficDropped
    expr: |
      sum(rate(http_requests_total[5m])) by (job)
      < sum(rate(http_requests_total[5m] offset 1h)) by (job) * 0.5

Multi-burn-rate alerts for SLOs

YAML

# Alert when error budget is burning fast (page now) or slowly (ticket)
- alert: SLOErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payment-service",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="payment-service"}[1h]))
    ) > (14.4 * 0.01)   # 14.4x burn of a 1% error budget: ~2% of a 30-day budget per hour, exhausted in about 2 days
  labels:
    severity: critical   # page the on-call
    slo: payment-service

- alert: SLOErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payment-service",status=~"5.."}[6h]))
      / sum(rate(http_requests_total{job="payment-service"}[6h]))
    ) > (6 * 0.01)       # 6x burn of a 1% error budget: ~5% of a 30-day budget per 6h window, exhausted in 5 days
  labels:
    severity: warning    # create a ticket
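
The burn-rate multipliers are easier to reason about with the arithmetic written out. The sketch below assumes a 30-day SLO window, which the rules above do not state explicitly; the function names are illustrative.

```python
SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day SLO window


def budget_consumed(burn_rate, alert_window_hours):
    """Fraction of the total error budget spent during one alert window."""
    return burn_rate * alert_window_hours / SLO_WINDOW_HOURS


def days_to_exhaustion(burn_rate):
    """Days until the whole budget is gone if the burn rate is sustained."""
    return SLO_WINDOW_HOURS / 24 / burn_rate


for name, rate, window in [("fast burn", 14.4, 1), ("slow burn", 6, 6)]:
    print(
        f"{name}: {budget_consumed(rate, window):.0%} of budget per {window}h window, "
        f"exhausted in {days_to_exhaustion(rate):.1f} days"
    )
```

Under that assumption the fast-burn threshold corresponds to spending about 2% of the budget in one hour (gone in roughly two days), and the slow-burn threshold to about 5% in six hours (gone in five days), which is why the first pages and the second only opens a ticket.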

Alertmanager Routing

YAML

# alertmanager-config.yaml
route:
  group_by: [alertname, team, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-receiver
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
      continue: false

    - matchers:
        - team="payments"
      receiver: payments-slack

    - matchers:
        - severity="warning"
      receiver: slack-warnings

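# The ${...} values are placeholders: Alertmanager does not expand environment
# variables, so substitute them at deploy time or use routing_key_file / api_url_file.
receivers:
  - name: default-receiver   # catch-all for alerts that match no child route

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: payments-slack
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_PAYMENTS}"
        channel: "#payments-alerts"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\nRunbook: {{ .Annotations.runbook_url }}{{ end }}"

  - name: slack-warnings     # channel and webhook names are placeholders
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_WARNINGS}"
        channel: "#alerts-warnings"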

Route by team label — every team's PrometheusRule sets a team label. Alertmanager routes alerts to the right Slack channel automatically. Platform team manages the routing config; teams own their alert rules.
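
Route order matters, because a matching child route stops evaluation unless continue is true. The sketch below is a deliberately simplified model of that first-match behaviour for the three routes above (equality matchers only, no nesting or grouping), written to show where a given alert ends up.

```python
ROUTES = [  # mirrors the order of the routes in the config above
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": False},
    {"match": {"team": "payments"}, "receiver": "payments-slack", "continue": False},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings", "continue": False},
]
DEFAULT_RECEIVER = "default-receiver"


def receivers_for(labels):
    """Return the receivers an alert with these labels would reach."""
    matched = []
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # first match wins unless continue is set
    return matched or [DEFAULT_RECEIVER]


print(receivers_for({"severity": "critical", "team": "payments"}))
print(receivers_for({"severity": "warning", "team": "payments"}))
print(receivers_for({"severity": "warning", "team": "search"}))
```

In this model a critical payments alert pages the on-call but never reaches the payments Slack channel. If the team should see it in both places, setting continue: true on the critical route is one option.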

Platform Observability Self-Check

Monitor the observability platform itself:

YAML

# Alert if Prometheus has scrape failures
- alert: PrometheusScrapeFailed
  expr: up == 0
  for: 5m
  annotations:
    summary: "Prometheus cannot scrape {{ $labels.job }} in {{ $labels.namespace }}"

# Alert if Loki is dropping logs
- alert: LokiDroppedLogs
  # loki_discarded_samples_total counts lines rejected by limits or validation
  expr: sum(rate(loki_discarded_samples_total[5m])) by (reason) > 0

# Alert if OTel Collector has export failures
- alert: OTelCollectorExportFailed
  expr: sum(rate(otelcol_exporter_send_failed_spans_total[5m])) > 0

Grafana Explore: Correlated Debugging

The full correlation flow in Grafana Explore:

1. Alert fires: PaymentServiceHighErrorRate
2. Open Grafana Explore → Prometheus → query: error rate spike at 14:32
3. Click data point → Exemplar → jump to Tempo trace (TraceID: abc123)
4. Trace shows: payment-service → inventory-service (timeout at 2100ms)
5. Switch to Loki → filter: {namespace="inventory"} | json | traceId="abc123"
6. Log shows: "database connection pool exhausted (connections: 10/10)"
7. Root cause found in < 5 minutes

This is the end state the observability platform should make possible: from alert to root cause without switching tools, without grep-ing raw logs, without asking another team.
