
AKS in Production — Azure Kubernetes Service Architecture Guide

Design and operate production AKS clusters — node pools, networking (CNI vs kubenet), workload identity, scaling (HPA, KEDA, Cluster Autoscaler), blue-green deployments, monitoring, and cost control.

SystemForge · April 18, 2026 · 9 min read
Azure · AKS · Kubernetes · Production · KEDA · Workload Identity · Bicep · Architecture

AKS (Azure Kubernetes Service) is Azure's managed Kubernetes offering. Azure handles the control plane (API server, etcd, scheduler) — you manage the worker nodes, networking, and workload configuration. This guide covers the architecture decisions that determine whether an AKS deployment handles production load gracefully or becomes an operational burden.


Cluster Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                   AKS Cluster                                 │
│                                                              │
│  Control Plane (Azure-managed, free)                         │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐               │
│  │ API Server │ │    etcd    │ │  Scheduler │               │
│  └────────────┘ └────────────┘ └────────────┘               │
│                                                              │
│  System Node Pool         User Node Pools                    │
│  ┌──────────────────┐    ┌────────────────┐  ┌───────────┐  │
│  │ D4s_v5 × 3       │    │ D8s_v5 × 5     │  │ GPU Pool  │  │
│  │ (system pods:    │    │ (order-service │  │ (ML work) │  │
│  │  coredns, csi,   │    │  api-gw, etc.) │  │           │  │
│  │  metrics-server) │    └────────────────┘  └───────────┘  │
│  └──────────────────┘                                        │
└──────────────────────────────────────────────────────────────┘

System vs User Node Pools

Always separate system workloads from application workloads:

BICEP
// System node pool — runs critical cluster components
resource aksCluster 'Microsoft.ContainerService/managedClusters@2023-07-01' = {
  properties: {
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'             // hosts system pods; the taint below keeps app pods off
        vmSize: 'Standard_D4s_v5'
        count: 3                   // min 3 for HA
        availabilityZones: ['1', '2', '3']   // zone-redundant
        osType: 'Linux'
        osDiskType: 'Ephemeral'    // faster, included in VM cost
        nodeTaints: ['CriticalAddonsOnly=true:NoSchedule']  // prevent app pods
      }
      {
        name: 'appgeneral'
        mode: 'User'
        vmSize: 'Standard_D8s_v5'
        count: 3
        minCount: 2
        maxCount: 20
        enableAutoScaling: true
        availabilityZones: ['1', '2', '3']
      }
    ]
  }
}

Why separate pools:

  • System components cannot be evicted by resource pressure from app pods
  • Different node sizes per workload type (CPU-optimised, memory-optimised, GPU)
  • Independent scaling without disturbing system stability
  • Spot node pools for batch workloads (60-90% cost reduction)
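
The taint handles one direction; to pin application workloads to a specific user pool explicitly, a nodeSelector on the AKS-managed pool label works. A minimal sketch, assuming the appgeneral pool defined above (the image name is hypothetical):

YAML
# Pin a deployment to the 'appgeneral' user pool via the label AKS applies per pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: appgeneral   # AKS stamps every node with its pool name
      containers:
        - name: order-service
          image: myregistry.azurecr.io/order-service:1.0.0   # hypothetical image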

Networking: Azure CNI vs Kubenet

Kubenet (default — do not use for production)

Pod IP: 10.244.x.x (not directly routable in VNet)
Traffic: Pod → NAT → Node IP → destination

Problems:
- Pods not directly routable in VNet (breaks Private Endpoints on some paths)
- Complex UDR maintenance for pod routing
- Limited to 400 nodes per cluster

Azure CNI (use this)

Each pod gets a VNet IP directly:

Pod IP: 10.1.1.x (real VNet IP, directly routable)
Traffic: Pod → destination (no NAT)

Benefits:
- Pods can be targeted directly from VNet resources
- Private Endpoints work seamlessly
- Network Policies (Calico) work at pod level
- No UDR management
BICEP
networkProfile: {
  networkPlugin: 'azure'           // Azure CNI
  networkPolicy: 'calico'         // Calico network policies
  serviceCidr: '172.16.0.0/16'   // must not overlap VNet
  dnsServiceIP: '172.16.0.10'
}

Azure CNI Overlay (newer): pods draw their IPs from a separate, private overlay CIDR instead of the VNet itself, which solves the IP-exhaustion problem of classic CNI while nodes stay VNet-routable. Use it for large clusters.
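
A minimal sketch of the overlay settings, assuming a pod CIDR that overlaps neither the VNet nor the service range:

BICEP
networkProfile: {
  networkPlugin: 'azure'
  networkPluginMode: 'overlay'   // pods get IPs from podCidr, not the VNet
  podCidr: '192.168.0.0/16'      // assumed range; must not overlap VNet or serviceCidr
  serviceCidr: '172.16.0.0/16'
  dnsServiceIP: '172.16.0.10'
}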

Network Policy: Zero-Trust Between Pods

By default, all pods can talk to all other pods. Calico Network Policies implement pod-level micro-segmentation:

YAML
# Allow order-service to call inventory-service only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inventory-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: inventory-service
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-service
      ports:
        - port: 8080
---
# Default deny all ingress and egress in namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Start with default-deny per namespace and explicitly allow required paths; this is zero-trust at the pod level. One catch: a default Egress deny also blocks DNS, so resolution must be allowed explicitly (see the sketch below).
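
A minimal DNS-allow policy, assuming CoreDNS runs in kube-system with the standard k8s-app: kube-dns label:

YAML
# Allow DNS egress to CoreDNS (required once default-deny includes Egress)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP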


Workload Identity: No Secrets in Pods

AKS Workload Identity replaces the older pod-identity addon. It federates Kubernetes service accounts with Entra ID — pods authenticate as an Entra ID managed identity without credentials.

Pod                    → OIDC token from AKS OIDC issuer
                       → Entra ID validates token
                       → Returns Azure access token
                       → Pod calls Key Vault / Service Bus / Storage
                       → No credentials anywhere
BICEP
// Enable OIDC issuer and Workload Identity on cluster
properties: {
  oidcIssuerProfile: { enabled: true }
  securityProfile: {
    workloadIdentity: { enabled: true }
  }
}
Bash
# Create Managed Identity
az identity create -n order-service-identity -g rg-prod

# Federate it with the Kubernetes service account
az identity federated-credential create \
  --identity-name order-service-identity \
  --name order-service-k8s \
  --issuer "$(az aks show -n aks-prod -g rg-prod --query oidcIssuerProfile.issuerUrl -o tsv)" \
  --subject "system:serviceaccount:production:order-service-sa"
YAML
# Annotated service account + the pod spec that uses it
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service-sa
  namespace: production
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"   # inject token
    spec:
      serviceAccountName: order-service-sa
      containers:
        - name: order-service
          # DefaultAzureCredential picks up the federated token automatically
          # No env vars, no secrets, no mounted credentials
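
To verify the wiring, the workload identity webhook mutates matching pods with the identity environment variables; a quick check (pod name is a placeholder):

Bash
# The webhook injects these env vars into pods labelled azure.workload.identity/use
kubectl -n production describe pod <order-service-pod> | grep AZURE_
# Expect: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE, AZURE_AUTHORITY_HOST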

Scaling Architecture

HPA (Horizontal Pod Autoscaler)

Scale pods based on CPU/memory:

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when avg CPU > 60%
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 512Mi
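
Worth adding: by default the HPA can flap when load oscillates around the threshold. The autoscaling/v2 behavior field dampens this; a sketch with conservative scale-down settings (values are starting points, not prescriptions):

YAML
# Goes under the HPA spec above: slow, bounded scale-in; immediate scale-out
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # consider 5 min of history before scaling in
    policies:
      - type: Percent
        value: 50                     # remove at most 50% of replicas per period
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load immediately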

KEDA (Event-Driven Autoscaling)

KEDA scales based on external event sources — Service Bus queue depth, Kafka lag, HTTP request rate, etc. Scale to zero when idle.

YAML
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0        # scale to zero
  maxReplicaCount: 30
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
    - type: azure-servicebus
      metadata:
        namespace: mycompany-servicebus
        queueName: order-processing
        messageCount: "5"    # 1 pod per 5 messages
      authenticationRef:
        name: keda-servicebus-auth

KEDA scale-to-zero — 0 pods when queue is empty, N pods proportional to queue depth. This is the optimal pattern for async processing: zero cost at idle, maximum throughput at peak.
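
The authenticationRef above points at a TriggerAuthentication resource. With Workload Identity enabled on the cluster (see the earlier section), KEDA can reach Service Bus without a connection string; a sketch, assuming KEDA is installed with workload identity support:

YAML
# TriggerAuthentication referenced by the ScaledObject above
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-servicebus-auth
spec:
  podIdentity:
    provider: azure-workload   # authenticate via the federated managed identity, no secrets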

Cluster Autoscaler

Adds and removes nodes automatically based on pod scheduling pressure:

BICEP
agentPoolProfiles: [{
  enableAutoScaling: true
  minCount: 2
  maxCount: 20
  // Cluster Autoscaler adds a node when pods are Pending (unschedulable)
  // Removes a node when utilisation < 50% for 10 minutes
}]

Scale-out flow:

  1. New pods cannot be scheduled (resource pressure or zone balance)
  2. Cluster Autoscaler detects pending pods
  3. New node is provisioned (typically 2–4 minutes for AKS)
  4. Pods are scheduled on new node

Scale-in flow:

  1. Node utilisation below threshold for scale-down-delay (default 10 min)
  2. Pods safely drained and rescheduled
  3. Node removed
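
These timings can be tuned per cluster through the cluster-autoscaler profile; a hedged example that shortens the scan interval and scale-down delay (cluster and resource-group names reused from earlier, values illustrative):

Bash
# Tune Cluster Autoscaler timings (defaults are usually fine; adjust with data)
az aks update -g rg-prod -n aks-prod \
  --cluster-autoscaler-profile scan-interval=10s scale-down-unneeded-time=5m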

For workloads needing faster scale-out, use node pool burst to virtual nodes (ACI) — pods overflow to Azure Container Instances in seconds.


Blue-Green and Canary Deployments

Blue-Green via Deployment Slots (NGINX Ingress)

YAML
# Two deployments; only one receives traffic at a time
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-service-ingress
  annotations:
    nginx.ingress.kubernetes.io/canary: "false"  # blue = active
spec:
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /api/orders
            backend:
              service:
                name: order-service-blue   # switch to green after validation
                port: { number: 80 }
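
The cut-over from blue to green is then a single backend swap. One way to script it, assuming the ingress lives in the production namespace:

Bash
# Flip the active backend from blue to green after validating green
kubectl -n production patch ingress order-service-ingress --type=json \
  -p '[{"op":"replace","path":"/spec/rules/0/http/paths/0/backend/service/name","value":"order-service-green"}]'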

Canary via NGINX Weight

YAML
# Green deployment receives 10% of traffic
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-service-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% to canary
spec:
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /api/orders
            backend:
              service:
                name: order-service-green
                port: { number: 80 }

Monitor error rate and P99 latency on canary traffic. Ramp weight to 50% → 100% if healthy. Roll back by removing the canary ingress.


AKS Monitoring

Container Insights

BICEP
addonProfiles: {
  omsagent: {
    enabled: true
    config: {
      logAnalyticsWorkspaceResourceID: logAnalyticsWorkspace.id
    }
  }
}

Container Insights collects node metrics, pod metrics, container logs, and health data into Log Analytics.

KQL
// KQL: Pods with high restart count (crash-looping)
KubePodInventory
| where TimeGenerated > ago(1h)
| where Namespace == 'production'
| where ContainerRestartCount > 5
| project Name, ContainerName, ContainerRestartCount, PodStatus
| order by ContainerRestartCount desc
KQL
// KQL: Node CPU pressure
Perf
| where TimeGenerated > ago(30m)
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by Computer
| order by AvgCPU desc
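
A third query worth keeping handy: pods stuck in Pending, which usually means the Cluster Autoscaler is mid scale-out or a pool has hit maxCount.

KQL
// KQL: Pods stuck in Pending (scheduling pressure)
KubePodInventory
| where TimeGenerated > ago(30m)
| where PodStatus == 'Pending'
| summarize PendingPods = dcount(Name) by Namespace
| order by PendingPods desc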

Pod Disruption Budgets: Preventing Scale-Down Outages

Without a PodDisruptionBudget (PDB), Cluster Autoscaler can drain a node and take down all replicas of a deployment simultaneously:

YAML
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 2     # always keep at least 2 pods running during disruption
  selector:
    matchLabels:
      app: order-service

Always define PDBs for production deployments; they protect replicas during node drains, cluster upgrades, and Cluster Autoscaler scale-in. With 3 replicas and minAvailable: 2, at most one pod can be disrupted at a time.


Cost Control in AKS

Spot Node Pools for Batch Workloads

Spot VMs offer 60–90% cost reduction in exchange for possible eviction (Azure gives roughly a 30-second eviction notice via Scheduled Events):

BICEP
{
  name: 'spot'
  mode: 'User'
  vmSize: 'Standard_D8s_v5'
  scaleSetPriority: 'Spot'
  spotMaxPrice: -1      // -1 = up to on-demand price
  nodeTaints: ['kubernetes.azure.com/scalesetpriority=spot:NoSchedule']
  nodeLabels: { 'kubernetes.azure.com/scalesetpriority': 'spot' }
}
YAML
# Schedule batch jobs on spot pool only
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
nodeSelector:
  kubernetes.azure.com/scalesetpriority: spot

Resource Requests and Limits

Kubernetes schedules based on requests and throttles at limits. Getting these right prevents both over-provisioning and throttling:

YAML
resources:
  requests:
    cpu: "250m"      # what scheduler reserves on the node
    memory: "256Mi"
  limits:
    cpu: "1000m"     # max allowed (throttled if exceeded)
    memory: "512Mi"  # OOMKilled if exceeded

Common mistake: setting limits far above requests. The scheduler packs pods onto nodes by their small requests, so nodes look under-utilised and the Cluster Autoscaler sees no pressure; then every pod bursts toward its limit at once and gets CPU-throttled (or OOM-killed on memory).

Rule: start with requests = 70% of observed p95 usage. Set limits = 2× requests. Tune with actual metrics.
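
A worked instance of that rule, assuming a service observed at roughly 350m CPU and 350Mi memory at p95 (numbers illustrative):

YAML
# requests ≈ 0.7 × p95 (0.7 × 350m ≈ 245m, rounded to 250m); limits = 2 × requests
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"   # 0.7 × 350Mi ≈ 245Mi, rounded up
  limits:
    cpu: "500m"       # 2 × 250m
    memory: "512Mi"   # 2 × 256Mi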


Production Checklist

  • [ ] System and user node pools separated (system taint applied)
  • [ ] Azure CNI (not kubenet) with Calico network policy
  • [ ] Workload Identity enabled (no secrets in pods)
  • [ ] Zone-redundant node pools (AZs: 1, 2, 3)
  • [ ] HPA configured for all stateless services
  • [ ] KEDA for event-driven/async processors
  • [ ] Cluster Autoscaler with min 2 nodes (never scale to 0 user nodes)
  • [ ] PodDisruptionBudgets on all production deployments
  • [ ] Resource requests and limits set on every container
  • [ ] Container Insights enabled → Log Analytics
  • [ ] Spot node pool for batch/non-critical workloads
  • [ ] Private cluster (API server not public) for regulated workloads
  • [ ] Azure Policy for Kubernetes (Gatekeeper) for policy enforcement

Related: Azure Hub-Spoke Networking — VNet design, Private Endpoints
Related: Azure Well-Architected Framework — reliability, scaling patterns
Related: Azure Cloud Integration — KEDA with Service Bus
