AI Systems · Intermediate

Scale to Zero with Azure Container Apps

Configure Azure Container Apps to automatically scale your LLM service based on HTTP traffic, KEDA rules, and custom metrics — including scaling to zero replicas when idle.

Asma Hafeez Khan · May 15, 2026 · 4 min read
LLMOps · Azure Container Apps · Scaling · KEDA · Azure

Why Scale to Zero Matters for LLM Services

A standard VM or AKS cluster costs money 24/7 whether it serves requests or not. Azure Container Apps uses KEDA (Kubernetes Event-Driven Autoscaling) to scale replicas from 0 to N based on demand.

For LLM services this is powerful: a dev or staging environment that gets no traffic at night scales to zero — zero cost, zero waste.

But there's a catch: scaling from 0 to 1 incurs a cold start of roughly 15–30 seconds (pull the image, start Python, load dependencies). For user-facing APIs that latency is usually unacceptable, so the pragmatic split is: minimum 1 replica in production, scale to zero only in dev/staging.


Azure Container Apps Scaling Architecture

HTTP traffic arrives
        │
        ▼
KEDA HTTP scaler detects requests in queue
        │
        ▼
Azure Container Apps adds a replica (0→1, 1→2, ...)
        │
        ▼
Container starts, health check passes
        │
        ▼
Traffic routed to new replica

Basic HTTP-Based Scaling

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 10

--scale-rule-http-concurrency 10: target at most 10 concurrent requests per replica. KEDA adds replicas until average concurrency per replica drops back to 10 or below.

Resulting behaviour:

  • 1 replica handles 0–10 concurrent requests
  • At 11 concurrent requests: scales to 2 replicas
  • At 21 concurrent requests: scales to 3 replicas
  • When idle: scales back down to 1 (not 0, because min-replicas=1)

Scale to Zero (Dev/Staging)

Bash
# min-replicas 0 allows scaling to zero (a trailing comment after a backslash would break the command)
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 5 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 5

When no traffic arrives for 5 minutes, the app scales to 0. The next request triggers a cold start.
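
You can measure the penalty yourself: let the staging app idle down to zero, then time the first request. A quick sketch (the hostname is a placeholder for your app's real FQDN):

Bash
# First request after idling to zero: includes the 0→1 cold start
time curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  https://pharmabot-staging.example.com/health

# Repeat immediately: a replica is now warm, so this should return in milliseconds
time curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  https://pharmabot-staging.example.com/health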

Surviving the cold start: give the container a generous startup probe so the platform doesn't mark a slow-starting replica as failed. Probes are defined on the container in containerapp.yaml (not through the ingress commands), then applied with:

Bash
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

In containerapp.yaml:

YAML
probes:
  - type: startup
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 12  # 12 failures × 5s period = 60s of retries after the initial delay
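
After deploying, it's worth reading the probe back to confirm it landed. A sketch assuming a single container in the template:

Bash
az containerapp show \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --query "properties.template.containers[0].probes" \
  -o yaml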

Custom Scaling: Scale on Queue Depth

If your LLM service processes jobs from an Azure Service Bus queue:

Bash
az containerapp update \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 20 \
  --scale-rule-name queue-rule \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata queueName=pharmabot-jobs \
                        messageCount=5 \
                        namespace=pharmabot-servicebus \
  --scale-rule-auth connection=servicebus-conn

messageCount=5: target five messages per replica. KEDA adds a replica for every 5 messages waiting in the queue, so 100 messages → 20 replicas (the max-replicas cap). Note that KEDA's azure-servicebus scaler uses messageCount; queueLength belongs to the Azure Storage Queue scaler.
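
The servicebus-conn secret referenced by the scale rule has to exist on the app first. A minimal sketch, assuming the pharmabot-servicebus namespace and its default RootManageSharedAccessKey rule:

Bash
# Fetch the Service Bus connection string (namespace and rule names are illustrative)
CONN=$(az servicebus namespace authorization-rule keys list \
  --resource-group pharmabot-rg \
  --namespace-name pharmabot-servicebus \
  --name RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)

# Store it as a Container Apps secret for the scale rule to reference
az containerapp secret set \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --secrets servicebus-conn="$CONN"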


Scaling in a Container App YAML

You can also define scaling in a containerapp.yaml file (better for GitOps):

YAML
properties:
  template:
    scale:
      minReplicas: 1
      maxReplicas: 10
      rules:
        - name: http-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: cpu-rule
          custom:
            type: cpu
            metadata:
              type: Utilization
              value: "70"  # Scale up if CPU over 70%

Deploy:

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

Understanding Scale-Down Cooldown

By default, Container Apps waits about 5 minutes of idle time before scaling down. Tune this to avoid flapping:

YAML
scale:
  minReplicas: 1
  maxReplicas: 10
  cooldownPeriod: 300  # 5 minutes before scaling down
  pollingInterval: 30  # check every 30 seconds

LLM-specific consideration: LLM requests can be slow (30s+), and a scale-in during a long inference request can kill it mid-stream once the replica's termination grace period runs out. A generous cooldown makes scale-in less aggressive, so replicas aren't torn down while responses are still streaming.


Cost Impact of Scale to Zero

Example: pharmabot staging environment, 2 vCPU / 4GB RAM

| Config | Monthly cost |
|---|---|
| Always 1 replica running | ~$45/month |
| Scale to zero, 2h/day active | ~$3/month |
| Scale to zero, 8h/day active | ~$11/month |

For 5 staging environments across a team, scale-to-zero saves ~$200/month.


Monitoring Scaling Events

KUSTO
// View scaling events in Log Analytics
ContainerAppSystemLogs_CL
| where Reason_s in ("ScalingReplicaAdded", "ScalingReplicaRemoved")
| where TimeGenerated > ago(24h)
| project TimeGenerated, Reason_s, Count_d, ContainerAppName_s
| order by TimeGenerated desc
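
For a quick look without opening Log Analytics, the CLI can tail the same system-level events (scaling, provisioning) directly:

Bash
az containerapp logs show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --type system \
  --follow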

Checkpoint

Check your current scaling configuration:

Bash
az containerapp show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "properties.template.scale" \
  -o yaml

Expected output for production:

YAML
minReplicas: 1
maxReplicas: 10
rules:
  - name: http-rule
    http:
      metadata:
        concurrentRequests: '10'

Then run a load test to verify it scales:

Bash
# Load test with hey (https://github.com/rakyll/hey); flags must come before the URL
hey -n 1000 -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"message":"test"}' \
  http://pharmabot.example.com/api/chat

Watch the replica count increase in the Azure Portal while the load test runs.
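
If you prefer the terminal to the Portal, polling the replica list works just as well while hey runs:

Bash
# Poll the live replica count every 5 seconds during the load test
watch -n 5 "az containerapp replica list --name pharmabot --resource-group pharmabot-rg -o table"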

Enjoyed this article?

Explore the AI Systems learning path for more.
