AI Systems · Intermediate

Scale to Zero with Azure Container Apps

Configure Azure Container Apps to automatically scale your LLM service based on HTTP traffic, KEDA rules, and custom metrics — including scaling to zero replicas when idle.

Asma Hafeez Khan · May 15, 2026 · 4 min read
LLMOps · Azure Container Apps · Scaling · KEDA · Azure

Why Scale to Zero Matters for LLM Services

A standard VM or AKS cluster costs money 24/7 whether it serves requests or not. Azure Container Apps uses KEDA (Kubernetes Event-Driven Autoscaling) to scale replicas from 0 to N based on demand.

For LLM services this is powerful: a dev or staging environment that gets no traffic at night scales to zero — zero cost, zero waste.

But there's a catch: scaling from 0 to 1 incurs a cold start of roughly 15–30 seconds (pull the image, start Python, load dependencies). For user-facing APIs that latency is usually unacceptable, so the pragmatic split is: minimum 1 replica in production, scale to zero only in dev/staging.


Azure Container Apps Scaling Architecture

HTTP traffic arrives
        │
        ▼
KEDA HTTP scaler detects requests in queue
        │
        ▼
Azure Container Apps adds a replica (0→1, 1→2, ...)
        │
        ▼
Container starts, health check passes
        │
        ▼
Traffic routed to new replica

Basic HTTP-Based Scaling

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 10

--scale-rule-http-concurrency 10: target at most 10 concurrent requests per replica. KEDA adds replicas until average concurrency per replica drops back to 10 or below.

Resulting behaviour:

  • 1 replica handles 0–10 concurrent requests
  • At 11 concurrent requests: scales to 2 replicas
  • At 21 concurrent requests: scales to 3 replicas
  • When idle: scales back down to 1 (not 0, because min-replicas=1)

Scale to Zero (Dev/Staging)

Bash
# min-replicas 0 allows scaling to zero (a trailing comment after a backslash would break the command)
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 5 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 5

When no traffic arrives for 5 minutes, the app scales to 0. The next request triggers a cold start.
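
You can measure the penalty yourself: let the staging app idle down to zero, then time the first request. A quick sketch (the hostname is a placeholder for your app's real FQDN):

Bash
# First request after idling to zero: includes the 0→1 cold start
time curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  https://pharmabot-staging.example.com/health

# Repeat immediately: a replica is now warm, so this should return in milliseconds
time curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  https://pharmabot-staging.example.com/health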

Surviving the cold start: give the container a generous startup probe so the platform doesn't mark a slow-starting replica as failed. Probes are defined on the container in containerapp.yaml (not through the ingress commands), then applied with:

Bash
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

In containerapp.yaml:

YAML
probes:
  - type: startup
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 12  # 12 failures × 5s period = 60s of retries after the initial delay
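
After deploying, it's worth reading the probe back to confirm it landed. A sketch assuming a single container in the template:

Bash
az containerapp show \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --query "properties.template.containers[0].probes" \
  -o yaml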

Custom Scaling: Scale on Queue Depth

If your LLM service processes jobs from an Azure Service Bus queue:

Bash
az containerapp update \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 20 \
  --scale-rule-name queue-rule \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata queueName=pharmabot-jobs \
                        messageCount=5 \
                        namespace=pharmabot-servicebus \
  --scale-rule-auth connection=servicebus-conn

messageCount=5: target five messages per replica. KEDA adds a replica for every 5 messages waiting in the queue, so 100 messages → 20 replicas (the max-replicas cap). Note that KEDA's azure-servicebus scaler uses messageCount; queueLength belongs to the Azure Storage Queue scaler.
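
The servicebus-conn secret referenced by the scale rule has to exist on the app first. A minimal sketch, assuming the pharmabot-servicebus namespace and its default RootManageSharedAccessKey rule:

Bash
# Fetch the Service Bus connection string (namespace and rule names are illustrative)
CONN=$(az servicebus namespace authorization-rule keys list \
  --resource-group pharmabot-rg \
  --namespace-name pharmabot-servicebus \
  --name RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)

# Store it as a Container Apps secret for the scale rule to reference
az containerapp secret set \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --secrets servicebus-conn="$CONN"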


Scaling in a Container App YAML

You can also define scaling in a containerapp.yaml file (better for GitOps):

YAML
properties:
  template:
    scale:
      minReplicas: 1
      maxReplicas: 10
      rules:
        - name: http-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: cpu-rule
          custom:
            type: cpu
            metadata:
              type: Utilization
              value: "70"  # Scale up if CPU over 70%

Deploy:

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

Understanding Scale-Down Cooldown

By default, Container Apps waits about 5 minutes of idle time before scaling down. Tune this to avoid flapping:

YAML
scale:
  minReplicas: 1
  maxReplicas: 10
  cooldownPeriod: 300  # 5 minutes before scaling down
  pollingInterval: 30  # check every 30 seconds

LLM-specific consideration: LLM requests can be slow (30s+), and a scale-in during a long inference request can kill it mid-stream once the replica's termination grace period runs out. A generous cooldown makes scale-in less aggressive, so replicas aren't torn down while responses are still streaming.


Cost Impact of Scale to Zero

Example: pharmabot staging environment, 2 vCPU / 4GB RAM

| Config | Monthly cost |
|---|---|
| Always 1 replica running | ~$45/month |
| Scale to zero, 2h/day active | ~$3/month |
| Scale to zero, 8h/day active | ~$11/month |

For 5 staging environments across a team, scale-to-zero saves ~$200/month.


Monitoring Scaling Events

KUSTO
// View scaling events in Log Analytics
ContainerAppSystemLogs_CL
| where Reason_s in ("ScalingReplicaAdded", "ScalingReplicaRemoved")
| where TimeGenerated > ago(24h)
| project TimeGenerated, Reason_s, Count_d, ContainerAppName_s
| order by TimeGenerated desc
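
For a quick look without opening Log Analytics, the CLI can tail the same system-level events (scaling, provisioning) directly:

Bash
az containerapp logs show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --type system \
  --follow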

Checkpoint

Check your current scaling configuration:

Bash
az containerapp show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "properties.template.scale" \
  -o yaml

Expected output for production:

YAML
minReplicas: 1
maxReplicas: 10
rules:
  - name: http-rule
    http:
      metadata:
        concurrentRequests: '10'

Then run a load test to verify it scales:

Bash
# Load test with hey (https://github.com/rakyll/hey); flags must come before the URL
hey -n 1000 -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"message":"test"}' \
  http://pharmabot.example.com/api/chat

Watch the replica count increase in the Azure Portal while the load test runs.
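
If you prefer the terminal to the Portal, polling the replica list works just as well while hey runs:

Bash
# Poll the live replica count every 5 seconds during the load test
watch -n 5 "az containerapp replica list --name pharmabot --resource-group pharmabot-rg -o table"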

Enjoyed this article?

Explore the AI Systems learning path for more.
