Scale to Zero with Azure Container Apps

Why Scale to Zero Matters for LLM Services

A standard VM or AKS cluster costs money 24/7 whether it serves requests or not. Azure Container Apps uses KEDA (Kubernetes Event-Driven Autoscaling) to scale replicas from 0 to N based on demand.

For LLM services this is powerful: a dev or staging environment that gets no traffic at night scales to zero — zero cost, zero waste.

But there's a catch: scaling from 0 to 1 takes 15–30 seconds (cold start: pull the image, start Python, load dependencies). That delay is unacceptable for user-facing APIs, so keep a minimum of 1 replica in production and allow zero only in dev/staging.


Azure Container Apps Scaling Architecture

HTTP traffic arrives
        │
        ▼
KEDA HTTP scaler detects requests in queue
        │
        ▼
Azure Container Apps adds a replica (0→1, 1→2, ...)
        │
        ▼
Container starts, health check passes
        │
        ▼
Traffic routed to new replica

Basic HTTP-Based Scaling

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 10

--scale-rule-http-concurrency 10: Add a replica when more than 10 concurrent requests are being handled by a single replica.

Resulting behaviour:

  • 1 replica handles 0–10 concurrent requests
  • At 11 concurrent requests: scales to 2 replicas
  • At 21 concurrent requests: scales to 3 replicas
  • When idle: scales back down to 1 (not 0, because min-replicas=1)
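
The replica counts listed above follow a simple proportional rule. The sketch below reproduces that arithmetic; `desired_replicas` is a hypothetical helper written for illustration, not part of any Azure SDK. It assumes the scaler divides concurrent requests by the per-replica target, rounds up, and clamps to the replica bounds, ignoring polling delays and stabilization windows.

```python
import math


def desired_replicas(concurrent_requests, target_per_replica, min_replicas, max_replicas):
    """Approximate replica count for an HTTP concurrency rule (illustrative only)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # One replica per `target_per_replica` concurrent requests, rounded up.
    wanted = math.ceil(concurrent_requests / target_per_replica)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Production settings from the command above: concurrency 10, 1 to 10 replicas.
for load in (0, 10, 11, 21, 500):
    print(load, "->", desired_replicas(load, 10, min_replicas=1, max_replicas=10))
```

With these settings, 11 concurrent requests map to 2 replicas, 21 map to 3, and any load above 100 is capped at the 10-replica maximum.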

Scale to Zero (Dev/Staging)

Bash
# --min-replicas 0 allows scaling to zero
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 5 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 5

When no traffic arrives for 5 minutes, the app scales to 0. The next request triggers a cold start.

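Giving a cold start time to finish:

Bash
# Probes are part of the container spec, so apply them from a YAML file
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

In containerapp.yaml, add a startup probe with a generous timeout to the container definition: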

YAML
probes:
  - type: startup
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 12  # 12 failed checks x 5 s = 60 s of probing after the initial delay
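
The probe above expects the container to answer `GET /health` on port 8000. As a minimal sketch of what that endpoint could look like, the example below uses only the Python standard library. The `ready` flag, `HealthHandler`, and `make_health_server` are hypothetical names chosen for illustration; a real service would more likely expose the same route from its web framework.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: set it once slow startup work
# (loading dependencies, warming clients) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 200 once startup work is done, 503 while the app is still starting.
        status = 200 if ready.is_set() else 503
        body = json.dumps({"ready": ready.is_set()}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the request log


def make_health_server(host="0.0.0.0", port=8000):
    """Build (but do not start) an HTTP server that answers /health."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_health_server().serve_forever()` starts the listener; the startup probe keeps failing with 503 until the application calls `ready.set()`.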

Custom Scaling: Scale on Queue Depth

If your LLM service processes jobs from an Azure Service Bus queue:

Bash
az containerapp update \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 20 \
  --scale-rule-name queue-rule \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata queueName=pharmabot-jobs \
                         messageCount=5 \
                         namespace=pharmabot-servicebus \
  --scale-rule-auth connection=servicebus-conn

messageCount=5: Target 5 messages per replica, so a replica is added for every 5 messages in the queue. 100 messages → 20 replicas, which is also the max-replicas cap. The --scale-rule-auth value maps the scaler's connection parameter to the app secret servicebus-conn, which holds the Service Bus connection string.
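
To see how a backlog translates into worker replicas, the sketch below applies the per-replica target of 5 messages and the 20-replica cap from the command above. `replicas_for_backlog` is a hypothetical helper for illustration only; it assumes a plain round-up division and ignores polling intervals.

```python
import math


def replicas_for_backlog(queued_messages, messages_per_replica=5, max_replicas=20):
    """Estimate worker replicas for a queue backlog (illustrative only)."""
    wanted = math.ceil(queued_messages / messages_per_replica)
    # min-replicas is 0 for the worker, so an empty queue means no replicas.
    return min(max_replicas, wanted)


for backlog in (0, 3, 100, 400):
    print(f"{backlog} messages -> {replicas_for_backlog(backlog)} replicas")
```

A backlog of 400 messages would ask for 80 replicas but stays at 20, so the queue drains more slowly once the cap is reached.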


Scaling in a Container App YAML

You can also define scaling in a containerapp.yaml file (better for GitOps):

YAML
properties:
  template:
    scale:
      minReplicas: 1
      maxReplicas: 10
      rules:
        - name: http-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: cpu-rule
          custom:
            type: cpu
            metadata:
              type: Utilization
              value: "70"  # Scale out when average CPU utilization exceeds 70%

Deploy:

Bash
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

Understanding Scale-Down Cooldown

By default, Container Apps waits for a 300-second (5-minute) cooldown after the last scaling event before scaling in to the minimum replica count. Tune this to avoid flapping:

YAML
scale:
  minReplicas: 1
  maxReplicas: 10
  cooldownPeriod: 300  # 5 minutes before scaling down
  pollingInterval: 30  # check every 30 seconds

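LLM-specific consideration: LLM requests can be slow (30s+), and a replica removed during a long inference request would cut it off mid-stream. The cooldown reduces that risk by keeping replicas alive until traffic has been quiet for the full period, but it is not a hard guarantee, so the service should still finish or hand off in-flight requests when it is asked to shut down.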


Cost Impact of Scale to Zero

Example: pharmabot staging environment, 2 vCPU / 4GB RAM

| Config | Monthly cost |
|---|---|
| Always 1 replica running | ~$45/month |
| Scale to zero, 2h/day active | ~$3/month |
| Scale to zero, 8h/day active | ~$11/month |

For 5 staging environments across a team, scale-to-zero saves roughly $170 to $210/month, depending on how many hours a day each environment is active.
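
The team-level saving can be checked directly from the per-environment figures in the table. The sketch below uses only those approximate numbers; `monthly_savings` is a hypothetical helper, and actual pricing depends on region and usage.

```python
# Monthly cost figures copied from the table above (USD, approximate).
ALWAYS_ON = 45
SCALE_TO_ZERO = {"2h/day active": 3, "8h/day active": 11}


def monthly_savings(environments, always_on_cost, scaled_cost):
    """Savings from switching every environment from always-on to scale-to-zero."""
    return environments * (always_on_cost - scaled_cost)


for label, cost in SCALE_TO_ZERO.items():
    print(f"{label}: ~${monthly_savings(5, ALWAYS_ON, cost)}/month saved")
```

For 5 environments this gives about $210/month at 2 hours of activity per day and about $170/month at 8 hours per day.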


Monitoring Scaling Events

KUSTO
// View scaling events in Log Analytics
ContainerAppSystemLogs_CL
| where Reason_s in ("ScalingReplicaAdded", "ScalingReplicaRemoved")
| where TimeGenerated > ago(24h)
| project TimeGenerated, Reason_s, Count_d, ContainerAppName_s
| order by TimeGenerated desc

Checkpoint

Check your current scaling configuration:

Bash
az containerapp show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "properties.template.scale" \
  -o yaml

Expected output for production:

YAML
minReplicas: 1
maxReplicas: 10
rules:
  - name: http-rule
    http:
      metadata:
        concurrentRequests: '10'

Then run a load test to verify it scales:

Bash
# Requires hey (HTTP load tester); options must come before the URL
hey -n 1000 -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"message":"test"}' \
  http://pharmabot.example.com/api/chat

Watch the replica count increase in the Azure Portal while the load test runs.
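
If you would rather watch from a terminal, the replica count can also be polled with the Azure CLI. The sketch below is one possible approach: it shells out to `az containerapp replica list` and counts the entries in the returned JSON array. `count_replicas` and `watch_replicas` are hypothetical helper names, and the script assumes the `az` CLI is installed and logged in.

```python
import json
import subprocess
import time


def count_replicas(replica_list_json):
    """Count entries in the JSON array printed by `az containerapp replica list`."""
    return len(json.loads(replica_list_json))


def watch_replicas(name, resource_group, interval_s=10, samples=12):
    """Print the replica count every `interval_s` seconds (needs a logged-in az CLI)."""
    cmd = ["az", "containerapp", "replica", "list",
           "--name", name, "--resource-group", resource_group, "-o", "json"]
    for _ in range(samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        print(time.strftime("%H:%M:%S"), "replicas:", count_replicas(out))
        time.sleep(interval_s)
```

Running `watch_replicas("pharmabot", "pharmabot-rg")` in a second terminal during the load test prints one sample every 10 seconds for about two minutes.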