Scale to Zero with Azure Container Apps
Configure Azure Container Apps to automatically scale your LLM service based on HTTP traffic, KEDA rules, and custom metrics — including scaling to zero replicas when idle.
Why Scale to Zero Matters for LLM Services
A standard VM or AKS cluster costs money 24/7 whether it serves requests or not. Azure Container Apps uses KEDA (Kubernetes Event-Driven Autoscaling) to scale replicas from 0 to N based on demand.
For LLM services this is powerful: a dev or staging environment that gets no traffic at night scales to zero — zero cost, zero waste.
But there's a catch: scaling from 0 to 1 takes 15–30 seconds (cold start: pull image, start Python, load dependencies). For user-facing APIs, this is unacceptable. Solution: scale to minimum 1 in production, zero only in dev/staging.
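The cold-start penalty is easy to measure yourself. A sketch using curl's timing output, assuming a staging app that has scaled to zero and a hypothetical `/health` URL — substitute your own ingress FQDN:

```shell
# Time a cold request (first after idle) vs. a warm one.
# URL is hypothetical — substitute your app's ingress FQDN.
URL="https://pharmabot-staging.example.com/health"

# First request pays the cold start: scale 0→1, start Python, load dependencies.
cold=$(curl -s -o /dev/null -w '%{time_total}' "$URL")

# Second request hits the already-running replica.
warm=$(curl -s -o /dev/null -w '%{time_total}' "$URL")

echo "cold: ${cold}s  warm: ${warm}s"
```

If the cold number is consistently painful, that is your signal to pin `min-replicas 1` in production.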
Azure Container Apps Scaling Architecture
HTTP traffic arrives
│
▼
KEDA HTTP scaler detects requests in queue
│
▼
Azure Container Apps adds a replica (0→1, 1→2, ...)
│
▼
Container starts, health check passes
│
▼
Traffic routed to new replica

Basic HTTP-Based Scaling
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 10

--scale-rule-http-concurrency 10: Add a replica when more than 10 concurrent requests are being handled by a single replica.
Resulting behaviour:
- 1 replica handles 0–10 concurrent requests
- At 11 concurrent requests: scales to 2 replicas
- At 21 concurrent requests: scales to 3 replicas
- When idle: scales back down to 1 (not 0, because min-replicas=1)
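You can watch this behaviour live during a load test. A small polling loop, assuming the az CLI is logged in and the app/resource-group names from the command above:

```shell
# Poll the worker's replica count a few times while traffic ramps up.
APP="pharmabot"
RG="pharmabot-rg"

for i in 1 2 3; do
  count=$(az containerapp replica list \
    --name "$APP" --resource-group "$RG" \
    --query "length(@)" -o tsv)
  echo "$(date '+%H:%M:%S')  replicas: ${count:-unknown}"
  sleep 10
done
```

Extend the loop (or use `watch`) for longer observation windows.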
Scale to Zero (Dev/Staging)
az containerapp update \
  --name pharmabot-staging \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 5 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 5

--min-replicas 0 allows scaling to zero. (Note: an inline `# comment` after a `\` continuation would break the command, so keep the flags comment-free.) When no traffic arrives for the cooldown period (5 minutes by default), the app scales to 0. The next request triggers a cold start.
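If even occasional cold starts are a problem in staging, a scheduled keep-warm ping during working hours avoids them at the cost of some idle compute. A sketch, assuming a public `/health` endpoint (URL hypothetical):

```shell
# Cron entry (crontab -e): ping /health every 10 minutes, 08:00-18:00, Mon-Fri.
#   */10 8-18 * * 1-5  curl -fsS -o /dev/null https://pharmabot-staging.example.com/health
# The ping itself is cheap, and it resets the idle timer so the app stays at 1 replica:
curl -fsS -o /dev/null --max-time 60 "https://pharmabot-staging.example.com/health" \
  && echo "warm ping ok" \
  || echo "warm ping failed (app may be cold-starting)"
```

Outside the cron window the app still scales to zero, so you keep most of the cost savings.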
Warming up a cold start:
Startup probes aren't exposed as flags on the scaling commands above; define them per container in containerapp.yaml and apply with az containerapp update --yaml. A generous failure threshold gives Python and model dependencies time to load:
probes:
  - type: startup
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 12  # 12 x 5s = 60 seconds total before marking failed

Custom Scaling: Scale on Queue Depth
If your LLM service processes jobs from an Azure Service Bus queue:
az containerapp update \
  --name pharmabot-worker \
  --resource-group pharmabot-rg \
  --min-replicas 0 \
  --max-replicas 20 \
  --scale-rule-name queue-rule \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata queueName=pharmabot-jobs \
                        queueLength=5 \
                        namespace=pharmabot-servicebus \
  --scale-rule-auth connection=servicebus-conn

queueLength=5: Add a replica for every 5 messages in the queue. 100 messages → 20 replicas. The --scale-rule-auth entry maps the scaler's connection trigger parameter to the servicebus-conn secret on the app.
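To see queue-driven scaling in action, you can watch queue depth and worker replica count drain together — a sketch, assuming az CLI access and the resource names from the command above:

```shell
# Poll active messages and worker replicas while a batch of jobs drains.
NS="pharmabot-servicebus"
RG="pharmabot-rg"
QUEUE="pharmabot-jobs"
APP="pharmabot-worker"

for i in 1 2 3; do
  depth=$(az servicebus queue show \
    --namespace-name "$NS" --resource-group "$RG" --name "$QUEUE" \
    --query "countDetails.activeMessageCount" -o tsv)
  replicas=$(az containerapp replica list \
    --name "$APP" --resource-group "$RG" --query "length(@)" -o tsv)
  echo "messages=${depth:-?}  replicas=${replicas:-?}"
  sleep 10
done
```

Expect replicas to rise roughly in step with depth/5 (capped at max-replicas), then fall back to zero once the queue empties and the cooldown passes.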
Scaling in a Container App YAML
You can also define scaling in a containerapp.yaml file (better for GitOps):
properties:
  template:
    scale:
      minReplicas: 1
      maxReplicas: 10
      rules:
        - name: http-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: cpu-rule
          custom:
            type: cpu
            metadata:
              type: Utilization
              value: "70"  # Scale up if CPU over 70%

Deploy:
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --yaml containerapp.yaml

Understanding Scale-Down Cooldown
By default, Container Apps waits 5 minutes of idle before scaling down. Tune this to avoid flapping:
scale:
  minReplicas: 1
  maxReplicas: 10
  cooldownPeriod: 300   # 5 minutes before scaling down
  pollingInterval: 30   # check every 30 seconds

LLM-specific consideration: LLM requests can be slow (30s+), and a scale-down during a long inference request could kill it mid-stream. The cooldown keeps idle replicas around well past the last observed demand, which covers in-flight requests in most cases; for very long generations, also give replicas a termination grace period that exceeds your slowest request.
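For the long-request problem specifically, the Container Apps template also accepts a termination grace period, which gives a replica time to drain in-flight requests before shutdown. A sketch, assuming your slowest inference finishes within 60 seconds:

```yaml
properties:
  template:
    terminationGracePeriodSeconds: 60   # let in-flight LLM requests drain before shutdown
    scale:
      minReplicas: 1
      maxReplicas: 10
```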
Cost Impact of Scale to Zero
Example: pharmabot staging environment, 2 vCPU / 4GB RAM
| Config | Monthly cost |
|---|---|
| Always 1 replica running | ~$45/month |
| Scale to zero, 2h/day active | ~$3/month |
| Scale to zero, 8h/day active | ~$11/month |
For 5 staging environments across a team, scale-to-zero saves ~$200/month.
Monitoring Scaling Events
// View scaling events in Log Analytics
ContainerAppSystemLogs_CL
| where Reason_s in ("ScalingReplicaAdded", "ScalingReplicaRemoved")
| where TimeGenerated > ago(24h)
| project TimeGenerated, Reason_s, Count_d, ContainerApp_s
| order by TimeGenerated desc

Checkpoint
Check your current scaling configuration:
az containerapp show \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "properties.template.scale" \
  -o yaml

Expected output for production:
minReplicas: 1
maxReplicas: 10
rules:
- name: http-rule
  http:
    metadata:
      concurrentRequests: '10'

Then run a load test to verify it scales:
# 1000 requests, 50 concurrent, using hey (HTTP load tester); flags go before the URL
hey -n 1000 -c 50 -m POST -H "Content-Type: application/json" \
  -d '{"message":"test"}' http://pharmabot.example.com/api/chat

Watch the replica count increase in the Azure Portal while the load test runs.