Rollback Strategy for LLM Deployments
LLM deployments can fail in ways that are invisible until users complain. Learn concrete rollback strategies for bad prompts, model upgrades, and embedding schema changes — with exact Azure CLI commands.
Why LLM Rollback Is Harder
A standard API rollback is simple: redeploy the old image. LLM service rollbacks have hidden complexity:
- Prompt changes are invisible — the new code looks identical to the old; only the system prompt string changed
- Model version upgrades change output format — GPT-4o-2025-11 may return JSON differently from GPT-4o-2024-05
- Embedding model changes break the vector store — if you upgrade from text-embedding-3-small to text-embedding-3-large, every existing embedding in your vector DB is incompatible
- Database migrations don't roll back automatically — a schema migration applied before the bad deploy is still there after the code rollback
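One way to make prompt changes visible is to log a short fingerprint of the system prompt alongside each deploy or request. The sketch below is illustrative only: `prompt_fingerprint` and the sample prompt strings are hypothetical and do not come from any particular codebase.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Return a short, stable identifier for a system prompt string."""
    # Strip surrounding whitespace so trailing newlines do not change the ID.
    normalized = prompt.strip().encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()[:12]


# Hypothetical prompts: the code around them is identical, only the string differs.
old_id = prompt_fingerprint("You are a careful pharmacy assistant.")
new_id = prompt_fingerprint("You are a careful pharmacy assistant. Answer in JSON.")
print(f"old={old_id} new={new_id} changed={old_id != new_id}")
```

Logging this value at startup gives a one-line diff signal when two revisions differ only in their prompt.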
The 5 Types of LLM Rollback
Type 1: Container Image Rollback (Fastest — 30 seconds)
If the problem is in the application code (bug in the API, broken endpoint), roll back to the previous container image:
# List revisions to find the previous healthy one
az containerapp revision list \
--name pharmabot \
--resource-group pharmabot-rg \
--query "[].{name:name, active:properties.active, traffic:properties.trafficWeight, created:properties.createdTime}" \
-o table
# Shift all traffic to the previous revision
az containerapp ingress traffic set \
--name pharmabot \
--resource-group pharmabot-rg \
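--revision-weight pharmabot--v1=100 pharmabot--v2=0
This is near-instant, provided the app runs in multiple-revision mode and the previous revision is still active: no new container starts and no health checks run. Traffic shifts within 30 seconds.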
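Choosing the rollback target by eye is error-prone under pressure. The sketch below assumes the output of the `az containerapp revision list` query above, requested as JSON (`-o json`) and already parsed into a list of dicts with the keys `name`, `active`, and `created`. The helper names and the sample timestamps are hypothetical.

```python
from typing import Optional


def pick_rollback_target(revisions: list[dict], current: str) -> Optional[str]:
    """Return the newest active revision other than `current`, or None."""
    candidates = [
        r for r in revisions
        if r["name"] != current and r.get("active")
    ]
    if not candidates:
        return None
    # Assumes all timestamps share one ISO-8601 format, so they sort as strings.
    return max(candidates, key=lambda r: r["created"])["name"]


def traffic_args(target: str, current: str) -> list[str]:
    """Build the --revision-weight arguments for the traffic shift."""
    return ["--revision-weight", f"{target}=100", f"{current}=0"]


# Hypothetical sample data in the shape produced by the --query projection.
revisions = [
    {"name": "pharmabot--v0", "active": False, "created": "2026-04-01T09:00:00+00:00"},
    {"name": "pharmabot--v1", "active": True, "created": "2026-05-01T09:00:00+00:00"},
    {"name": "pharmabot--v2", "active": True, "created": "2026-05-15T09:00:00+00:00"},
]
target = pick_rollback_target(revisions, current="pharmabot--v2")
print(target, traffic_args(target, "pharmabot--v2"))
```

Inactive revisions are skipped on purpose: shifting traffic to one would first require reactivating it, which is no longer a pure traffic change.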
Type 2: Prompt Rollback (Git-based)
If the bad deployment only changed a system prompt:
# Find the last good commit
git log --oneline -10
# Revert the commit that changed the prompt (assumed here to be HEAD)
git revert HEAD --no-edit
git push origin main
The CI/CD pipeline deploys the reverted code. Or, if you store prompts in a database:
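# Store prompts in DB with version numbers
# (Base is the project's declarative base; db is assumed to be an SQLAlchemy AsyncSession)
from sqlalchemy import Boolean, Column, DateTime, Integer, String, Text, text
class SystemPrompt(Base):
    __tablename__ = "system_prompts"
    version = Column(Integer, primary_key=True)
    name = Column(String)
    content = Column(Text)
    is_active = Column(Boolean, default=False)
    created_at = Column(DateTime)
# Roll back to previous prompt version without a code deploy
# Run both statements in one transaction so the prompt is never left with no active version
await db.execute(
    text("UPDATE system_prompts SET is_active = FALSE WHERE name = 'drug_info'")
)
await db.execute(
    text("UPDATE system_prompts SET is_active = TRUE WHERE name = 'drug_info' AND version = 3")
)
await db.commit()
Prompt-in-DB rollback needs no deploy at all, which makes it the fastest option when only the prompt changed.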
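The two UPDATE statements above leave a brief window with no active prompt if they are not run atomically. A minimal sketch of an atomic switch follows, using the standard-library `sqlite3` module purely so the example runs on its own. The table mirrors `system_prompts`, but the composite primary key and the `activate_prompt_version` helper are assumptions, not part of the original schema.

```python
import sqlite3


def activate_prompt_version(conn: sqlite3.Connection, name: str, version: int) -> None:
    """Make `version` the only active row for `name`, in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        exists = conn.execute(
            "SELECT 1 FROM system_prompts WHERE name = ? AND version = ?",
            (name, version),
        ).fetchone()
        if exists is None:
            # Refuse to deactivate everything when the target does not exist.
            raise ValueError(f"no version {version} for prompt {name!r}")
        # A single statement flips every row for this prompt at once.
        conn.execute(
            "UPDATE system_prompts SET is_active = (version = ?) WHERE name = ?",
            (version, name),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE system_prompts ("
    "version INTEGER, name TEXT, content TEXT, is_active INTEGER DEFAULT 0, "
    "PRIMARY KEY (name, version))"
)
conn.executemany(
    "INSERT INTO system_prompts VALUES (?, ?, ?, ?)",
    [(3, "drug_info", "known-good prompt", 0), (4, "drug_info", "bad prompt", 1)],
)
activate_prompt_version(conn, "drug_info", 3)
active = conn.execute(
    "SELECT version FROM system_prompts WHERE name = 'drug_info' AND is_active = 1"
).fetchall()
print(active)
```

The existence check matters in an incident: a typo in the version number should fail loudly rather than silently deactivate every prompt.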
Type 3: Model Version Rollback
Azure OpenAI deployments are versioned. If you upgraded from gpt-4o-2024-05-13 to gpt-4o-2025-01-21 and the new version behaves differently:
# In Azure Portal: Azure OpenAI → Deployments → Edit Deployment
# Change model version back to the previous one
# Or via Azure CLI (--model-version pins the deployment to a specific version):
az cognitiveservices account deployment create \
--name pharmabot-openai \
--resource-group pharmabot-rg \
--deployment-name gpt4o-stable \
--model-name gpt-4o \
--model-version "2024-05-13" \
--model-format OpenAI \
--sku-capacity 100 \
--sku-name Standard
Prevention: Always pin model versions explicitly. Never use latest. When upgrading, run a shadow evaluation: route 5% of traffic to the new model version and compare output quality before full cutover.
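The 5% traffic split can be made deterministic by hashing a stable key, so the same user always hits the same model version and the two groups can be compared cleanly. This is a sketch under assumptions: `gpt4o-stable` is the deployment name used above, while `gpt4o-candidate` and the `choose_deployment` helper are hypothetical.

```python
import hashlib

STABLE_DEPLOYMENT = "gpt4o-stable"
CANDIDATE_DEPLOYMENT = "gpt4o-candidate"  # hypothetical name for the new version


def choose_deployment(request_key: str, candidate_percent: int = 5) -> str:
    """Deterministically map a key (e.g. a user ID) to a deployment name."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # integer in 0..99
    if bucket < candidate_percent:
        return CANDIDATE_DEPLOYMENT
    return STABLE_DEPLOYMENT


# Count how many of 10,000 synthetic keys land on the candidate deployment.
sample = [choose_deployment(f"user-{i}") for i in range(10_000)]
share = sample.count(CANDIDATE_DEPLOYMENT) / len(sample)
print(f"candidate share: {share:.1%}")
```

Setting `candidate_percent` to 0 acts as the rollback switch for the experiment itself: all traffic returns to the pinned stable deployment without a redeploy.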
Type 4: Embedding Model Rollback (Most Dangerous)
If you switched embedding models, every vector in your search index is now incompatible. You cannot just roll back the code — the old model will generate vectors in a different space than what's stored.
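A cheap safeguard is to record which embedding model built each index and refuse to query it with a different one. The sketch below is hypothetical throughout (`EmbeddingSpec`, `assert_compatible`); the dimensions shown are the default output sizes of the two models, and a deployment that requests shortened embeddings would need different values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


class IncompatibleEmbeddingError(RuntimeError):
    """Raised when a query embedder does not match the index it targets."""


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Fail fast instead of returning meaningless nearest neighbours."""
    if index_spec != query_spec:
        raise IncompatibleEmbeddingError(
            f"index built with {index_spec.model} ({index_spec.dimensions}d) "
            f"but queries use {query_spec.model} ({query_spec.dimensions}d)"
        )


SMALL = EmbeddingSpec("text-embedding-3-small", 1536)  # default dimensions
LARGE = EmbeddingSpec("text-embedding-3-large", 3072)  # default dimensions

assert_compatible(SMALL, SMALL)  # matching specs pass silently
try:
    assert_compatible(SMALL, LARGE)
except IncompatibleEmbeddingError as exc:
    print(f"blocked: {exc}")
```

Comparing the model name as well as the dimension count is deliberate: two models can share a vector size and still produce incompatible spaces.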
Plan A: Keep both indexes running during migration:
# During migration: search both indexes, each with the embedder that built it
async def hybrid_search(query: str, phase: str = "migration"):
    if phase == "migration":
        # Search old index with the old embedding model
        old_results = await old_search_index.search(
            await old_embedder.embed(query)
        )
        # Also search new index (gradually built) with the new embedding model
        new_results = await new_search_index.search(
            await new_embedder.embed(query)
        )
        # Merge by rank rather than raw score: similarity scores from
        # different embedding models are not directly comparable
        return merge_results(old_results, new_results)
    else:
        return await new_search_index.search(await new_embedder.embed(query))
Plan B: Emergency rollback (accept degraded quality):
# Roll back application code to use old embedding model
git revert HEAD --no-edit
git push origin main
# If the vector store still holds the old embeddings, search works again
# Documents ingested with the new model during the bad period cannot be matched by the old model
# Treat them as lost from the index and re-ingest them from backup with the old model
Prevention: Never cut over to a new embedding model until all existing data has been fully re-indexed with it. Keep the old index alive for 24 hours after the new one is fully populated and validated.
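The `merge_results` helper in Plan A is not defined in the snippet. One possible implementation is reciprocal rank fusion, which combines the two result lists by rank position and so avoids comparing similarity scores across embedding spaces. The sketch assumes each result list is a sequence of document IDs ordered best first; that shape, the `k` constant, and `top_n` are assumptions.

```python
from collections import defaultdict


def merge_results(
    old_results: list[str],
    new_results: list[str],
    k: int = 60,
    top_n: int = 10,
) -> list[str]:
    """Fuse two ranked ID lists with reciprocal rank fusion (RRF)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (old_results, new_results):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found in both add up.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical document IDs: "doc-b" appears in both indexes.
merged = merge_results(["doc-a", "doc-b", "doc-c"], ["doc-b", "doc-d"])
print(merged)
```

Documents present in both indexes rise to the top, which is usually the desired behaviour while the new index is only partially populated.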
Type 5: Database Schema Rollback
If a database migration was applied alongside a bad deploy:
First priority: Roll back the application code immediately (traffic rollback). If the migration was additive (for example, a new nullable column), the migrated schema is usually still backwards-compatible with the old code.
If the schema change was breaking:
-- EF Core: generate a "down" migration script
-- This is why you should test the Down() method in every migration
-- Example: reverse an AddColumn migration
ALTER TABLE drug_requests DROP COLUMN IF EXISTS new_column;
Or restore from a pre-deploy snapshot (Azure Database for PostgreSQL supports point-in-time restore):
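az postgres flexible-server restore \
--resource-group pharmabot-rg \
--source-server pharmabot-db \
--name pharmabot-db-restored \
--restore-time "2026-05-15T14:30:00Z"
# Use a timestamp from before the bad deploy. The restore creates a new server
# (pharmabot-db-restored), so the application must be repointed to it afterwards.
The Rollback Decision Tree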
Bad deployment detected
        │
        ▼
Is it a production emergency (data loss, safety issue)?
        │ Yes                              │ No
        ▼                                  ▼
Rollback NOW, investigate later    Investigate first
(traffic shift: 30 seconds)        (is it actually a bug?)
        │
        ▼
What changed?
   │           │           │            │
  Code       Prompt      Model        Schema
   │           │         Version        │
   ▼           ▼           ▼            ▼
az traffic   Revert      Redeploy     Point-in-time
set          prompt      Azure OAI    restore
             in DB       deployment
5-Point Rollback Checklist
Before every production deployment, verify:
- [ ] Previous revision is still alive: az containerapp revision list shows the old revision still active
- [ ] Rollback command is documented: exact CLI command in the runbook, not "we'll figure it out"
- [ ] Prompt version is logged: every LLM call logs which prompt version was used, so you know which requests used the bad prompt
- [ ] Database backup taken: az postgres flexible-server backup list shows a recent backup
- [ ] Rollback tested: you've actually done a rollback drill in staging, not just read about it
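The "prompt version is logged" item can be as simple as one structured log line per LLM call. A minimal sketch follows, assuming JSON lines written through the standard `logging` module; the field names, the `llm_calls` logger, and both helpers are hypothetical.

```python
import json
import logging

logger = logging.getLogger("llm_calls")


def log_llm_call(request_id: str, prompt_name: str, prompt_version: int, deployment: str) -> str:
    """Emit one JSON log line describing which prompt and model served a request."""
    record = json.dumps(
        {
            "request_id": request_id,
            "prompt_name": prompt_name,
            "prompt_version": prompt_version,
            "deployment": deployment,
        },
        sort_keys=True,
    )
    logger.info(record)
    return record


def affected_requests(records: list[str], prompt_name: str, bad_version: int) -> list[str]:
    """Return the request IDs that were served with the bad prompt version."""
    hits = []
    for line in records:
        entry = json.loads(line)
        if entry["prompt_name"] == prompt_name and entry["prompt_version"] == bad_version:
            hits.append(entry["request_id"])
    return hits


# Hypothetical traffic: version 4 is the bad prompt, version 3 the known-good one.
lines = [
    log_llm_call("req-1", "drug_info", 3, "gpt4o-stable"),
    log_llm_call("req-2", "drug_info", 4, "gpt4o-stable"),
    log_llm_call("req-3", "drug_info", 4, "gpt4o-stable"),
]
print(affected_requests(lines, "drug_info", bad_version=4))
```

After a rollback, the same filter tells you which users received answers from the bad prompt and may need follow-up.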
Automated Rollback in GitHub Actions
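- name: Deploy and verify (with auto-rollback)
  run: |
    # Assumes multiple-revision mode. Record the revision currently serving all traffic.
    PREV=$(az containerapp revision list \
      --name pharmabot --resource-group pharmabot-rg \
      --query '[?properties.trafficWeight==`100`].name | [0]' -o tsv)
    # Pin traffic to that revision by name so the new revision starts at 0%
    az containerapp ingress traffic set \
      --name pharmabot --resource-group pharmabot-rg \
      --revision-weight "$PREV=100"
    # Deploy new revision (short suffix starting with a letter, as a conservative naming choice)
    SUFFIX="r${GITHUB_SHA::7}"
    NEW="pharmabot--$SUFFIX"
    az containerapp update \
      --name pharmabot \
      --resource-group pharmabot-rg \
      --image ${{ env.NEW_IMAGE }} \
      --revision-suffix "$SUFFIX"
    # Look up the new revision's own FQDN so it can be checked before it takes traffic
    REVISION_URL=$(az containerapp revision show \
      --name pharmabot --resource-group pharmabot-rg \
      --revision "$NEW" --query properties.fqdn -o tsv)
    # Wait for health check (12 attempts, 5 seconds apart)
    HEALTHY=false
    for i in $(seq 1 12); do
      CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        "https://$REVISION_URL/health/ready" || true)
      if [ "$CODE" = "200" ]; then
        HEALTHY=true
        break
      fi
      sleep 5
    done
    if [ "$HEALTHY" = "true" ]; then
      echo "Deployment healthy. Shifting traffic."
      az containerapp ingress traffic set \
        --name pharmabot --resource-group pharmabot-rg \
        --revision-weight "$NEW=100" "$PREV=0"
    else
      echo "Deployment unhealthy. AUTO-ROLLBACK."
      az containerapp ingress traffic set \
        --name pharmabot --resource-group pharmabot-rg \
        --revision-weight "$PREV=100" "$NEW=0"
      exit 1
    fi
Checkpoint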
Practice a rollback drill in staging right now:
# 1. Note the revision currently serving traffic
PREV=$(az containerapp revision list --name pharmabot-staging --resource-group pharmabot-rg \
--query '[?properties.trafficWeight==`100`].name | [0]' -o tsv)
# 2. Deploy a "bad" version (any image with a different tag)
az containerapp update --name pharmabot-staging --resource-group pharmabot-rg \
--image pharmabotacr.azurecr.io/pharmabot:bad-test
# 3. Roll back to the revision recorded in step 1
az containerapp ingress traffic set --name pharmabot-staging --resource-group pharmabot-rg \
--revision-weight "$PREV=100" latest=0
# 4. Verify the previous version is serving traffic (substitute your app's full ingress FQDN)
curl https://pharmabot-staging.azurecontainerapps.io/health
If you can complete this drill in under 2 minutes, you're ready for a real production rollback.
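For a scripted drill, the verification step can be wrapped in the same retry pattern the GitHub Actions job uses (12 attempts, 5 seconds apart). The sketch below uses only the standard library; `http_status` and `wait_until_healthy` are hypothetical helpers, and the probe is injectable so the loop can be exercised without a live endpoint.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def wait_until_healthy(
    probe: Callable[[], int],
    attempts: int = 12,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection errors, timeouts, and HTTP error statuses count as "not ready".
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Simulated endpoint that becomes healthy on the third check (no network needed).
responses = iter([503, 503, 200])
became_healthy = wait_until_healthy(lambda: next(responses), sleep=lambda _: None)
print(became_healthy)
```

Against a real endpoint the call would look like `wait_until_healthy(lambda: http_status("https://<your-app-fqdn>/health"))`, where the URL is a placeholder for the staging app's ingress address.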