AI Systems · Intermediate

Rollback Strategy for LLM Deployments

LLM deployments can fail in ways that are invisible until users complain. Learn concrete rollback strategies for bad prompts, model upgrades, and embedding schema changes — with exact Azure CLI commands.

Asma Hafeez Khan · May 15, 2026 · 7 min read
Tags: LLMOps, Rollback, Azure Container Apps, Deployment, Incident Response

Why LLM Rollback Is Harder

A standard API rollback is simple: redeploy the old image. LLM service rollbacks have hidden complexity:

  1. Prompt changes are invisible — the new code looks identical to the old; only the system prompt string changed
  2. Model version upgrades change output format — GPT-4o-2025-11 may return JSON differently from GPT-4o-2024-05
  3. Embedding model changes break the vector store — if you upgrade from text-embedding-3-small to text-embedding-3-large, every existing embedding in your vector DB is incompatible
  4. Database migrations don't roll back automatically — a schema migration applied before the bad deploy is still there after the code rollback

The 5 Types of LLM Rollback

Type 1: Container Image Rollback (Fastest — 30 seconds)

If the problem is in the application code (bug in the API, broken endpoint), roll back to the previous container image:

Bash
# List revisions to find the previous healthy one
az containerapp revision list \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "[].{name:name, active:properties.active, traffic:properties.trafficWeight, created:properties.createdTime}" \
  -o table

# Shift all traffic to the previous revision
az containerapp ingress traffic set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --revision-weight pharmabot--v1=100 pharmabot--v2=0

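As long as the previous revision is still active, no new container has to start: the app only reroutes requests, and traffic shifts within about 30 seconds. This requires the app to run in multiple-revision mode. If the old revision was deactivated, reactivate it first with az containerapp revision activate.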


Type 2: Prompt Rollback (Git-based)

If the bad deployment only changed a system prompt:

Bash
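# Find the commit that introduced the prompt change
git log --oneline -10

# Revert it with a new commit. HEAD assumes the prompt change is the most
# recent commit; otherwise pass that commit's SHA instead of HEAD.
git revert HEAD --no-edit
git push origin main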

The CI/CD pipeline deploys the reverted code. Or, if you store prompts in a database:

Python
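from sqlalchemy import Boolean, Column, DateTime, Integer, String, Text, text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Store prompts in the DB with version numbers
class SystemPrompt(Base):
    __tablename__ = "system_prompts"
    version = Column(Integer, primary_key=True)
    name = Column(String)
    content = Column(Text)
    is_active = Column(Boolean, default=False)
    created_at = Column(DateTime)

# Roll back to a previous prompt version without a code deploy.
# `db` is assumed to be a SQLAlchemy AsyncSession. A single UPDATE flips both
# rows at once, so there is never a moment with no active prompt.
async def rollback_prompt(db, name: str = "drug_info", version: int = 3) -> None:
    await db.execute(
        text(
            "UPDATE system_prompts SET is_active = (version = :version) "
            "WHERE name = :name"
        ),
        {"name": name, "version": version},
    )
    await db.commit()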

Prompt-in-DB rollback needs no deploy at all: it takes effect as soon as the application next reads the active prompt from the database.
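How quickly a flipped is_active flag reaches users depends on how often the service re-reads it. The sketch below is one hypothetical way to bound that delay with a short-lived cache; ActivePromptCache, the fetch callback, and the 30-second TTL are illustrative assumptions, not part of the original service.

```python
import time
from typing import Callable, Dict, Tuple

Prompt = Tuple[int, str]  # (version, content)


class ActivePromptCache:
    """Caches the active prompt per name for a short TTL.

    A rollback in the database becomes visible after at most ttl_seconds.
    """

    def __init__(
        self,
        fetch: Callable[[str], Prompt],  # hypothetical: reads the is_active row
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Prompt]] = {}

    def get(self, name: str) -> Prompt:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, skip the database
        value = self._fetch(name)
        self._entries[name] = (now + self._ttl, value)
        return value
```

A shorter TTL makes rollbacks land faster at the cost of more database reads; the right value depends on request volume.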


Type 3: Model Version Rollback

Azure OpenAI deployments are versioned. If you upgraded from gpt-4o-2024-05-13 to gpt-4o-2025-01-21 and the new version behaves differently:

Bash
# In Azure Portal: Azure OpenAI → Deployments → Edit Deployment
# Change model version back to the previous one

# Or via Azure CLI, pinning the deployment to a specific model version:
az cognitiveservices account deployment create \
  --name pharmabot-openai \
  --resource-group pharmabot-rg \
  --deployment-name gpt4o-stable \
  --model-name gpt-4o \
  --model-version "2024-05-13" \
  --model-format OpenAI \
  --sku-capacity 100 \
  --sku-name Standard

Prevention: Always pin model versions explicitly. Never use latest. When upgrading, run a canary evaluation: route 5% of traffic to the new model version and compare output quality before full cutover.
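A 5% split can be done in application code by hashing a stable request or user key, so the same caller always lands on the same deployment. This is a minimal sketch under assumptions: pick_deployment and the gpt4o-canary deployment name are hypothetical, and gpt4o-stable is taken from the CLI example above.

```python
import hashlib


def pick_deployment(
    request_key: str,
    canary_percent: int = 5,
    stable: str = "gpt4o-stable",
    canary: str = "gpt4o-canary",  # hypothetical deployment pinned to the new version
) -> str:
    """Deterministically route about canary_percent% of keys to the canary."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

Setting canary_percent to 0 turns the split off without a deploy if the value comes from configuration, which doubles as the rollback switch for the model upgrade.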


Type 4: Embedding Model Rollback (Most Dangerous)

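If you switched embedding models and re-embedded the index, rolling back the code alone does not help: the old model produces query vectors in a different space (and often with a different dimension) from the vectors now stored. If the index was not re-embedded, the reverse applies: the old vectors still work with the old model, but every document ingested after the switch is incompatible.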

Plan A — Keep both indexes running during migration:

Python
# During migration: query both indexes, each with its own embedder, and merge the results
async def hybrid_search(query: str, phase: str = "migration"):
    if phase == "migration":
        # Search old index
        old_results = await old_search_index.search(
            await old_embedder.embed(query)
        )
        # Also search new index (gradually built)
        new_results = await new_search_index.search(
            await new_embedder.embed(query)
        )
        # Merge results
        return merge_results(old_results, new_results)
    else:
        return await new_search_index.search(await new_embedder.embed(query))
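The merge step needs care because similarity scores from two different embedding models are not on a comparable scale. One common option is reciprocal rank fusion, which uses only each document's position in its list. The sketch below is a possible merge_results under the assumption that each result list is a sequence of document IDs ordered best first; the original does not show its implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def merge_results(*ranked_lists: Sequence[str], k: int = 60, top_n: int = 10) -> List[str]:
    """Merge ranked lists of document IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in, so raw
    similarity scores from different embedding spaces are never compared.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]
```

Documents found by both indexes rise to the top, which suits a migration where the new index is only partially populated.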

Plan B — Emergency rollback (accept degraded quality):

Bash
# Roll back application code to use the old embedding model
# (HEAD assumes the embedding change is the most recent commit)
git revert HEAD --no-edit
git push origin main

# If the vector store still holds the old embeddings, search works again
# Documents ingested during the bad period were embedded with the new model,
# so the old model cannot retrieve them: re-ingest them from the source or a backup

Prevention: Never migrate embedding models without a full re-index of the old data first. Keep the old index alive for 24 hours after the new one is fully populated and validated.
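A mismatch between query embedder and index tends to fail silently, returning poor results rather than an error. One way to make it fail loudly is to record the embedding model alongside each index and check it before every search. The names below (IndexInfo, check_compatible, EmbeddingMismatchError) are hypothetical, and where the metadata is stored is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexInfo:
    name: str
    embedding_model: str  # model that produced the stored vectors
    dimensions: int       # vector length the index expects


class EmbeddingMismatchError(RuntimeError):
    """Raised when a query vector does not belong to the index's embedding space."""


def check_compatible(index: IndexInfo, query_model: str, query_vector: List[float]) -> None:
    """Refuse to search an index with a vector from a different embedding model."""
    if index.embedding_model != query_model:
        raise EmbeddingMismatchError(
            f"index {index.name!r} was built with {index.embedding_model!r}, "
            f"but the query was embedded with {query_model!r}"
        )
    if len(query_vector) != index.dimensions:
        raise EmbeddingMismatchError(
            f"index {index.name!r} expects {index.dimensions} dimensions, "
            f"got {len(query_vector)}"
        )
```

The model-name check matters even when dimensions match, because two models can emit vectors of the same length in unrelated spaces.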


Type 5: Database Schema Rollback

If a database migration was applied alongside a bad deploy:

First priority: Roll back the application code immediately (traffic rollback). The new schema is usually backwards-compatible with the old code, for example when the migration only added columns or tables.

If the schema change was breaking:

SQL
-- EF Core: generate a "down" migration script
-- This is why you should test the Down() method in every migration

-- Example: reverse an AddColumn migration
ALTER TABLE drug_requests DROP COLUMN IF EXISTS new_column;

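Or restore from a pre-deploy snapshot. Azure Database for PostgreSQL supports point-in-time restore, which creates a new server: the application must be repointed at it, and writes made after the restore time are not included: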

Bash
# Restore to a point in time before the bad deploy
az postgres flexible-server restore \
  --resource-group pharmabot-rg \
  --source-server pharmabot-db \
  --name pharmabot-db-restored \
  --restore-time "2026-05-15T14:30:00Z"

The Rollback Decision Tree

Bad deployment detected
        │
        ▼
Is it a production emergency (data loss, safety issue)?
        │ Yes                              │ No
        ▼                                 ▼
Rollback NOW, investigate later    Investigate first
(traffic shift: 30 seconds)        (is it actually a bug?)
        │
        ▼
What changed?
   │           │           │           │
  Code      Prompt      Model       Schema
   │           │      Version         │
   ▼           ▼           │          ▼
az traffic   Revert     Redeploy  Point-in-time
  set        prompt     Azure OAI  restore
             in DB      deployment
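The tree can also be written down as a small lookup, which is handy for a runbook script or an incident bot. This is only an illustrative encoding of the branches drawn above; the Change enum and next_step function are hypothetical names.

```python
from enum import Enum


class Change(Enum):
    CODE = "code"
    PROMPT = "prompt"
    MODEL_VERSION = "model_version"
    SCHEMA = "schema"


# One action per leaf of the decision tree
ROLLBACK_ACTION = {
    Change.CODE: "Shift traffic to the previous revision (az containerapp ingress traffic set)",
    Change.PROMPT: "Revert the prompt in the database",
    Change.MODEL_VERSION: "Redeploy the Azure OpenAI deployment with the previous model version",
    Change.SCHEMA: "Point-in-time restore of the database",
}


def next_step(is_emergency: bool, change: Change) -> str:
    """Return the first action to take for a bad deployment."""
    if not is_emergency:
        return "Investigate first: confirm it is actually a bug"
    return ROLLBACK_ACTION[change]
```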

5-Point Rollback Checklist

Before every production deployment, verify:

  • [ ] Previous revision is still alive: az containerapp revision list shows the old revision still active
  • [ ] Rollback command is documented: exact CLI command in the runbook, not "we'll figure it out"
  • [ ] Prompt version is logged: every LLM call logs which prompt version was used, so you know which requests used the bad prompt
  • [ ] Database backup taken: az postgres flexible-server backup list shows a recent backup
  • [ ] Rollback tested: you've actually done a rollback drill in staging, not just read about it
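For the prompt-version item, one structured log line per LLM call is usually enough to answer "which requests used the bad prompt?" after the fact. The sketch below assumes JSON log lines; log_llm_call, requests_using, and the field names are hypothetical choices, not an existing schema.

```python
import json
import logging
from typing import Iterable, List

logger = logging.getLogger("llm_calls")


def log_llm_call(request_id: str, prompt_name: str, prompt_version: int,
                 model_deployment: str) -> str:
    """Emit one JSON log line per LLM call and return it."""
    record = json.dumps(
        {
            "request_id": request_id,
            "prompt_name": prompt_name,
            "prompt_version": prompt_version,
            "model_deployment": model_deployment,
        },
        sort_keys=True,
    )
    logger.info(record)
    return record


def requests_using(records: Iterable[str], prompt_name: str, prompt_version: int) -> List[str]:
    """Return the request IDs that were served with a given prompt version."""
    matches = []
    for line in records:
        entry = json.loads(line)
        if entry["prompt_name"] == prompt_name and entry["prompt_version"] == prompt_version:
            matches.append(entry["request_id"])
    return matches
```

Logging the model deployment in the same line also covers the model-version rollback case.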

Automated Rollback in GitHub Actions

YAML
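- name: Deploy and verify (with auto-rollback)
  run: |
    # Requires multiple-revision mode. Record the current revision and pin all
    # traffic to it by name, so the new revision starts at 0%.
    # Assumes the latest ready revision is the one currently serving traffic.
    PREVIOUS=$(az containerapp show \
      --name pharmabot --resource-group pharmabot-rg \
      --query properties.latestReadyRevisionName -o tsv)
    az containerapp ingress traffic set \
      --name pharmabot --resource-group pharmabot-rg \
      --revision-weight "$PREVIOUS=100" latest=0

    # Deploy new revision
    az containerapp update \
      --name pharmabot \
      --resource-group pharmabot-rg \
      --image ${{ env.NEW_IMAGE }} \
      --revision-suffix ${{ github.sha }}

    # Look up the new revision's own URL (revision names are <app>--<suffix>)
    REVISION_URL=$(az containerapp revision show \
      --name pharmabot --resource-group pharmabot-rg \
      --revision "pharmabot--${{ github.sha }}" \
      --query properties.fqdn -o tsv)

    # Wait for health check (up to 12 attempts, 5 seconds apart)
    HEALTHY=false
    for i in $(seq 1 12); do
      CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        "https://$REVISION_URL/health/ready" || true)
      if [ "$CODE" = "200" ]; then
        HEALTHY=true
        break
      fi
      sleep 5
    done

    if [ "$HEALTHY" = "true" ]; then
      echo "Deployment healthy. Shifting traffic."
      az containerapp ingress traffic set \
        --name pharmabot --resource-group pharmabot-rg \
        --revision-weight latest=100 "$PREVIOUS=0"
    else
      echo "Deployment unhealthy. AUTO-ROLLBACK."
      az containerapp ingress traffic set \
        --name pharmabot --resource-group pharmabot-rg \
        --revision-weight "$PREVIOUS=100" latest=0
      exit 1
    fi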

Checkpoint

Practice a rollback drill in staging right now:

Bash
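# 1. Note the current revision and record its name (assumes multiple-revision mode)
az containerapp revision list --name pharmabot-staging --resource-group pharmabot-rg -o table
PREVIOUS=$(az containerapp show --name pharmabot-staging --resource-group pharmabot-rg \
  --query properties.latestReadyRevisionName -o tsv)

# 2. Deploy a "bad" version (any image with a different tag)
az containerapp update --name pharmabot-staging --resource-group pharmabot-rg \
  --image pharmabotacr.azurecr.io/pharmabot:bad-test

# 3. Roll back to the revision recorded in step 1
az containerapp ingress traffic set --name pharmabot-staging --resource-group pharmabot-rg \
  --revision-weight "$PREVIOUS=100" latest=0

# 4. Verify the previous version is serving traffic
FQDN=$(az containerapp show --name pharmabot-staging --resource-group pharmabot-rg \
  --query properties.configuration.ingress.fqdn -o tsv)
curl "https://$FQDN/health"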

If you can complete this drill in under 2 minutes, you're ready for a real production rollback.
