LLMOps & Deployment · Lesson 8 of 16

Rollback Strategy for LLM Deployments

Why LLM Rollback Is Harder

A standard API rollback is simple: redeploy the old image. LLM service rollbacks have hidden complexity:

  1. Prompt changes are invisible: the new code looks identical to the old, and only the system prompt string changed
  2. Model version upgrades change output format: gpt-4o version 2024-11-20 may return JSON differently from version 2024-05-13
  3. Embedding model changes break the vector store: if you upgrade from text-embedding-3-small to text-embedding-3-large, the new vectors have a different dimensionality (3072 vs. 1536 by default) and live in a different vector space, so every existing embedding in your vector DB is incompatible
  4. Database migrations don't roll back automatically: a schema migration applied before the bad deploy is still there after the code rollback

The 5 Types of LLM Rollback

Type 1: Container Image Rollback (Fastest — 30 seconds)

If the problem is in the application code (bug in the API, broken endpoint), roll back to the previous container image:

Bash
# List revisions to find the previous healthy one
az containerapp revision list \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --query "[].{name:name, active:properties.active, traffic:properties.trafficWeight, created:properties.createdTime}" \
  -o table

# Shift all traffic to the previous revision
az containerapp ingress traffic set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --revision-weight pharmabot--v1=100 pharmabot--v2=0

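Because the old revision is still running, no new container has to start and no health checks have to pass. Traffic typically shifts within 30 seconds. This only works if the app runs in multiple-revision mode and the previous revision has not been deactivated.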


Type 2: Prompt Rollback (Git-based)

If the bad deployment only changed a system prompt:

Bash
# Find the commit that changed the prompt
git log --oneline -10

# Revert that commit (HEAD only if it was the most recent one; otherwise use its hash)
BAD_COMMIT=HEAD
git revert "$BAD_COMMIT" --no-edit
git push origin main

The CI/CD pipeline deploys the reverted code. Or, if you store prompts in a database:

Python
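from sqlalchemy import Boolean, Column, DateTime, Integer, String, Text, text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Store prompts in DB with version numbers
class SystemPrompt(Base):
    __tablename__ = "system_prompts"
    version = Column(Integer, primary_key=True)
    name = Column(String)
    content = Column(Text)
    is_active = Column(Boolean, default=False)
    created_at = Column(DateTime)

# Roll back to previous prompt version without a code deploy.
# Runs inside an async function; `db` is assumed to be a SQLAlchemy AsyncSession.
# A single UPDATE flips both rows at once, so there is never a moment with no active prompt.
await db.execute(
    text("UPDATE system_prompts SET is_active = (version = :version) WHERE name = :name"),
    {"version": 3, "name": "drug_info"},
)
await db.commit()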

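Prompt-in-DB rollback needs no deploy at all: the change takes effect the next time the application reads the active prompt, so keep any in-process prompt cache short-lived.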
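To make the activate-and-read cycle concrete, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for the real database. The helper names (create_store, add_prompt, activate, get_active) and the composite primary key on (name, version) are assumptions made for this illustration, not part of the lesson's schema.

```python
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory prompt table (illustrative stand-in for the real DB)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE system_prompts ("
        " name TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " content TEXT NOT NULL,"
        " is_active INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (name, version))"
    )
    return conn


def add_prompt(conn: sqlite3.Connection, name: str, version: int, content: str) -> None:
    """Insert a new, inactive prompt version."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO system_prompts (name, version, content) VALUES (?, ?, ?)",
            (name, version, content),
        )


def activate(conn: sqlite3.Connection, name: str, version: int) -> None:
    """Make exactly one version of `name` active, in a single transaction."""
    with conn:
        exists = conn.execute(
            "SELECT 1 FROM system_prompts WHERE name = ? AND version = ?",
            (name, version),
        ).fetchone()
        if exists is None:
            raise ValueError(f"no prompt {name!r} with version {version}")
        # One UPDATE sets the target row to 1 and every other version to 0.
        conn.execute(
            "UPDATE system_prompts SET is_active = (version = ?) WHERE name = ?",
            (version, name),
        )


def get_active(conn: sqlite3.Connection, name: str) -> tuple:
    """Return (version, content) of the active prompt for `name`."""
    row = conn.execute(
        "SELECT version, content FROM system_prompts WHERE name = ? AND is_active = 1",
        (name,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no active prompt for {name!r}")
    return row
```

Rolling back is then a single call such as activate(conn, "drug_info", 3); refusing to activate a version that does not exist guards against ending up with no active prompt at all.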


Type 3: Model Version Rollback

Azure OpenAI deployments pin a model version. If you upgraded gpt-4o from version 2024-05-13 to a newer version such as 2024-11-20 and the new version behaves differently:

Bash
# In Azure Portal: Azure OpenAI → Deployments → Edit Deployment
# Change model version back to the previous one

# Or via Azure CLI (--model-version pins the deployment to a specific version):
az cognitiveservices account deployment create \
  --name pharmabot-openai \
  --resource-group pharmabot-rg \
  --deployment-name gpt4o-stable \
  --model-name gpt-4o \
  --model-version "2024-05-13" \
  --model-format OpenAI \
  --sku-capacity 100 \
  --sku-name Standard

Prevention: Always pin model versions explicitly. Never use latest. When upgrading, run a canary evaluation: route 5% of traffic to the new model version and compare output quality before full cutover.
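One way to implement the 5% split is a deterministic hash of a request or user ID, so the same caller always lands on the same model version. The sketch below is an illustration only: pick_deployment and the deployment name gpt4o-canary are hypothetical, and gpt4o-stable is borrowed from the CLI example above.

```python
import hashlib


def pick_deployment(
    request_id: str,
    canary_percent: int = 5,
    stable: str = "gpt4o-stable",
    canary: str = "gpt4o-canary",  # hypothetical deployment running the new version
) -> str:
    """Route roughly `canary_percent` percent of IDs to the canary deployment."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID into one of 100 buckets; the mapping is stable across processes.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return canary if bucket < canary_percent else stable
```

Setting canary_percent to 0 sends every request back to the stable deployment, which makes this knob a rollback mechanism in its own right.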


Type 4: Embedding Model Rollback (Most Dangerous)

If you switched embedding models, every vector in your search index is now incompatible. You cannot just roll back the code — the old model will generate vectors in a different space than what's stored.

Plan A — Keep both indexes running during migration:

Python
# During migration: search both indexes, each with the embedder that built it, then merge
async def hybrid_search(query: str, phase: str = "migration"):
    if phase == "migration":
        # Search old index
        old_results = await old_search_index.search(
            await old_embedder.embed(query)
        )
        # Also search new index (gradually built)
        new_results = await new_search_index.search(
            await new_embedder.embed(query)
        )
        # Merge results
        return merge_results(old_results, new_results)
    else:
        return await new_search_index.search(await new_embedder.embed(query))
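The merge_results helper is not defined in the snippet above. Similarity scores from two different embedding models are not directly comparable, so one reasonable option is to merge by rank instead of by score. The sketch below uses reciprocal rank fusion and assumes each result list is an ordered list of document IDs, best match first; the constant k = 60 is a commonly used default, not a value from this lesson.

```python
from collections import defaultdict


def merge_results(*result_lists, k: int = 60, top_n: int = 10) -> list:
    """Merge ranked lists of document IDs with reciprocal rank fusion.

    Each document earns 1 / (k + rank) from every list it appears in,
    so only positions matter and raw similarity scores are ignored.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stable output.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:top_n]
```

A document found by both indexes outranks one found by only a single index, which is the behavior you want while the new index is still being populated.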

Plan B — Emergency rollback (accept degraded quality):

Bash
# Roll back application code to use old embedding model
git revert HEAD --no-edit
git push origin main

# If the old index was never overwritten, the vector store still has the old embeddings, so search works again
# Documents ingested only into the new index during the bad period are missing from the old one
# Accept this as data loss and re-ingest them from backup

Prevention: Never switch embedding models in place. Re-embed the full corpus into a separate index with the new model first, and keep the old index alive for 24 hours after the new one is fully populated and validated.
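A cheap safeguard is to record which embedding model built an index and to refuse to query it with anything else. The following is a minimal sketch under that assumption; EmbeddingSpec, IncompatibleIndexError and assert_compatible are hypothetical names, and where the index metadata is actually stored is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """The embedding model and vector size an index was built with."""
    model: str
    dimensions: int


class IncompatibleIndexError(RuntimeError):
    """Raised when the query embedder does not match the index."""


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Fail fast instead of silently returning meaningless search results."""
    if index_spec != query_spec:
        raise IncompatibleIndexError(
            f"index built with {index_spec.model} ({index_spec.dimensions} dims), "
            f"but queries use {query_spec.model} ({query_spec.dimensions} dims)"
        )
```

Running this check at application startup turns a subtle relevance regression into an immediate, obvious deployment failure.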


Type 5: Database Schema Rollback

If a database migration was applied alongside a bad deploy:

First priority: Roll back the application code immediately (traffic rollback). If the migration was additive (a new nullable column or a new table), the new schema is usually backward-compatible with the old code.

If the schema change was breaking:

SQL
-- EF Core: generate a "down" migration script
-- This is why you should test the Down() method in every migration

-- Example: reverse an AddColumn migration
ALTER TABLE drug_requests DROP COLUMN IF EXISTS new_column;

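Or restore to a point in time before the deploy (Azure Database for PostgreSQL supports point-in-time restore). The restore creates a new server, so the application must be repointed at it, and writes made after the restore time are not carried over:

Bash
az postgres flexible-server restore \
  --resource-group pharmabot-rg \
  --source-server pharmabot-db \
  --name pharmabot-db-restored \
  --restore-time "2026-05-15T14:30:00Z"  # Before the bad deploy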

The Rollback Decision Tree

Bad deployment detected
        │
        ▼
Is it a production emergency (data loss, safety issue)?
        │ Yes                              │ No
        ▼                                 ▼
Rollback NOW, investigate later    Investigate first
(traffic shift: 30 seconds)        (is it actually a bug?)
        │
        ▼
What changed?
   │           │           │           │
  Code      Prompt      Model       Schema
   │           │      Version         │
   ▼           ▼           │          ▼
az traffic   Revert     Redeploy  Point-in-time
  set        prompt     Azure OAI  restore
             in DB      deployment
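The same decision logic can be written down as a small lookup, which is handy for a runbook script or an incident bot. This is a sketch only: the Change enum and decide function are hypothetical names, the action strings paraphrase the rollback types described earlier, and the embedding branch is taken from Type 4 rather than from the diagram.

```python
from enum import Enum


class Change(Enum):
    CODE = "code"
    PROMPT = "prompt"
    MODEL_VERSION = "model_version"
    EMBEDDING_MODEL = "embedding_model"
    SCHEMA = "schema"


# What to do first for each kind of change.
ROLLBACK_ACTION = {
    Change.CODE: "Shift traffic to the previous revision",
    Change.PROMPT: "Reactivate the previous prompt version in the DB",
    Change.MODEL_VERSION: "Redeploy the Azure OpenAI deployment with the pinned old version",
    Change.EMBEDDING_MODEL: "Roll back code and search the old index",
    Change.SCHEMA: "Roll back code first, then run the down migration or a point-in-time restore",
}


def decide(change: Change, emergency: bool) -> str:
    """Return the first action to take for a bad deployment."""
    if not emergency:
        return "Investigate first: confirm it is actually a bug"
    return ROLLBACK_ACTION[change]
```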

5-Point Rollback Checklist

Before every production deployment, verify:

  • [ ] Previous revision is still alive: az containerapp revision list shows the old revision still active
  • [ ] Rollback command is documented: exact CLI command in the runbook, not "we'll figure it out"
  • [ ] Prompt version is logged: every LLM call logs which prompt version was used, so you know which requests used the bad prompt
  • [ ] Database backup taken: az postgres flexible-server backup list shows a recent backup
  • [ ] Rollback tested: you've actually done a rollback drill in staging, not just read about it
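For the "prompt version is logged" item, a structured log line per LLM call is usually enough. The sketch below is one possible shape; log_llm_call, affected_requests and the field names are assumptions for illustration, and in practice the records would come from your log store, not an in-memory list.

```python
import json
import logging

logger = logging.getLogger("llm_calls")


def log_llm_call(request_id: str, prompt_name: str, prompt_version: int,
                 model_deployment: str) -> dict:
    """Emit one JSON log line per LLM call and return the record."""
    record = {
        "request_id": request_id,
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "model_deployment": model_deployment,
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record


def affected_requests(records: list, prompt_name: str, bad_version: int) -> list:
    """List the request IDs that were served with the bad prompt version."""
    return [
        r["request_id"]
        for r in records
        if r["prompt_name"] == prompt_name and r["prompt_version"] == bad_version
    ]
```

After a prompt rollback, affected_requests tells you which responses may need review or a follow-up with the user.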

Automated Rollback in GitHub Actions

YAML
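- name: Deploy and verify (with auto-rollback)
  run: |
    # Assumes the app runs in multiple-revision mode
    APP=pharmabot
    RG=pharmabot-rg

    # Remember the revision currently serving all traffic, and pin traffic to it
    PREVIOUS=$(az containerapp revision list --name "$APP" --resource-group "$RG" \
      --query '[?properties.trafficWeight==`100`].name | [0]' -o tsv)
    az containerapp ingress traffic set --name "$APP" --resource-group "$RG" \
      --revision-weight "$PREVIOUS=100"

    # Deploy new revision; it receives no traffic while weights are pinned
    az containerapp update \
      --name "$APP" \
      --resource-group "$RG" \
      --image ${{ env.NEW_IMAGE }} \
      --revision-suffix ${{ github.sha }}
    NEW="${APP}--${{ github.sha }}"
    REVISION_URL=$(az containerapp revision show --name "$APP" --resource-group "$RG" \
      --revision "$NEW" --query properties.fqdn -o tsv)

    # Wait for health check (12 attempts, 5 seconds apart)
    HEALTHY=false
    for i in $(seq 1 12); do
      CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        "https://$REVISION_URL/health/ready" || echo 000)
      if [ "$CODE" = "200" ]; then
        HEALTHY=true
        break
      fi
      sleep 5
    done

    if [ "$HEALTHY" = "true" ]; then
      echo "Deployment healthy. Shifting traffic."
      az containerapp ingress traffic set \
        --name "$APP" --resource-group "$RG" \
        --revision-weight "$NEW=100" "$PREVIOUS=0"
    else
      echo "Deployment unhealthy. AUTO-ROLLBACK."
      az containerapp ingress traffic set \
        --name "$APP" --resource-group "$RG" \
        --revision-weight "$PREVIOUS=100" "$NEW=0"
      exit 1
    fi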
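The polling loop in this step is worth testing on its own, since a bug there either blocks good deploys or waves bad ones through. Here is a sketch of the same retry logic in Python with the probe and the sleep injected, so it can be exercised without a network; wait_until_healthy is a hypothetical helper, not part of any Azure tooling.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    attempts: int = 12,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns HTTP 200 or the attempts run out.

    `probe` returns a status code; connection failures (OSError) count
    as an unhealthy attempt rather than aborting the loop.
    """
    for attempt in range(attempts):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # treat a refused or timed-out connection as "not ready yet"
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

With the defaults this mirrors the shell loop: up to 12 probes spaced 5 seconds apart, then the caller decides between shifting traffic and rolling back.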

Checkpoint

Practice a rollback drill in staging right now:

Bash
# 1. Note the revision currently serving traffic
az containerapp revision list --name pharmabot-staging --resource-group pharmabot-rg -o table
GOOD=$(az containerapp revision list --name pharmabot-staging --resource-group pharmabot-rg \
  --query '[?properties.trafficWeight==`100`].name | [0]' -o tsv)

# 2. Deploy a "bad" version (any image with a different tag)
az containerapp update --name pharmabot-staging --resource-group pharmabot-rg \
  --image pharmabotacr.azurecr.io/pharmabot:bad-test

# 3. Roll back: send all traffic to the revision noted in step 1
az containerapp ingress traffic set --name pharmabot-staging --resource-group pharmabot-rg \
  --revision-weight "$GOOD=100"

# 4. Verify the previous version is serving traffic
curl https://pharmabot-staging.azurecontainerapps.io/health

If you can complete this drill in under 2 minutes, you're ready for a real production rollback.