Testing LLM Services in CI: Mocks and Fixtures
Solve the core challenge of testing LLM services in CI — non-determinism, cost, and latency — using mock clients, VCR cassettes, fixture-based replay, and contract tests with pytest.
The Core Problem
Testing LLM services in CI is fundamentally different from testing a regular REST API. You face three problems that do not exist elsewhere:
Problem 1: Cost. Every real call to Azure OpenAI costs money. At scale (50 developers, 10 PRs per day, 20 tests per PR) that is hundreds of billable model calls every day just to run CI, before anyone re-runs a flaky job.
Problem 2: Non-determinism. GPT-4o does not return the same response twice. A test that asserts the reply equals an exact string will fail most of the time. LLM responses are probabilistic; your tests cannot be.
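For example (a contrived snippet with a hard-coded reply, just to show the difference):

reply = "Ibuprofen is an NSAID used to relieve pain and inflammation."  # one of many possible phrasings

# Brittle: an exact-string assertion fails as soon as the model rewords its answer
# assert reply == "Ibuprofen is an NSAID used for pain relief."

# Robust: assert a property of the reply instead of its exact wording
assert "nsaid" in reply.lower()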
Problem 3: Latency. A real Azure OpenAI call takes 2-10 seconds. A test suite with 30 LLM calls takes 60-300 seconds minimum. This kills the fast feedback loop that makes CI valuable.
The solution is a layered testing strategy: mocks for unit tests, fixture replay for integration tests, and contract tests for prompt validation.
Strategy 1: Mock the Azure OpenAI Client
Create a MockAzureOpenAI class that mirrors the real API surface but returns deterministic responses:
# pharmabot/testing/mock_openai.py
from dataclasses import dataclass


@dataclass
class MockChoice:
    message: "MockMessage"
    finish_reason: str = "stop"


@dataclass
class MockMessage:
    content: str
    role: str = "assistant"


@dataclass
class MockUsage:
    prompt_tokens: int = 100
    completion_tokens: int = 50
    total_tokens: int = 150


@dataclass
class MockCompletion:
    choices: list[MockChoice]
    usage: MockUsage
    model: str = "gpt-4o"


class MockChatCompletions:
    def __init__(self, responses: dict[str, str] | None = None):
        self._responses = responses or {}
        self._default = "This is a mock response for testing."

    async def create(self, messages: list, model: str = "gpt-4o", **kwargs):
        # Find the user message to match against stored responses
        user_msg = next(
            (m["content"] for m in messages if m["role"] == "user"),
            "",
        )
        # Look for a matching canned response, fall back to the default
        response_text = self._default
        for key, value in self._responses.items():
            if key.lower() in user_msg.lower():
                response_text = value
                break
        return MockCompletion(
            choices=[MockChoice(message=MockMessage(content=response_text))],
            usage=MockUsage(),
        )


class MockAzureOpenAI:
    def __init__(self, responses: dict[str, str] | None = None):
        # Mirror the client.chat.completions.create(...) call path of the real SDK
        self.chat = type("Chat", (), {
            "completions": MockChatCompletions(responses)
        })()
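Used directly in a unit test, the substring matching returns the canned answer for any prompt that mentions the key. A minimal sketch (the test itself is illustrative and assumes pytest-asyncio):

import pytest
from pharmabot.testing.mock_openai import MockAzureOpenAI


@pytest.mark.asyncio
async def test_mock_returns_canned_answer():
    client = MockAzureOpenAI(responses={"ibuprofen": "Ibuprofen is an NSAID."})
    completion = await client.chat.completions.create(
        messages=[{"role": "user", "content": "Tell me about ibuprofen"}]
    )
    # The mock matched "ibuprofen" in the user message and returned the canned text
    assert completion.choices[0].message.content == "Ibuprofen is an NSAID."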
Strategy 2: Environment Flag Swap
Swap the real client for the mock based on an environment variable:
# pharmabot/dependencies.py
import os

from openai import AsyncAzureOpenAI

from pharmabot.testing.mock_openai import MockAzureOpenAI


def get_openai_client():
    if os.getenv("MOCK_AZURE") == "true":
        return MockAzureOpenAI(responses={
            "ibuprofen": "Ibuprofen is an NSAID used for pain and inflammation. Common side effects include stomach upset and headache.",
            "warfarin": "Warfarin is an anticoagulant. It should not be combined with ibuprofen due to increased bleeding risk.",
            "metformin": "Metformin is a biguanide used for type 2 diabetes. It works by reducing hepatic glucose production.",
        })
    return AsyncAzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version="2024-10-21",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    )
In FastAPI, use dependency injection:
from fastapi import Depends


async def get_client():
    return get_openai_client()


@app.post("/api/chat")
async def chat(request: ChatRequest, client=Depends(get_client)):
    ...
In tests, override the dependency:
# tests/conftest.py
import pytest
from fastapi.testclient import TestClient

from pharmabot.main import app
from pharmabot.testing.mock_openai import MockAzureOpenAI
from pharmabot.dependencies import get_client


@pytest.fixture
def mock_client():
    return MockAzureOpenAI(responses={
        "ibuprofen": "Ibuprofen is a pain reliever and fever reducer.",
    })


@pytest.fixture
def test_client(mock_client):
    app.dependency_overrides[get_client] = lambda: mock_client
    yield TestClient(app)
    app.dependency_overrides.clear()
Strategy 3: Fixture-Based Testing (Record & Replay)
For more realistic tests, record real LLM responses once and replay them:
# Using pytest-recording (VCR.py integration)
# pip install pytest-recording vcrpy

# tests/test_chat.py
import pytest


@pytest.mark.vcr()  # Records on first run, replays on subsequent runs
def test_ibuprofen_query(test_client):
    response = test_client.post("/api/chat", json={
        "message": "What are the side effects of ibuprofen?"
    })
    assert response.status_code == 200
    assert "side effects" in response.json()["answer"].lower()
The first time this test runs (with --record-mode=new_episodes, and with the real client wired in rather than the mock), it calls the real Azure OpenAI and saves the response to a .yaml cassette file in tests/cassettes/. Every subsequent run replays the cassette: no API calls, no cost.
# Record once
pytest tests/test_chat.py --record-mode=new_episodes

# Replay always (CI)
pytest tests/test_chat.py  # default: replays from cassette
Strategy 4: Contract Tests
Contract tests verify that your prompt produces the right format of response, not a specific exact answer:
# tests/test_contracts.py
import pytest

from pharmabot.agents import drug_info_agent, interaction_agent


@pytest.mark.asyncio
async def test_drug_info_response_format(mock_client):
    """Verify the agent returns the expected JSON structure."""
    result = await drug_info_agent.run(
        query="What is ibuprofen?",
        client=mock_client,
    )
    # Don't test the content; test the shape
    assert isinstance(result, dict)
    assert "drug_name" in result
    assert "indication" in result
    assert "side_effects" in result
    assert isinstance(result["side_effects"], list)


@pytest.mark.asyncio
async def test_interaction_checker_severity_field(mock_client):
    """Verify the interaction checker always includes severity."""
    result = await interaction_agent.run(
        drug_a="ibuprofen", drug_b="warfarin",
        client=mock_client,
    )
    assert "severity" in result
    assert result["severity"] in ("low", "medium", "high", "contraindicated")
Contract tests catch prompt regressions: if you change the prompt and the model stops returning severity, the test fails immediately.
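If the contract grows beyond a few fields, a schema model keeps it in one place. A sketch using Pydantic v2 (the DrugInfo model here is hypothetical, not part of pharmabot):

# tests/test_contracts.py
import pytest
from pydantic import BaseModel

from pharmabot.agents import drug_info_agent


class DrugInfo(BaseModel):
    drug_name: str
    indication: str
    side_effects: list[str]


@pytest.mark.asyncio
async def test_drug_info_matches_schema(mock_client):
    result = await drug_info_agent.run(query="What is ibuprofen?", client=mock_client)
    # model_validate raises ValidationError (failing the test) if a field is missing or mistyped
    DrugInfo.model_validate(result)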
What NOT to Mock
Not everything should be mocked:
| Component | Mock in CI? | Reason |
|---|---|---|
| Azure OpenAI | Yes | Expensive, slow, non-deterministic |
| Azure AI Search | Yes (or use a test index) | Slow, needs a provisioned resource |
| Redis | No (use real Redis in CI) | Fast, easy to spin up in Docker |
| PostgreSQL | No (use Testcontainers; see the sketch below) | Fast, ensures real DB behavior |
| Your business logic | Never | This IS what you're testing |
| Your RAG retriever | No | Test real retrieval logic |
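For the PostgreSQL row, a session-scoped Testcontainers fixture gives you real database behavior without a standing server. A minimal sketch (the fixture name is illustrative):

# tests/conftest.py
import pytest
from testcontainers.postgres import PostgresContainer


@pytest.fixture(scope="session")
def database_url():
    # Starts a throwaway Postgres 16 container for the whole test session
    with PostgresContainer("postgres:16-alpine") as pg:
        yield pg.get_connection_url()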
Running Tests in CI
In .github/workflows/deploy.yml:
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: pytest tests/ -v --tb=short -x
        env:
          MOCK_AZURE: "true"
          REDIS_URL: "redis://localhost:6379"
          DATABASE_URL: "postgresql://test:test@localhost:5432/testdb"
Checkpoint
Run your test suite with the mock flag:
MOCK_AZURE=true pytest tests/ -v
All tests should pass in under 10 seconds, with zero real API calls. Verify no real calls are made:
# Check network calls during tests
MOCK_AZURE=true pytest tests/ -v --capture=sys 2>&1 | grep "openai.azure.com"
# Should return no results
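For a stronger guarantee than grepping output, the pytest-socket plugin can refuse outbound network access during the run. A sketch of its conftest hook (note this blocks the local Redis and Postgres connections too, so pair it with the plugin's host allow-listing or apply it only to pure unit tests):

# tests/conftest.py
from pytest_socket import disable_socket


def pytest_runtest_setup():
    # Any test that opens a network socket fails loudly instead of silently calling Azure
    disable_socket()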