Testing LLM Services in CI: Mocks and Fixtures
Solve the core challenge of testing LLM services in CI — non-determinism, cost, and latency — using mock clients, VCR cassettes, fixture-based replay, and contract tests with pytest.
The Core Problem
Testing LLM services in CI is fundamentally different from testing a regular REST API. You face three problems that do not exist elsewhere:
Problem 1: Cost. Every real call to Azure OpenAI costs money. At scale (50 developers, 10 PRs per day, 20 tests per PR) that is hundreds of billable model calls every day just to run CI, before anyone re-runs a flaky job.
Problem 2: Non-determinism. GPT-4o does not return the same response twice. A test that asserts the reply equals an exact string will fail most of the time. LLM responses are probabilistic; your tests cannot be.
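For example (a contrived snippet with a hard-coded reply, just to show the difference):

reply = "Ibuprofen is an NSAID used to relieve pain and inflammation."  # one of many possible phrasings

# Brittle: an exact-string assertion fails as soon as the model rewords its answer
# assert reply == "Ibuprofen is an NSAID used for pain relief."

# Robust: assert a property of the reply instead of its exact wording
assert "nsaid" in reply.lower()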
Problem 3: Latency. A real Azure OpenAI call takes 2-10 seconds. A test suite with 30 LLM calls takes 60-300 seconds minimum. This kills the fast feedback loop that makes CI valuable.
The solution is a layered testing strategy: mocks for unit tests, fixture replay for integration tests, and contract tests for prompt validation.
Strategy 1: Mock the Azure OpenAI Client
Create a MockAzureOpenAI class that mirrors the real API surface but returns deterministic responses:
# pharmabot/testing/mock_openai.py
from dataclasses import dataclass


@dataclass
class MockChoice:
    message: "MockMessage"
    finish_reason: str = "stop"


@dataclass
class MockMessage:
    content: str
    role: str = "assistant"


@dataclass
class MockUsage:
    prompt_tokens: int = 100
    completion_tokens: int = 50
    total_tokens: int = 150


@dataclass
class MockCompletion:
    choices: list[MockChoice]
    usage: MockUsage
    model: str = "gpt-4o"


class MockChatCompletions:
    def __init__(self, responses: dict[str, str] | None = None):
        self._responses = responses or {}
        self._default = "This is a mock response for testing."

    async def create(self, messages: list, model: str = "gpt-4o", **kwargs):
        # Find the user message to match against stored responses
        user_msg = next(
            (m["content"] for m in messages if m["role"] == "user"),
            "",
        )
        # Look for a matching canned response, fall back to the default
        response_text = self._default
        for key, value in self._responses.items():
            if key.lower() in user_msg.lower():
                response_text = value
                break
        return MockCompletion(
            choices=[MockChoice(message=MockMessage(content=response_text))],
            usage=MockUsage(),
        )


class MockAzureOpenAI:
    def __init__(self, responses: dict[str, str] | None = None):
        # Mirror the client.chat.completions.create(...) call path of the real SDK
        self.chat = type("Chat", (), {
            "completions": MockChatCompletions(responses)
        })()
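Used directly in a unit test, the substring matching returns the canned answer for any prompt that mentions the key. A minimal sketch (the test itself is illustrative and assumes pytest-asyncio):

import pytest
from pharmabot.testing.mock_openai import MockAzureOpenAI


@pytest.mark.asyncio
async def test_mock_returns_canned_answer():
    client = MockAzureOpenAI(responses={"ibuprofen": "Ibuprofen is an NSAID."})
    completion = await client.chat.completions.create(
        messages=[{"role": "user", "content": "Tell me about ibuprofen"}]
    )
    # The mock matched "ibuprofen" in the user message and returned the canned text
    assert completion.choices[0].message.content == "Ibuprofen is an NSAID."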
Strategy 2: Environment Flag Swap
Swap the real client for the mock based on an environment variable:
# pharmabot/dependencies.py
import os

from openai import AsyncAzureOpenAI

from pharmabot.testing.mock_openai import MockAzureOpenAI


def get_openai_client():
    if os.getenv("MOCK_AZURE") == "true":
        return MockAzureOpenAI(responses={
            "ibuprofen": "Ibuprofen is an NSAID used for pain and inflammation. Common side effects include stomach upset and headache.",
            "warfarin": "Warfarin is an anticoagulant. It should not be combined with ibuprofen due to increased bleeding risk.",
            "metformin": "Metformin is a biguanide used for type 2 diabetes. It works by reducing hepatic glucose production.",
        })
    return AsyncAzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version="2024-10-21",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    )
In FastAPI, use dependency injection:
from fastapi import Depends


async def get_client():
    return get_openai_client()


@app.post("/api/chat")
async def chat(request: ChatRequest, client=Depends(get_client)):
    ...
In tests, override the dependency:
# tests/conftest.py
import pytest
from fastapi.testclient import TestClient

from pharmabot.main import app
from pharmabot.testing.mock_openai import MockAzureOpenAI
from pharmabot.dependencies import get_client


@pytest.fixture
def mock_client():
    return MockAzureOpenAI(responses={
        "ibuprofen": "Ibuprofen is a pain reliever and fever reducer.",
    })


@pytest.fixture
def test_client(mock_client):
    app.dependency_overrides[get_client] = lambda: mock_client
    yield TestClient(app)
    app.dependency_overrides.clear()
Strategy 3: Fixture-Based Testing (Record & Replay)
For more realistic tests, record real LLM responses once and replay them:
# Using pytest-recording (VCR.py integration)
# pip install pytest-recording vcrpy

# tests/test_chat.py
import pytest


@pytest.mark.vcr()  # Records on first run, replays on subsequent runs
def test_ibuprofen_query(test_client):
    response = test_client.post("/api/chat", json={
        "message": "What are the side effects of ibuprofen?"
    })
    assert response.status_code == 200
    assert "side effects" in response.json()["answer"].lower()
The first time this test runs (with --record-mode=new_episodes, and with the real client wired in rather than the mock), it calls the real Azure OpenAI and saves the response to a .yaml cassette file in tests/cassettes/. Every subsequent run replays the cassette: no API calls, no cost.
# Record once
pytest tests/test_chat.py --record-mode=new_episodes

# Replay always (CI)
pytest tests/test_chat.py  # default: replays from cassette
Strategy 4: Contract Tests
Contract tests verify that your prompt produces the right format of response, not a specific exact answer:
# tests/test_contracts.py
import pytest

from pharmabot.agents import drug_info_agent, interaction_agent


@pytest.mark.asyncio
async def test_drug_info_response_format(mock_client):
    """Verify the agent returns the expected JSON structure."""
    result = await drug_info_agent.run(
        query="What is ibuprofen?",
        client=mock_client,
    )
    # Don't test the content; test the shape
    assert isinstance(result, dict)
    assert "drug_name" in result
    assert "indication" in result
    assert "side_effects" in result
    assert isinstance(result["side_effects"], list)


@pytest.mark.asyncio
async def test_interaction_checker_severity_field(mock_client):
    """Verify the interaction checker always includes severity."""
    result = await interaction_agent.run(
        drug_a="ibuprofen", drug_b="warfarin",
        client=mock_client,
    )
    assert "severity" in result
    assert result["severity"] in ("low", "medium", "high", "contraindicated")
Contract tests catch prompt regressions: if you change the prompt and the model stops returning severity, the test fails immediately.
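If the contract grows beyond a few fields, a schema model keeps it in one place. A sketch using Pydantic v2 (the DrugInfo model here is hypothetical, not part of pharmabot):

# tests/test_contracts.py
import pytest
from pydantic import BaseModel

from pharmabot.agents import drug_info_agent


class DrugInfo(BaseModel):
    drug_name: str
    indication: str
    side_effects: list[str]


@pytest.mark.asyncio
async def test_drug_info_matches_schema(mock_client):
    result = await drug_info_agent.run(query="What is ibuprofen?", client=mock_client)
    # model_validate raises ValidationError (failing the test) if a field is missing or mistyped
    DrugInfo.model_validate(result)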
What NOT to Mock
Not everything should be mocked:
| Component | Mock in CI? | Reason |
|---|---|---|
| Azure OpenAI | Yes | Expensive, slow, non-deterministic |
| Azure AI Search | Yes (or use a test index) | Slow, needs a provisioned resource |
| Redis | No (use real Redis in CI) | Fast, easy to spin up in Docker |
| PostgreSQL | No (use Testcontainers; see the sketch below) | Fast, ensures real DB behavior |
| Your business logic | Never | This IS what you're testing |
| Your RAG retriever | No | Test real retrieval logic |
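For the PostgreSQL row, a session-scoped Testcontainers fixture gives you real database behavior without a standing server. A minimal sketch (the fixture name is illustrative):

# tests/conftest.py
import pytest
from testcontainers.postgres import PostgresContainer


@pytest.fixture(scope="session")
def database_url():
    # Starts a throwaway Postgres 16 container for the whole test session
    with PostgresContainer("postgres:16-alpine") as pg:
        yield pg.get_connection_url()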
Running Tests in CI
In .github/workflows/deploy.yml:
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: pytest tests/ -v --tb=short -x
        env:
          MOCK_AZURE: "true"
          REDIS_URL: "redis://localhost:6379"
          DATABASE_URL: "postgresql://test:test@localhost:5432/testdb"
Checkpoint
Run your test suite with the mock flag:
MOCK_AZURE=true pytest tests/ -v
All tests should pass in under 10 seconds, with zero real API calls. Verify no real calls are made:
# Check network calls during tests
MOCK_AZURE=true pytest tests/ -v --capture=sys 2>&1 | grep "openai.azure.com"
# Should return no results
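For a stronger guarantee than grepping output, the pytest-socket plugin can refuse outbound network access during the run. A sketch of its conftest hook (note this blocks the local Redis and Postgres connections too, so pair it with the plugin's host allow-listing or apply it only to pure unit tests):

# tests/conftest.py
from pytest_socket import disable_socket


def pytest_runtest_setup():
    # Any test that opens a network socket fails loudly instead of silently calling Azure
    disable_socket()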