
Cost Optimization: Caching, Batching, Model Routing

Cut your LLM API costs by 60–80% using semantic caching, request batching, and intelligent model routing. Real techniques used in production AI services.

Asma Hafeez Khan · May 15, 2026 · 5 min read
LLMOpsCost OptimizationCachingAzure OpenAIProduction
Share:š•

The Cost Problem

GPT-4o at $5/million input tokens sounds cheap. But a RAG system with a 2,000-token system prompt, 800 tokens of retrieved context, and a 200-token user query burns 3,000 input tokens per request. At 100,000 requests/day:

100,000 requests × 3,000 tokens = 300M tokens/day
300M × $5 / 1M = $1,500/day = $45,000/month
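The arithmetic above can be wrapped in a small helper (a sketch — plug in your own traffic numbers and contract prices):

```python
def monthly_input_cost(requests_per_day: int,
                       tokens_per_request: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Input-token cost per month, before any optimization."""
    daily_tokens = requests_per_day * tokens_per_request
    return daily_tokens / 1_000_000 * usd_per_million_tokens * days

print(monthly_input_cost(100_000, 3_000, 5.0))  # → 45000.0
```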

Five techniques cut this dramatically: semantic caching, prompt prefix caching, embedding batching, model routing, and early termination.


Technique 1: Semantic Caching

The idea: if two users ask "What are ibuprofen side effects?" and "What side effects does ibuprofen have?", they're asking the same question. Cache the answer keyed by the query's embedding, not its exact string.

How It Works

User query → Embed query → Search cache (vector similarity)
     ↓                               ↓
If cache hit (score > 0.95):    If cache miss:
  Return cached response          → Call LLM
                                  → Store embedding + response in cache

Implementation with Redis + pgvector

Python
import hashlib
import numpy as np
from redis import Redis

redis = Redis.from_url("redis://localhost:6379")

async def semantic_cache_lookup(
    query: str,
    embedder,
    similarity_threshold: float = 0.95,
) -> str | None:
    # Embed the query
    query_embedding = await embedder.embed(query)

    # KNN search over cached embeddings (requires Redis Stack and an
    # index created with FT.CREATE on the "embedding" vector field)
    results = redis.execute_command(
        "FT.SEARCH", "idx:cache",
        "*=>[KNN 1 @embedding $vec AS dist]",
        "PARAMS", 2, "vec",
        np.array(query_embedding, dtype=np.float32).tobytes(),
        "SORTBY", "dist",
        "RETURN", 2, "dist", "response",
        "DIALECT", 2,
    )

    # Reply shape: [count, key, [field, value, field, value, ...]]
    if results and int(results[0]) > 0:
        fields = dict(zip(results[2][::2], results[2][1::2]))
        # With a COSINE index, "dist" is a distance: similarity = 1 - dist
        similarity = 1.0 - float(fields[b"dist"])
        if similarity >= similarity_threshold:
            return fields[b"response"].decode()

    return None

async def semantic_cache_store(
    query: str,
    response: str,
    embedder,
    ttl_seconds: int = 3600,
):
    query_embedding = await embedder.embed(query)
    cache_key = hashlib.sha256(query.encode()).hexdigest()[:16]

    # Store embedding + response as a hash the vector index can search
    redis.hset(f"cache:{cache_key}", mapping={
        "query": query,
        "response": response,
        "embedding": np.array(query_embedding, dtype=np.float32).tobytes(),
    })
    redis.expire(f"cache:{cache_key}", ttl_seconds)

# Usage in your LLM pipeline
async def cached_llm_call(query: str) -> str:
    # Check cache first
    cached = await semantic_cache_lookup(query, embedder)
    if cached:
        log.info("cache_hit", query_preview=query[:50])
        return cached
    
    # Call LLM
    response = await call_azure_openai(query)
    
    # Store in cache
    await semantic_cache_store(query, response, embedder)
    
    return response

Cache hit rates in production: FAQ-heavy apps (customer support, drug info lookups) see 40–70% hit rates. At a 60% hit rate, LLM token costs on those requests drop by 60% — minus the small cost of embedding each incoming query.
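To see what that buys, a back-of-envelope sketch (hypothetical helper; embedding overhead left as a parameter you fill in from your own bill):

```python
def effective_monthly_cost(base_cost_usd: float,
                           cache_hit_rate: float,
                           embedding_overhead_usd: float = 0.0) -> float:
    # Cache hits skip the LLM call entirely; every request still
    # pays the (tiny) cost of embedding the incoming query
    return base_cost_usd * (1 - cache_hit_rate) + embedding_overhead_usd

print(effective_monthly_cost(45_000, 0.60))  # ≈ 18000.0
```

At a 60% hit rate, the $45,000/month baseline drops to roughly $18,000 before any of the other techniques are applied.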


Technique 2: Prompt Caching

Azure OpenAI supports prompt prefix caching — if multiple requests share a common prefix (your system prompt), the prefix tokens are cached server-side and billed at 50% of normal input price.

Requirement: System prompt must be at least 1,024 tokens and appear at the start of the messages array.

Python
# Structure messages so the system prompt is always first
# and consistent (never vary it between requests)
messages = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT,  # This gets cached server-side
    },
    {
        "role": "user",
        "content": user_query,  # This is NOT cached (different each time)
    }
]

Result: If your system prompt is 2,000 tokens and every request reuses it, Azure caches those tokens. You pay 50% of $5/M = $2.50/M for those tokens.
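Worked through per request (a sketch assuming the 50% cached-token discount stated above; verify the exact discount for your deployment type):

```python
def input_cost_per_request(cached_prefix_tokens: int,
                           fresh_tokens: int,
                           usd_per_million: float = 5.0,
                           cached_discount: float = 0.5) -> float:
    # Cached prefix tokens are billed at a discount (assumed 50% here);
    # everything after the shared prefix is billed at full price
    cached = cached_prefix_tokens / 1e6 * usd_per_million * cached_discount
    fresh = fresh_tokens / 1e6 * usd_per_million
    return cached + fresh

# The article's 3,000-token request: 2,000-token shared system prompt
# plus 1,000 tokens of retrieved context and user query
print(input_cost_per_request(2_000, 1_000))  # ≈ 0.01 USD vs 0.015 at full price
```

That's a third off the input bill for a one-line change: keep the system prompt byte-identical and first in the messages array.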


Technique 3: Request Batching (Embeddings)

Embedding calls are cheap per token (about $0.02/million for text-embedding-3-small), but the per-call overhead adds up: embedding 100 documents one-by-one means 100 round trips. Batch them into one call:

Python
# āŒ Slow and expensive — 100 API calls
embeddings = []
for doc in documents:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=doc,
    )
    embeddings.append(response.data[0].embedding)

# ✅ Fast — 1 API call
response = await client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,  # Pass the whole list (max 2048 inputs)
)
embeddings = [item.embedding for item in response.data]

For bulk document ingestion (seeding a knowledge base), batching reduces time from hours to minutes.
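The API caps batch size (2,048 inputs per call, as noted in the snippet), so large corpora need chunking — a minimal sketch:

```python
from typing import Iterator, Sequence

def batches(items: Sequence[str],
            batch_size: int = 2048) -> Iterator[Sequence[str]]:
    # Yield successive slices no larger than the API's per-call input limit
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage sketch with the embeddings client from above:
# embeddings = []
# for chunk in batches(documents):
#     response = await client.embeddings.create(
#         model="text-embedding-3-small", input=list(chunk))
#     embeddings.extend(item.embedding for item in response.data)

print([len(c) for c in batches(["doc"] * 5000)])  # → [2048, 2048, 904]
```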


Technique 4: Model Routing

Not every query needs GPT-4o. Route simple queries to cheaper models:

Query comes in
      │
      ▼
Classify: is this complex? (requires reasoning, multi-step, code)
      │
  Yes │              No │
      ▼                 ▼
   GPT-4o           GPT-4o-mini
($5/M input)      ($0.15/M input)
                   33× cheaper

Implementation:

Python
import re

SIMPLE_QUERY_PATTERNS = [
    r"what is \w+",
    r"define \w+",
    r"side effects of \w+",
    r"dose of \w+",
]

def classify_query_complexity(query: str) -> str:
    query_lower = query.lower()
    
    # Simple pattern matching — can also use a tiny classifier
    for pattern in SIMPLE_QUERY_PATTERNS:
        if re.search(pattern, query_lower):
            return "simple"
    
    # Long queries or those with conjunctions are likely complex
    if len(query.split()) > 20:
        return "complex"
    if any(w in query_lower for w in ["compare", "difference", "explain why", "how does"]):
        return "complex"
    
    return "simple"

async def routed_llm_call(query: str, messages: list) -> str:
    complexity = classify_query_complexity(query)
    
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"
    
    log.info("model_selected", model=model, complexity=complexity)
    
    return await call_azure_openai(messages, model=model)

Typical savings: if 70% of queries are "simple" (FAQ-style lookups), routing them to gpt-4o-mini cuts input-token cost on those queries by ~97%, which works out to roughly a 65–70% reduction overall.
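The blended price is easy to sanity-check (illustrative helper using the list prices from the diagram above):

```python
def blended_input_price(simple_share: float,
                        cheap_usd_per_m: float = 0.15,
                        expensive_usd_per_m: float = 5.0) -> float:
    # Weighted average input price per million tokens after routing
    return (simple_share * cheap_usd_per_m
            + (1 - simple_share) * expensive_usd_per_m)

price = blended_input_price(0.70)   # ≈ 1.605 USD per 1M input tokens
saving = 1 - price / 5.0            # ≈ 0.68, i.e. ~68% overall
```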


Technique 5: Response Streaming + Early Termination

For use cases where users read and then stop (documentation lookup), allow early termination:

Python
async def stream_with_cancel(messages: list, request):
    # `request` is a FastAPI/Starlette Request; is_disconnected()
    # becomes true when the client drops the connection
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        if await request.is_disconnected():
            break  # User closed the browser tab — stop generating (and paying)

        if not chunk.choices:
            continue  # Some chunks carry no delta
        content = chunk.choices[0].delta.content or ""
        yield f"data: {content}\n\n"

If users read 30% of long responses and close the tab, you save ~70% of completion tokens on those requests.


Cost Optimization Summary

| Technique | Complexity | Typical saving |
|---|---|---|
| Semantic caching | Medium | 40–70% |
| Prompt prefix caching | Low | 25–50% on system prompt |
| Embedding batching | Low | No cost saving, but large speed improvement |
| Model routing | Medium | 40–70% on simple queries |
| Early termination | Medium | 20–40% on completion tokens |

Apply all five and you can cut costs from $45,000/month to under $10,000/month for the same request volume.


Checkpoint

Add a cost log field to every LLM call and run a week of traffic:

Python
log.info("llm_cost", 
    cost_usd=calculate_cost(response.usage),
    model=model,
    cache_hit=False,
)
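The `calculate_cost` helper is referenced but not defined above; a minimal sketch, assuming a hard-coded price table (hypothetical values — verify against your actual contract) and the `usage` object the OpenAI SDK returns, with its `prompt_tokens` and `completion_tokens` fields:

```python
from types import SimpleNamespace

# Hypothetical prices in USD per 1M tokens — check your invoice
PRICES = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(usage, model: str = "gpt-4o") -> float:
    p = PRICES[model]
    return (usage.prompt_tokens / 1e6 * p["input"]
            + usage.completion_tokens / 1e6 * p["output"])

# Quick sanity check with a stand-in usage object
demo_usage = SimpleNamespace(prompt_tokens=3_000, completion_tokens=500)
cost = calculate_cost(demo_usage)  # 0.015 + 0.0075 = 0.0225 USD
```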

After a week, query your logs:

KUSTO
customEvents
| where name == "llm_cost"
| summarize 
    total_cost = sum(todouble(customDimensions["cost_usd"])),
    cache_hits = countif(customDimensions["cache_hit"] == "True")
    by bin(timestamp, 1d)

This gives you your baseline. Then implement semantic caching and measure the drop.

Enjoyed this article?

Explore the AI Systems learning path for more.
