Cost Optimization: Caching, Batching, Model Routing
Cut your LLM API costs by 60–80% using semantic caching, request batching, and intelligent model routing. Real techniques used in production AI services.
The Cost Problem
GPT-4o at $5/million input tokens sounds cheap. But a RAG system with a 2,000-token system prompt, an 800-token retrieved context, and a 200-token user query comes to 3,000 input tokens per request. At 100,000 requests/day:
100,000 requests × 3,000 tokens = 300M tokens/day
300M tokens × $5 / 1M tokens = $1,500/day = $45,000/month

Three techniques cut this dramatically: semantic caching, request batching, and model routing.
Technique 1: Semantic Caching
The idea: If two users ask "What are ibuprofen side effects?" and "What side effects does ibuprofen have?", they're the same question. Cache the answer keyed by the query's embedding, not by the exact string.
How It Works
User query → Embed query → Search cache (vector similarity)
         ↓                              ↓
If cache hit (score > 0.95):      If cache miss:
  Return cached response            → Call LLM
                                    → Store embedding + response in cache

Implementation with Redis Stack (vector search)
import hashlib
import numpy as np
from redis import Redis

redis = Redis.from_url("redis://localhost:6379")

async def semantic_cache_lookup(
    query: str,
    embedder,
    similarity_threshold: float = 0.95,
) -> str | None:
    # Embed the query
    query_embedding = await embedder.embed(query)
    # Search Redis for similar cached queries
    # (vector KNN search via Redis Stack / RediSearch)
    results = redis.execute_command(
        "FT.SEARCH", "idx:cache",
        "*=>[KNN 1 @embedding $vec AS score]",
        "PARAMS", 2, "vec", np.array(query_embedding, dtype=np.float32).tobytes(),
        "RETURN", 2, "score", "response",
        "SORTBY", "score",
        "LIMIT", 0, 1,
        "DIALECT", 2,
    )
    # Raw reply shape: [total_hits, key, [field, value, field, value, ...]]
    if results and int(results[0]) > 0:
        fields = results[2]
        doc = {fields[i].decode(): fields[i + 1].decode() for i in range(0, len(fields), 2)}
        # The index uses cosine distance, so convert to similarity before comparing
        similarity = 1.0 - float(doc["score"])
        if similarity >= similarity_threshold:
            return doc["response"]
    return None
async def semantic_cache_store(
    query: str,
    response: str,
    embedder,
    ttl_seconds: int = 3600,
):
    query_embedding = await embedder.embed(query)
    cache_key = hashlib.sha256(query.encode()).hexdigest()[:16]
    # Store embedding + response (float32 bytes must match the index definition)
    redis.hset(f"cache:{cache_key}", mapping={
        "query": query,
        "response": response,
        "embedding": np.array(query_embedding, dtype=np.float32).tobytes(),
    })
    redis.expire(f"cache:{cache_key}", ttl_seconds)
# Usage in your LLM pipeline
async def cached_llm_call(query: str) -> str:
    # Check cache first
    cached = await semantic_cache_lookup(query, embedder)
    if cached:
        log.info("cache_hit", query_preview=query[:50])
        return cached
    # Call LLM
    response = await call_azure_openai(query)
    # Store in cache
    await semantic_cache_store(query, response, embedder)
    return response

Cache hit rates in production: FAQ-heavy apps (customer support, drug info lookups) see 40–70% cache hit rates. At a 60% cache hit rate, token costs drop by roughly 60%.
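The lookup above assumes a RediSearch index named idx:cache over the cache:* hashes. A minimal one-time setup sketch, assuming Redis Stack, cosine distance, and 1,536-dimension embeddings (the text-embedding-3-small size); adjust DIM to whatever embedding model you use:

# One-time index creation (sketch) – run once before caching anything
redis.execute_command(
    "FT.CREATE", "idx:cache",
    "ON", "HASH",
    "PREFIX", 1, "cache:",
    "SCHEMA",
    "response", "TEXT",
    "embedding", "VECTOR", "FLAT", 6,
    "TYPE", "FLOAT32",
    "DIM", 1536,
    "DISTANCE_METRIC", "COSINE",
)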
Technique 2: Prompt Caching
Azure OpenAI supports prompt prefix caching: if multiple requests share a common prefix (your system prompt), the prefix tokens are cached server-side and billed at 50% of the normal input price.
Requirement: System prompt must be at least 1,024 tokens and appear at the start of the messages array.
# Structure messages so the system prompt is always first
# and consistent (never vary it between requests)
messages = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT,  # This gets cached server-side
    },
    {
        "role": "user",
        "content": user_query,  # This is NOT cached (different each time)
    },
]

Result: If your system prompt is 2,000 tokens and every request reuses it, Azure caches those tokens. You pay 50% of $5/M = $2.50/M for those tokens.
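You can check whether prefix caching is actually kicking in by inspecting the usage block on each response. A minimal sketch, assuming a recent openai SDK / API version that reports prompt_tokens_details:

response = await client.chat.completions.create(model="gpt-4o", messages=messages)
usage = response.usage
# cached_tokens counts the prefix tokens that were served from the server-side cache
cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
log.info("prompt_cache", prompt_tokens=usage.prompt_tokens, cached_tokens=cached)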
Technique 3: Request Batching (Embeddings)
Embedding API calls cost $0.10/million tokens for text-embedding-3-small. If you're embedding 100 documents one-by-one, you make 100 API calls. Batch them into one:
# ❌ Slow and expensive – 100 API calls
embeddings = []
for doc in documents:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=doc,
    )
    embeddings.append(response.data[0].embedding)

# ✅ Fast – 1 API call
response = await client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,  # Pass the whole list (max 2048 inputs)
)
embeddings = [item.embedding for item in response.data]

For bulk document ingestion (seeding a knowledge base), batching reduces time from hours to minutes.
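Since one call accepts at most 2,048 inputs, a large corpus still needs to be split into batches. A minimal sketch of that chunking, using the same client and model as above:

async def embed_in_batches(documents: list[str], batch_size: int = 2048) -> list[list[float]]:
    # Split the corpus into API-sized batches and embed each batch in one call
    embeddings: list[list[float]] = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings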
Technique 4: Model Routing
Not every query needs GPT-4o. Route simple queries to cheaper models:
        Query comes in
              │
              ▼
Classify: is this complex? (requires reasoning, multi-step, code)
              │
        Yes ──┴── No
        ▼          ▼
      GPT-4o       GPT-4o-mini
  ($5/M input)     ($0.15/M input)
                   33× cheaper

Implementation:
import re

SIMPLE_QUERY_PATTERNS = [
    r"what is \w+",
    r"define \w+",
    r"side effects of \w+",
    r"dose of \w+",
]

def classify_query_complexity(query: str) -> str:
    query_lower = query.lower()
    # Simple pattern matching – can also use a tiny classifier
    for pattern in SIMPLE_QUERY_PATTERNS:
        if re.search(pattern, query_lower):
            return "simple"
    # Long queries or those with conjunctions are likely complex
    if len(query.split()) > 20:
        return "complex"
    if any(w in query_lower for w in ["compare", "difference", "explain why", "how does"]):
        return "complex"
    return "simple"

async def routed_llm_call(query: str, messages: list) -> str:
    complexity = classify_query_complexity(query)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"
    log.info("model_selected", model=model, complexity=complexity)
    return await call_azure_openai(messages, model=model)

Typical savings: if 70% of queries are "simple" (FAQ-style lookups), model routing cuts overall token costs by roughly 60%.
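The comment above mentions a tiny classifier as an alternative to regexes. One option is to let GPT-4o-mini itself do the triage; a rough sketch, assuming the same async OpenAI client used elsewhere in this post:

async def classify_with_mini(query: str) -> str:
    # Ask the cheapest model to label the query; a one-word answer keeps it to a few tokens
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the user query as 'simple' or 'complex'. Reply with one word."},
            {"role": "user", "content": query},
        ],
        max_tokens=3,
        temperature=0,
    )
    label = (response.choices[0].message.content or "").strip().lower()
    return "complex" if label.startswith("complex") else "simple"

This adds a small per-query cost and some latency, so it only pays off when the routing decision saves more than the classifier call itself.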
Technique 5: Response Streaming + Early Termination
For use cases where users read and then stop (documentation lookup), allow early termination:
async def stream_with_cancel(messages: list, request):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        if await request.is_disconnected():
            break  # User closed the browser tab – stop generating
        if not chunk.choices:
            continue  # Azure can emit metadata-only chunks with no choices
        content = chunk.choices[0].delta.content or ""
        yield f"data: {content}\n\n"

If users read 30% of long responses and close the tab, you save ~70% of completion tokens on those requests.
Cost Optimization Summary
| Technique | Complexity | Typical Saving |
|---|---|---|
| Semantic caching | Medium | 40–70% |
| Prompt prefix caching | Low | 25–50% on system prompt tokens |
| Embedding batching | Low | No cost saving, but speed improvement |
| Model routing | Medium | 40–70% on simple queries |
| Early termination | Medium | 20–40% of completion tokens |
Apply all five and you can cut costs from $45,000/month to under $10,000/month for the same request volume.
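Put together, the techniques compose into a single request path: check the semantic cache, route by complexity, make the (prefix-cached) call, then populate the cache. A rough sketch combining the helpers defined above:

async def optimized_llm_call(query: str) -> str:
    # 1. Semantic cache – skip the LLM entirely on a hit
    cached = await semantic_cache_lookup(query, embedder)
    if cached:
        return cached

    # 2. Model routing – send simple queries to the cheaper model
    complexity = classify_query_complexity(query)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"

    # 3. Prompt prefix caching – identical system prompt first, user query last
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
    response = await call_azure_openai(messages, model=model)

    # 4. Store the result so the next similar query is a cache hit
    await semantic_cache_store(query, response, embedder)
    return response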
Checkpoint
Add a cost log field to every LLM call and run a week of traffic:
log.info("llm_cost",
cost_usd=calculate_cost(response.usage),
model=model,
cache_hit=False,
)After a week, query your logs:
customEvents
| where name == "llm_cost"
| summarize
total_cost = sum(todouble(customDimensions["cost_usd"])),
cache_hits = countif(customDimensions["cache_hit"] == "True")
    by bin(timestamp, 1d)

This gives you your baseline. Then implement semantic caching and measure the drop.
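The log line above assumes a calculate_cost helper. A minimal sketch using the input prices quoted in this post; the output prices and the exact figures are assumptions you should replace with your own contract rates:

# Hypothetical per-million-token prices – replace with your actual rates
PRICES = {
    # model: (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (5.00, 15.00),      # input price from this post; output price assumed
    "gpt-4o-mini": (0.15, 0.60),  # input price from this post; output price assumed
}

def calculate_cost(usage, model: str = "gpt-4o") -> float:
    # usage is the response.usage object returned by the chat completions API
    input_price, output_price = PRICES[model]
    return (
        usage.prompt_tokens * input_price / 1_000_000
        + usage.completion_tokens * output_price / 1_000_000
    )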
Found this helpful?
Have a question, correction, or just found this helpful? Leave a note below.