.NET & C# Development · Lesson 197 of 229
Case Study: AI Costs Went From $200 to $8,000 in One Month
Case Study: AI Costs Went From $200 to $8,000 in One Month
System: SaaS document analysis platform — 200 customers, AI-powered contract review
Stack: ASP.NET Core 9, Microsoft.Extensions.AI, OpenAI gpt-4o, Azure
Incident duration: 30 days (slow escalation, not a sudden outage)
Root causes: Five compounding mistakes
Resolution cost: 2 days of engineering, net saving $7,200/month
The Billing Spike
Month 1: $210 (beta — small user count)
Month 2: $430 (launch — 50 customers)
Month 3: $1,800 (growth — 150 customers) ← expected
Month 4: $8,400 (200 customers) ← NOT expected. 2.4x more customers, 4.7x more cost.The month-4 bill triggered an emergency review. The team expected ~$1,200 based on customer count.
Root Cause Investigation
Step 1: Get Per-Request Cost Data
C#
// The team had no per-request cost tracking — first they added it
public class TokenCostTrackingMiddleware(IChatClient inner, IMetrics metrics)
: DelegatingChatClient(inner)
{
// gpt-4o pricing
private const decimal InputCostPer1M = 2.50m;
private const decimal OutputCostPer1M = 10.00m;
public override async Task<ChatCompletion> CompleteAsync(
IList<ChatMessage> messages,
ChatOptions? options = null,
CancellationToken ct = default)
{
var response = await base.CompleteAsync(messages, options, ct);
if (response.Usage is { } usage)
{
var inputTokens = usage.InputTokenCount ?? 0;
var outputTokens = usage.OutputTokenCount ?? 0;
var cost = (inputTokens * InputCostPer1M / 1_000_000m)
+ (outputTokens * OutputCostPer1M / 1_000_000m);
metrics.RecordGauge("ai.cost_usd", (double)cost);
metrics.RecordGauge("ai.tokens.input", inputTokens);
metrics.RecordGauge("ai.tokens.output", outputTokens);
metrics.RecordGauge("ai.tokens.total", inputTokens + outputTokens);
}
return response;
}
}After one day of data, five problems were visible.
Root Cause 1: Unbounded System Prompt (40% of cost)
C#
// What the team thought they had:
var systemPrompt = "You are a contract analysis expert. Analyse this document.";
// What was actually in the database (grown over months of "improvements"):
var systemPrompt = await db.SystemPrompts.OrderByDescending(p => p.Version).FirstAsync();
// systemPrompt.Content = 4,200 tokens of instructions, examples, and legal definitions
// Added by different team members over time — nobody counted tokensSystem prompt tokens per request: 4,200
Typical analysis input tokens: 2,000
Total input: 6,200 tokens vs expected 2,000
Cost per request:
Expected: 6,200 × $2.50 / 1M = $0.016
Actual: 6,200 × $2.50 / 1M = was correct — but 4,200 tokens of that was wasteFix: Audit the system prompt. Remove redundant instructions, examples already in the model's training, and legal definitions the model already knows. Reduced from 4,200 to 380 tokens.
C#
// Added token count monitoring to the prompt management UI
public class SystemPromptService(IChatClient chatClient)
{
public int EstimateTokenCount(string text)
=> text.Length / 4; // rough approximation; use tiktoken for exact count
// Block saving prompts over 500 tokens
public async Task<SavePromptResult> SaveAsync(SystemPrompt prompt)
{
var estimated = EstimateTokenCount(prompt.Content);
if (estimated > 500)
return SavePromptResult.TooLong(estimated);
await db.SystemPrompts.AddAsync(prompt);
await db.SaveChangesAsync();
return SavePromptResult.Saved();
}
}Root Cause 2: No Response Caching (35% of cost)
Analysis requests on the same document:
Customer opens contract → analysis runs
Customer refreshes page → analysis runs again (identical prompt)
Customer shares link with colleague → analysis runs again
Customer visits next day → analysis runs again
For a 50-page contract: same document analysed 4-8 times per customer
Cache hit rate: 0% (no cache existed)C#
// Fix: exact-match cache + semantic cache
builder.Services.AddChatClient(services =>
new OpenAIClient(apiKey).AsChatClient("gpt-4o"))
.UseDistributedCache(opts =>
{
opts.ModelId = true; // include model in cache key
});
// For document analysis: cache keyed on document hash, not the full prompt
public class DocumentAnalysisService(IChatClient chatClient, IDistributedCache cache)
{
public async Task<AnalysisResult> AnalyseAsync(Document doc, CancellationToken ct)
{
// Stable cache key: document hash + analysis version
var cacheKey = $"analysis:{doc.ContentHash}:v3";
var cached = await cache.GetStringAsync(cacheKey, ct);
if (cached is not null)
return JsonSerializer.Deserialize<AnalysisResult>(cached)!;
// Cache miss — run analysis
var result = await RunAnalysisAsync(doc, ct);
// Cache for 7 days — document content doesn't change
await cache.SetStringAsync(cacheKey,
JsonSerializer.Serialize(result),
new DistributedCacheEntryOptions
{
AbsoluteExpirationRelativeToNow = TimeSpan.FromDays(7)
}, ct);
return result;
}
}Root Cause 3: Sending the Entire Document Every Request (15% of cost)
C#
// Before: full document text in every message
var messages = new List<ChatMessage>
{
new(ChatRole.System, systemPrompt),
new(ChatRole.User, $"Analyse this contract:\n\n{document.FullText}"), // 8,000 tokens
};
// After: chunk and send only relevant sections
public class ChunkedDocumentAnalyser(IEmbeddingGenerator<string, Embedding<float>> embedder)
{
public async Task<string> GetRelevantContextAsync(
Document doc,
string question,
CancellationToken ct)
{
// Use pgvector to retrieve only the relevant chunk(s)
// Average: 3 chunks × 300 tokens = 900 tokens instead of 8,000
var query = await embedder.GenerateAsync([question], cancellationToken: ct);
var chunks = await vectorStore.SearchAsync(doc.Id, query[0].Vector, topK: 3, ct);
return string.Join("\n\n---\n\n", chunks.Select(c => c.Text));
}
}Root Cause 4: Using gpt-4o for Everything (7% of cost)
C#
// Before: gpt-4o for every task including simple classification
var category = await chatClient.CompleteAsync([
new(ChatRole.User, $"Classify this contract type (NDA/Employment/Vendor/Other): {title}")
]);
// gpt-4o: $0.004 per call for 5-word classification
// After: route simple tasks to gpt-4o-mini
builder.Services.AddKeyedSingleton<IChatClient>("mini",
new OpenAIClient(apiKey).AsChatClient("gpt-4o-mini"));
builder.Services.AddKeyedSingleton<IChatClient>("full",
new OpenAIClient(apiKey).AsChatClient("gpt-4o"));
// Classification → mini (150x cheaper)
public class ContractClassifier([FromKeyedServices("mini")] IChatClient mini)
{
public async Task<string> ClassifyAsync(string title, CancellationToken ct)
{
var response = await mini.CompleteAsync([
new(ChatRole.System, "Classify the contract. Reply with one word: NDA, Employment, Vendor, or Other."),
new(ChatRole.User, title)
], cancellationToken: ct);
return response.Message.Text?.Trim() ?? "Other";
}
}
// Deep analysis → full gpt-4o
public class ContractAnalyser([FromKeyedServices("full")] IChatClient full) { ... }Root Cause 5: No MaxTokens Limit on Output (3% of cost)
C#
// Before: no output limit — model wrote 2,000-token analyses when 400 sufficed
var response = await chatClient.CompleteAsync(messages);
// After: enforce output budget
var options = new ChatOptions
{
MaxOutputTokens = 600, // 95th percentile of useful analysis length
};
var response = await chatClient.CompleteAsync(messages, options, ct);Results After All Five Fixes
Before After Saving
System prompt tokens 4,200 380 91%
Cache hit rate 0% 67% —
Avg input tokens/request 12,400 1,800 85%
Model routing 100% 4o 20% 4o —
Max output tokens unlimited 600 —
Monthly cost:
Month 4 (before): $8,400
Month 5 (after): $890 89% reduction
Month 5 projected
without fix: ~$11,200 (customer growth continued)
Engineering time to fix: 2 days
Monthly saving: $7,510
ROI: 6 hours of work = $90K/year savedMonitoring That Now Exists
C#
// Dashboard alerts:
// cost/request > $0.05 → investigate
// cache hit rate < 50% → check caching
// input tokens/request > 3,000 → check prompt size
// output tokens/request > 800 → check MaxOutputTokens
// Weekly cost report email to tech lead
public class WeeklyCostReportJob(IMetrics metrics, IEmailService email) : IHostedService
{
public async Task StartAsync(CancellationToken ct)
{
var weekly = await metrics.GetSumAsync("ai.cost_usd", TimeSpan.FromDays(7));
var monthly = weekly * 4.3m;
var perUser = weekly / await GetActiveUserCountAsync(ct);
await email.SendAsync(new EmailMessage(
To: "tech-lead@company.com",
Subject: $"AI Cost Report — ${weekly:F0} this week (${monthly:F0}/mo projected)",
Body: $"Per active user: ${perUser:F2}/week\nCache hit rate: {await GetCacheHitRateAsync(ct):P0}"));
}
}The Five Rules
1. Monitor cost per request from day one, not when the bill arrives.
2. Prompt engineering is cost engineering.
Every 1,000 tokens added to the system prompt = $2.50 per 1M requests.
Count your tokens.
3. Cache is mandatory, not optional.
Identical prompts should never hit the API twice.
Use exact-match cache for the same document, same question.
4. Route to the cheapest model that achieves acceptable quality.
Classification and extraction: gpt-4o-mini (150x cheaper than gpt-4o).
Deep reasoning: gpt-4o.
Complex multi-step: o3 (use sparingly).
5. Set MaxOutputTokens.
Left uncapped, models will write essays when you need bullet points.
Measure the 95th percentile of useful output length — cap there.