.NET & C# Development · Lesson 197 of 229

Case Study: AI Costs Went From $200 to $8,000 in One Month

System: SaaS document analysis platform — 200 customers, AI-powered contract review
Stack: ASP.NET Core 9, Microsoft.Extensions.AI, OpenAI gpt-4o, Azure
Incident duration: 30 days (slow escalation, not a sudden outage)
Root causes: Five compounding mistakes
Resolution cost: 2 days of engineering, net saving $7,200/month

The Billing Spike

Month 1:  $210  (beta — small user count)
Month 2:  $430  (launch — 50 customers)
Month 3:  $1,800 (growth — 150 customers)  ← expected
Month 4:  $8,400 (200 customers)           ← NOT expected. 2.4x more customers, 4.7x more cost.

The month-4 bill triggered an emergency review. The team expected ~$1,200 based on customer count.

Root Cause Investigation

Step 1: Get Per-Request Cost Data

// The team had no per-request cost tracking — first they added it
public class TokenCostTrackingMiddleware(IChatClient inner, IMetrics metrics)
    : DelegatingChatClient(inner)
{
    // gpt-4o pricing
    private const decimal InputCostPer1M  = 2.50m;
    private const decimal OutputCostPer1M = 10.00m;

    public override async Task<ChatCompletion> CompleteAsync(
        IList<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken ct = default)
    {
        var response = await base.CompleteAsync(messages, options, ct);

        if (response.Usage is { } usage)
        {
            var inputTokens  = usage.InputTokenCount  ?? 0;
            var outputTokens = usage.OutputTokenCount ?? 0;

            var cost = (inputTokens  * InputCostPer1M  / 1_000_000m)
                     + (outputTokens * OutputCostPer1M / 1_000_000m);

            metrics.RecordGauge("ai.cost_usd",           (double)cost);
            metrics.RecordGauge("ai.tokens.input",        inputTokens);
            metrics.RecordGauge("ai.tokens.output",       outputTokens);
            metrics.RecordGauge("ai.tokens.total",        inputTokens + outputTokens);
        }

        return response;
    }
}

After one day of data, five problems were visible.

Root Cause 1: Unbounded System Prompt (40% of cost)

// What the team thought they had:
var systemPrompt = "You are a contract analysis expert. Analyse this document.";

// What was actually in the database (grown over months of "improvements"):
var systemPrompt = await db.SystemPrompts.OrderByDescending(p => p.Version).FirstAsync();
// systemPrompt.Content = 4,200 tokens of instructions, examples, and legal definitions
// Added by different team members over time — nobody counted tokens

System prompt tokens per request: 4,200
Typical analysis input tokens:    2,000
Total input: 6,200 tokens vs expected 2,000

Cost per request:
  Expected:  6,200 × $2.50 / 1M = $0.016
  Actual:    6,200 × $2.50 / 1M = was correct — but 4,200 tokens of that was waste

Fix: Audit the system prompt. Remove redundant instructions, examples already in the model's training, and legal definitions the model already knows. Reduced from 4,200 to 380 tokens.

// Added token count monitoring to the prompt management UI
public class SystemPromptService(IChatClient chatClient)
{
    public int EstimateTokenCount(string text)
        => text.Length / 4;   // rough approximation; use tiktoken for exact count

    // Block saving prompts over 500 tokens
    public async Task<SavePromptResult> SaveAsync(SystemPrompt prompt)
    {
        var estimated = EstimateTokenCount(prompt.Content);
        if (estimated > 500)
            return SavePromptResult.TooLong(estimated);

        await db.SystemPrompts.AddAsync(prompt);
        await db.SaveChangesAsync();
        return SavePromptResult.Saved();
    }
}

Root Cause 2: No Response Caching (35% of cost)

Analysis requests on the same document:
  Customer opens contract → analysis runs
  Customer refreshes page → analysis runs again (identical prompt)
  Customer shares link with colleague → analysis runs again
  Customer visits next day → analysis runs again

For a 50-page contract: same document analysed 4-8 times per customer
Cache hit rate: 0% (no cache existed)

// Fix: exact-match cache + semantic cache
builder.Services.AddChatClient(services =>
    new OpenAIClient(apiKey).AsChatClient("gpt-4o"))
    .UseDistributedCache(opts =>
    {
        opts.ModelId = true;   // include model in cache key
    });

// For document analysis: cache keyed on document hash, not the full prompt
public class DocumentAnalysisService(IChatClient chatClient, IDistributedCache cache)
{
    public async Task<AnalysisResult> AnalyseAsync(Document doc, CancellationToken ct)
    {
        // Stable cache key: document hash + analysis version
        var cacheKey = $"analysis:{doc.ContentHash}:v3";

        var cached = await cache.GetStringAsync(cacheKey, ct);
        if (cached is not null)
            return JsonSerializer.Deserialize<AnalysisResult>(cached)!;

        // Cache miss — run analysis
        var result = await RunAnalysisAsync(doc, ct);

        // Cache for 7 days — document content doesn't change
        await cache.SetStringAsync(cacheKey,
            JsonSerializer.Serialize(result),
            new DistributedCacheEntryOptions
            {
                AbsoluteExpirationRelativeToNow = TimeSpan.FromDays(7)
            }, ct);

        return result;
    }
}

Root Cause 3: Sending the Entire Document Every Request (15% of cost)

// Before: full document text in every message
var messages = new List<ChatMessage>
{
    new(ChatRole.System, systemPrompt),
    new(ChatRole.User,   $"Analyse this contract:\n\n{document.FullText}"),  // 8,000 tokens
};

// After: chunk and send only relevant sections
public class ChunkedDocumentAnalyser(IEmbeddingGenerator<string, Embedding<float>> embedder)
{
    public async Task<string> GetRelevantContextAsync(
        Document doc,
        string question,
        CancellationToken ct)
    {
        // Use pgvector to retrieve only the relevant chunk(s)
        // Average: 3 chunks × 300 tokens = 900 tokens instead of 8,000
        var query   = await embedder.GenerateAsync([question], cancellationToken: ct);
        var chunks  = await vectorStore.SearchAsync(doc.Id, query[0].Vector, topK: 3, ct);
        return string.Join("\n\n---\n\n", chunks.Select(c => c.Text));
    }
}

Root Cause 4: Using gpt-4o for Everything (7% of cost)

// Before: gpt-4o for every task including simple classification
var category = await chatClient.CompleteAsync([
    new(ChatRole.User, $"Classify this contract type (NDA/Employment/Vendor/Other): {title}")
]);
// gpt-4o: $0.004 per call for 5-word classification

// After: route simple tasks to gpt-4o-mini
builder.Services.AddKeyedSingleton<IChatClient>("mini",
    new OpenAIClient(apiKey).AsChatClient("gpt-4o-mini"));

builder.Services.AddKeyedSingleton<IChatClient>("full",
    new OpenAIClient(apiKey).AsChatClient("gpt-4o"));

// Classification → mini (150x cheaper)
public class ContractClassifier([FromKeyedServices("mini")] IChatClient mini)
{
    public async Task<string> ClassifyAsync(string title, CancellationToken ct)
    {
        var response = await mini.CompleteAsync([
            new(ChatRole.System, "Classify the contract. Reply with one word: NDA, Employment, Vendor, or Other."),
            new(ChatRole.User, title)
        ], cancellationToken: ct);
        return response.Message.Text?.Trim() ?? "Other";
    }
}

// Deep analysis → full gpt-4o
public class ContractAnalyser([FromKeyedServices("full")] IChatClient full) { ... }

Root Cause 5: No MaxTokens Limit on Output (3% of cost)

// Before: no output limit — model wrote 2,000-token analyses when 400 sufficed
var response = await chatClient.CompleteAsync(messages);

// After: enforce output budget
var options = new ChatOptions
{
    MaxOutputTokens = 600,   // 95th percentile of useful analysis length
};
var response = await chatClient.CompleteAsync(messages, options, ct);

Results After All Five Fixes

                          Before      After      Saving
System prompt tokens       4,200        380        91%
Cache hit rate               0%         67%         —
Avg input tokens/request   12,400      1,800        85%
Model routing               100% 4o    20% 4o       —
Max output tokens          unlimited    600         —

Monthly cost:
  Month 4 (before):     $8,400
  Month 5 (after):        $890      89% reduction
  Month 5 projected
    without fix:         ~$11,200   (customer growth continued)

Engineering time to fix:  2 days
Monthly saving:           $7,510
ROI:                      6 hours of work = $90K/year saved

Monitoring That Now Exists

// Dashboard alerts:
//   cost/request > $0.05  → investigate
//   cache hit rate < 50%  → check caching
//   input tokens/request > 3,000 → check prompt size
//   output tokens/request > 800  → check MaxOutputTokens

// Weekly cost report email to tech lead
public class WeeklyCostReportJob(IMetrics metrics, IEmailService email) : IHostedService
{
    public async Task StartAsync(CancellationToken ct)
    {
        var weekly   = await metrics.GetSumAsync("ai.cost_usd", TimeSpan.FromDays(7));
        var monthly  = weekly * 4.3m;
        var perUser  = weekly / await GetActiveUserCountAsync(ct);

        await email.SendAsync(new EmailMessage(
            To:      "tech-lead@company.com",
            Subject: $"AI Cost Report — ${weekly:F0} this week (${monthly:F0}/mo projected)",
            Body:    $"Per active user: ${perUser:F2}/week\nCache hit rate: {await GetCacheHitRateAsync(ct):P0}"));
    }
}

The Five Rules

1. Monitor cost per request from day one, not when the bill arrives.

2. Prompt engineering is cost engineering.
   Every 1,000 tokens added to the system prompt = $2.50 per 1M requests.
   Count your tokens.

3. Cache is mandatory, not optional.
   Identical prompts should never hit the API twice.
   Use exact-match cache for the same document, same question.

4. Route to the cheapest model that achieves acceptable quality.
   Classification and extraction: gpt-4o-mini (150x cheaper than gpt-4o).
   Deep reasoning: gpt-4o.
   Complex multi-step: o3 (use sparingly).

5. Set MaxOutputTokens.
   Left uncapped, models will write essays when you need bullet points.
   Measure the 95th percentile of useful output length — cap there.

Case Study: The N+1 Query That Took Down Production

Next Lesson

Case Study: Migrating a 12-Year-Old Monolith Without Downtime