Why Prompts Matter in Production
Why prompt quality has outsized impact on LLM output quality, reliability, cost, and safety — and why prompt engineering is a core engineering discipline for AI systems.
The Prompt Is the Code
In a traditional software system, the application logic lives in code — predictable, versioned, testable. In an LLM-powered system, a large part of the application logic lives in the prompt:
Traditional system:
Code: if (sentiment == "negative") return Priority.HIGH
LLM system:
Prompt: "If the clinical note indicates distress or urgency, classify as HIGH."
Logic: implied in natural language — interpreted probabilistically by the model
The prompt IS the logic. A bad prompt = a buggy program.Impact on Output Quality
Small prompt changes can have large quality effects:
Prompt A: "Summarise the following medical note."
Output: "Patient has warfarin and hypertension."
Quality: generic, missing key clinical details
Prompt B: "Summarise the following medical note in 2-3 sentences.
Include: primary diagnosis, current medications, next action.
Write for a nurse handing off to the next shift."
Output: "Patient is a 68yo female with newly diagnosed atrial fibrillation,
currently prescribed Warfarin 5mg daily. INR last checked 2 weeks ago.
Next action: schedule INR follow-up and cardiology consult."
Quality: clinically useful, complete, actionable
The second prompt takes 30 extra words and produces dramatically more useful output.Impact on Reliability
Without explicit constraints, LLMs will hallucinate, format inconsistently, and vary across calls:
Problem: "Extract the patient's drug list."
Without schema: model may return:
Call 1: "The patient takes Warfarin, Metformin, and Lisinopril."
Call 2: "Medications: 1. Warfarin 2. Metformin 3. Lisinopril"
Call 3: ["warfarin", "metformin"] (JSON, different format)
Call 4: "The patient's drug regimen includes several..."
With structured output prompt:
Always returns: {"medications": ["Warfarin", "Metformin", "Lisinopril"]}
Downstream code can't reliably parse inconsistent formats.
Structured prompts are non-negotiable for production pipelines.Impact on Cost
Token efficiency matters at scale:
At $0.005/1K input tokens (GPT-4o approximate):
System prompt: 500 tokens
User message: 1000 tokens average
Total input: 1500 tokens
At 10,000 requests/day:
10,000 × 1500 / 1000 × $0.005 = $75/day = $2,250/month
Optimising system prompt from 500 → 250 tokens:
10,000 × 1250 / 1000 × $0.005 = $62.50/day = $1,875/month
Saves $375/month — just from prompt compression
At 1M requests/day: 250-token reduction saves $37,500/month.
Every token in the system prompt is paid for every request.Impact on Safety
The system prompt is the primary safety control layer in production LLM applications:
Without safety constraints:
User: "What's the maximum safe dose of Acetaminophen I can take?"
LLM: "The maximum recommended dose is 4000mg/day. You can take 1000mg
every 6 hours. However if you..."
Risk: LLM providing specific medical dosage advice without appropriate caveats
With safety-aware prompt:
"You are a medical information assistant. You provide general health information only.
Always recommend consulting a healthcare provider for medical decisions.
Never provide specific dosage recommendations for medications."
User: "What's the maximum safe dose of Acetaminophen?"
LLM: "For general information: Acetaminophen dosing varies by individual.
Please consult your pharmacist or physician for guidance specific to you."
The prompt is the guardrail. Absent a guardrail, the default behaviour is whatever
the training distribution most commonly produces.Prompts Are a First-Class Engineering Artifact
Good prompt engineering practices:
Version control: store prompts in code repository, not in memory or UI
Evaluation: measure against a test set before deploying changes
Monitoring: log prompt + response for debugging and quality tracking
A/B testing: compare prompt versions on real traffic with metrics
Documentation: explain what the prompt does and what it's not designed for
Change management: prompt changes go through code review like any other changeInterview Answer
"Prompts are the primary mechanism for controlling LLM behaviour in production — they're effectively the application logic for AI features. Quality prompts produce more accurate, consistent, and appropriately formatted outputs. Poorly crafted prompts lead to hallucination, inconsistent formatting, and missing safety guardrails. At scale, prompt token count also directly impacts cost: a 250-token reduction in a system prompt used 1M times daily saves ~$37K/month at typical API prices. For these reasons, prompts should be version-controlled, evaluated against test sets, and treated as first-class engineering artifacts."
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.