GenAI & LLM Interviews · Lesson 8 of 30
Interview: Prompt Engineering (Part 2)
Q1: How does function calling differ from parsing free text output?
Answer: Function calling (also called "tool use") is an API feature where the model returns a structured JSON object describing a tool invocation rather than natural language text. Free text output parsing involves extracting structured data from natural language responses.
Free text parsing:
Prompt: "Extract the drug name and dose from: patient takes warfarin 5mg daily"
Output: "The drug is warfarin and the dose is 5mg per day."
Parser: regex / string search to extract values
Problem: model may use different phrasing each time, breaking the parserFunction calling:
Model output: {"function": "extract_medication", "arguments": {"name": "warfarin", "dose_mg": 5.0, "frequency": "daily"}}
Result: always valid JSON, always correct field names, parseable with json.loads()Why function calling is better:
- Output is schema-constrained — model can't invent fields or use unexpected formats
- Argument types are enforced — the model produces a number for
dose_mg, not "5 milligrams" - No parsing logic needed — direct
json.loads()and type validation - Model can signal "I need information before I can act" by calling an information-retrieval tool
When free text parsing is still used:
- Models that don't support function calling
- Simple extraction where the pattern is highly reliable
- Cases where you want the model to "decide" the structure, not conform to a predefined schema
Q2: Explain the prompt structure for a reliable document Q&A system.
Answer: A reliable document Q&A system requires careful prompt structuring:
System prompt: Define the assistant's role, scope, and the critical instruction to only use provided documents:
You are a clinical pharmacist. Answer questions ONLY using the provided reference documents.
If the documents don't contain information needed to answer, say: "The provided documents don't address this."
Cite document numbers for every factual claim.User message structure:
[Document 1] {title and content}
[Document 2] {title and content}
...
Question: {user question}Key design decisions:
- Explicit grounding instruction: "Answer ONLY using the provided documents" — prevents hallucination from pretraining knowledge
- Citation requirement: Forces the model to link every claim to a source — detects hallucination (if the model cites [Document 3] but there's no Document 3, or the claim isn't in the cited document)
- Graceful degradation: "If not in documents, say so" — better than making something up
- Document ordering: Most relevant documents at the start and end to combat lost-in-the-middle
- Document count: Fewer, more relevant documents outperform many weakly relevant ones
For production RAG: add an output validator that checks whether cited documents actually contain the claimed information.
Q3: What is self-consistency and when would you use it vs chain-of-thought?
Answer: Chain-of-thought (CoT) generates one reasoning path at temperature 0 or low temperature. Self-consistency generates N reasoning paths at higher temperature, then takes a majority vote on the final answers.
Self-consistency advantages:
- More robust to a single reasoning path making an error early on
- Better calibrated confidence (if 4 of 5 paths agree, that's more reliable than 1 path)
- Empirically provides 5-15% accuracy gains on multi-step reasoning tasks
When to use CoT alone:
- Cost is a concern — self-consistency costs N× more
- Tasks where one path is highly reliable (simple calculation, single-step lookup)
- Latency-sensitive applications (N parallel calls still take longer)
When to use self-consistency:
- High-stakes decisions where a wrong answer has significant consequences
- Complex clinical calculations (e.g., dosing adjustments with multiple patient parameters)
- Logical reasoning tasks where early errors compound
- When you're willing to pay the N× cost for higher reliability
Rule of thumb: Use self-consistency (N=5-7 samples) when the cost of a wrong answer significantly exceeds the cost of 5-7 LLM calls. For a clinical dosing decision that might harm a patient, the extra cost is trivially justified.
Q4: How do you evaluate whether a new system prompt version is better than the old one?
Answer: Structured prompt evaluation:
Step 1: Maintain an eval dataset. A collection of (input, expected_output_or_criteria) pairs covering normal cases, edge cases, safety cases, and out-of-scope requests. Minimum 50-100 cases; aim for 200+.
Step 2: Score both versions. Run both prompts against the full eval dataset using consistent scoring (LLM-as-judge with a fixed rubric, or exact-match for structured outputs).
Step 3: Compare by category. A change that improves drug interactions by 10% but regresses safety cases by 15% is not an improvement. Look at category-level results, not just overall.
Step 4: Check for regressions. Any case that passes in v1 but fails in v2 is a regression. List all regressions explicitly.
Step 5: A/B test on live traffic (optional). Route a fraction of live requests to each version. Collect user feedback or implicit signals (did the user ask for clarification? Did they accept the response?).
Decision criteria: Deploy the new prompt only if:
- Overall pass rate is equal or higher
- Safety category pass rate doesn't decrease
- No regressions in critical cases (e.g., known edge cases from past production failures)
Q5: What are the pros and cons of very long system prompts?
Answer:
Pros of longer system prompts:
- More coverage of edge cases — explicit instructions for more scenarios
- Less model discretion — the model follows explicit rules rather than guessing
- Easier to audit — all behavior is documented in one place
Cons of longer system prompts:
- Attention dilution: models attend to all parts of the context, but longer prompts mean each instruction gets proportionally less "attention weight"
- Instruction conflicts: more instructions increase the chance of contradictory rules; the model may favor one over the other unpredictably
- Maintenance burden: 2000-word prompts become hard to update safely
- Cost: every token in the system prompt costs money on every request
Empirical observation: Prompt quality decays roughly past 500-700 words. Beyond that, the model often fails to follow low-priority instructions consistently. The middle of a long system prompt is less reliably followed than the beginning and end.
Best practice: Keep core behavior instructions under 500 words. Use few-shot examples rather than descriptions for complex format requirements. Move conditional behavior (handle edge case X like Y) into separate prompt variants routed by a classifier, rather than bloating a single prompt.
Q6: System design — build a prompt engineering infrastructure for a clinical AI platform
Scenario: 10 different AI features (drug interaction checker, discharge summaries, patient counseling, dosing calculator, etc.) used by 500 clinical staff. How do you manage prompts at scale?
Answer:
Prompt Registry: Version-controlled YAML/JSON files per feature, stored in Git. Each file contains:
- System prompt template with
{variable}placeholders - Declared variables and defaults
- Required eval score threshold for deployment
- Author, version, changelog
Eval Pipeline (CI/CD): Every PR touching a prompt file triggers:
- Render prompt with test variables
- Run against the feature's eval dataset (100+ cases)
- Check against minimum pass rates by category (safety: 100%, accuracy: 85%)
- Block merge if any threshold fails
A/B Testing Infrastructure: Traffic splitting at the application layer:
- 90% gets the current production prompt
- 10% gets the candidate prompt
- Implicit signals: user follow-up rate, correction requests, escalations
- Explicit evals: random sample of responses scored by LLM judge
Prompt Versioning and Rollback: Each deployed prompt has a version tag. A/B test winner becomes the new production version. Previous version is kept for 30-day rollback window. Audit log of every version deployed.
Monitoring: Track per-feature, per-prompt-version:
- LLM judge score on a daily random sample of 50 responses
- Input token count (watch for context window issues)
- Error rate (parse failures, validation failures)
- Latency percentiles
Feature Flag Integration: Prompts are associated with feature flags. Roll out new prompts to 10% → 25% → 50% → 100% with automatic rollback if error rates spike.
Q7: How do you handle prompts that need to work across multiple LLM providers?
Answer: Provider-agnostic prompt design requires abstraction:
Common issues:
- System prompt placement: OpenAI uses a "system" role message; Anthropic uses the
systemparameter; older models may not have a system role - Instruction following calibration: Claude tends to be more compliant with negative instructions; GPT-4o may require more explicit constraints
- Output format: Different models have different verbosity defaults; format instructions must be explicit
Abstraction approach:
class PromptAdapter:
def to_openai(self, system: str, user: str) -> list[dict]:
return [{"role": "system", "content": system}, {"role": "user", "content": user}]
def to_anthropic(self, system: str, user: str) -> dict:
return {"system": system, "messages": [{"role": "user", "content": user}]}
def to_provider(self, provider: str, system: str, user: str):
return getattr(self, f"to_{provider}")(system, user)Provider-specific calibration: Maintain separate prompt versions for each provider when behavior differs significantly. The same instructions produce different results on GPT-4o vs Claude — test each separately.
Test on all target providers: Your eval suite must run against all providers you support. A prompt with 90% pass rate on GPT-4o may be 75% on Claude 3 Haiku.
Q8: What is the difference between prompt engineering and fine-tuning, and when do you choose each?
Answer:
| | Prompt Engineering | Fine-Tuning | |---|---|---| | Speed to test | Minutes | Hours to days | | Cost | Per-call inference cost only | GPU time for training + ongoing inference | | Knowledge | Can't teach new facts | Can teach new facts and patterns | | Format control | Possible with instructions | More reliable with training | | Domain expertise | Limited to pretraining knowledge | Can specialize on domain data | | Model updates | Works on new model versions | Must retrain on new base | | Reversibility | Immediate — just change the prompt | Requires new training run |
Use prompt engineering when:
- The task is in the model's pretraining distribution
- Format control and behavior constraints are the main need
- You need to iterate quickly
- The task type changes (general-purpose system)
Use fine-tuning when:
- The task requires knowledge not in pretraining (proprietary terminology, internal processes)
- Prompt engineering has plateaued — you've exhausted optimization
- Consistent style/format is critical and prompts can't achieve it reliably
- Cost at scale: a smaller fine-tuned model is cheaper than prompting a large model
Q9: How do you reduce hallucination in LLM outputs?
Answer: Hallucination reduction requires multiple coordinated techniques:
Retrieval grounding: "Answer only using these documents" + citation requirements. Any claim that can't be cited is forced to be either retrieved or declared unknown.
Uncertainty acknowledgment instructions: "If you are not certain about a specific fact, say 'I'm not certain — verify with [reference]' rather than guessing. Confident wrong answers are more dangerous than acknowledged uncertainty."
Output validation: Run generated claims through a fact-checking step (another LLM call or structured lookup):
def validate_claims(output: str, knowledge_base: dict) -> list[str]:
"""Check factual claims against a structured knowledge base."""
...Constrained generation: For structured outputs, use JSON mode or Pydantic schemas. When the model is constrained to pick from enumerated values, it can't hallucinate arbitrary strings.
Temperature = 0: Higher temperature increases hallucination probability. For factual tasks, always use greedy decoding.
Evaluation: Test with known-unknown inputs (questions about drugs, events, or facts that don't exist). A well-calibrated model should refuse to answer or express uncertainty rather than fabricate plausible-sounding information.
Q10: What are the most common production prompt engineering failures and how do you prevent them?
Answer:
1. Prompt regression after model updates
Prevention: Monitor system_fingerprint changes; run eval suite automatically when detected; maintain eval pass rates as a deployment gate.
2. Format degradation over long conversations Prevention: Reinforce format in the system prompt with explicit examples; validate outputs structurally; consider a format-restoration step for long sessions.
3. Instruction overload Prevention: Keep system prompts under 500 words; remove instructions that can be replaced by examples; measure instruction compliance in eval suite.
4. Context window overflow in production Prevention: Track token counts at runtime; implement context management (sliding window, summarization); alert when requests approach 80% of context limit.
5. Prompt injection from user-provided documents Prevention: Wrap documents in delimiters; instruct model to treat enclosed content as data; validate outputs for injection signatures.
6. Few-shot examples drifting from policy Prevention: Version control examples with the prompt; validate examples in CI; require examples to pass the same eval criteria as the prompt.
7. Sycophancy bias in clinical or high-stakes settings Prevention: Include anti-sycophancy instructions; add adversarial test cases (present wrong information confidently and verify the model corrects it); use a critical role framing ("You are a skeptical reviewer").
8. Cost overruns from long contexts Prevention: Monitor average input token counts per feature; alert on outliers; implement hard token budget limits with truncation logic.