GenAI & LLM Interviews · Lesson 8 of 30

Interview: Prompt Engineering (Part 2)

Q1: How does function calling differ from parsing free text output?

Answer: Function calling (also called "tool use") is an API feature where the model returns a structured JSON object describing a tool invocation rather than natural language text. Free text output parsing involves extracting structured data from natural language responses.

Free text parsing:

Prompt: "Extract the drug name and dose from: patient takes warfarin 5mg daily"
Output: "The drug is warfarin and the dose is 5mg per day."
Parser: regex / string search to extract values
Problem: model may use different phrasing each time, breaking the parser
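
To make the fragility concrete, here is a minimal sketch of a regex-based parser. The pattern and the `parse_medication` helper are illustrative assumptions, not part of any library; the point is only that a pattern tuned to one phrasing silently misses another.

```python
import re
from typing import Optional

# Hypothetical pattern tuned to the phrasing "The drug is X and the dose is Ymg".
PATTERN = re.compile(r"The drug is (\w+) and the dose is (\d+(?:\.\d+)?)\s*mg")

def parse_medication(text: str) -> Optional[dict]:
    """Return name and dose if the text matches the expected phrasing, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {"name": match.group(1), "dose_mg": float(match.group(2))}

# The phrasing the pattern was written for parses correctly.
first = parse_medication("The drug is warfarin and the dose is 5mg per day.")
# An equally valid rephrasing is missed entirely.
second = parse_medication("Warfarin, 5 mg once daily.")
```

Here `first` holds the extracted values while `second` is `None`, even though both sentences carry the same information.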

Function calling:

Model output: {"function": "extract_medication", "arguments": {"name": "warfarin", "dose_mg": 5.0, "frequency": "daily"}}
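Result: structured JSON with the declared field names, parseable with json.loads(). Schema conformance is guaranteed only where the provider offers a strict or structured-output mode; otherwise validate before use.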

Why function calling is better:

Output is schema-constrained: with strict schema enforcement enabled, the model can't invent fields or use unexpected formats
Argument types follow the schema: the model produces a number for dose_mg, not "5 milligrams"
No text-parsing logic needed: json.loads() plus type validation is enough
Model can signal "I need information before I can act" by calling an information-retrieval tool
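
Even with function calling, it is prudent to check the arguments before acting on them. The sketch below assumes the provider returns the arguments as a JSON string shaped like the example above; `EXPECTED_TYPES` and `parse_tool_arguments` are hypothetical names used for illustration.

```python
import json

# Assumed schema for the extract_medication tool: field name -> accepted Python types.
EXPECTED_TYPES = {"name": (str,), "dose_mg": (int, float), "frequency": (str,)}

def parse_tool_arguments(raw_arguments: str) -> dict:
    """Parse a tool call's JSON arguments and check field names and types."""
    args = json.loads(raw_arguments)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    unexpected = set(args) - set(EXPECTED_TYPES)
    missing = set(EXPECTED_TYPES) - set(args)
    if unexpected or missing:
        raise ValueError(f"unexpected: {sorted(unexpected)}, missing: {sorted(missing)}")
    for field, types in EXPECTED_TYPES.items():
        value = args[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            raise ValueError(f"field {field!r} has wrong type: {type(value).__name__}")
    return args

args = parse_tool_arguments('{"name": "warfarin", "dose_mg": 5.0, "frequency": "daily"}')
```

A payload such as `"dose_mg": "5 milligrams"` raises `ValueError` here instead of flowing into downstream logic.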

When free text parsing is still used:

Models that don't support function calling
Simple extraction where the pattern is highly reliable
Cases where you want the model to "decide" the structure, not conform to a predefined schema

Q2: Explain the prompt structure for a reliable document Q&A system.

Answer: A reliable document Q&A system requires careful prompt structuring:

System prompt: Define the assistant's role, scope, and the critical instruction to only use provided documents:

You are a clinical pharmacist. Answer questions ONLY using the provided reference documents.
If the documents don't contain information needed to answer, say: "The provided documents don't address this."
Cite document numbers for every factual claim.

User message structure:

[Document 1] {title and content}
[Document 2] {title and content}
...

Question: {user question}
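
The user message layout above can be assembled with a small helper. This is a sketch; `build_user_message` is a hypothetical name, and documents are assumed to arrive as (title, content) pairs already selected by retrieval.

```python
def build_user_message(documents: list[tuple[str, str]], question: str) -> str:
    """Number each (title, content) pair and append the question at the end."""
    blocks = [
        f"[Document {i}] {title}\n{content}"
        for i, (title, content) in enumerate(documents, start=1)
    ]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}"

message = build_user_message(
    [("Reference A", "First passage."), ("Reference B", "Second passage.")],
    "What does Reference B say?",
)
```

Numbering the documents in code keeps the citation labels the model sees consistent with the indices a later validator checks.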

Key design decisions:

Explicit grounding instruction: "Answer ONLY using the provided documents" reduces hallucination drawn from pretraining knowledge
Citation requirement: Forces the model to link every claim to a source, which makes hallucination detectable (for example, the model cites [Document 3] but there is no Document 3, or the claim isn't in the cited document)
Graceful degradation: "If not in documents, say so" — better than making something up
Document ordering: Most relevant documents at the start and end to combat lost-in-the-middle
Document count: Fewer, more relevant documents outperform many weakly relevant ones
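
One way to implement the start-and-end ordering is to alternate relevance-ranked documents between the front and the back of the list. This is a sketch of one possible scheme, not a standard algorithm; `order_for_edges` is a hypothetical name and the input is assumed to be sorted most relevant first.

```python
def order_for_edges(ranked_docs: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end, weakest in the middle.

    ranked_docs must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-most-relevant document comes last.
    return front + back[::-1]

ordered = order_for_edges(["A", "B", "C", "D", "E"])
# ordered == ["A", "C", "E", "D", "B"]
```

With five documents ranked A through E, the strongest (A) leads, the second strongest (B) closes, and the weakest (E) sits in the middle.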

For production RAG: add an output validator that checks whether cited documents actually contain the claimed information.
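
A first-pass validator can at least catch citations to documents that were never provided. The sketch below assumes citations appear in the answer as `[Document N]`; `find_invalid_citations` is a hypothetical helper. Checking that a cited document actually supports the claim needs a further step, such as an LLM judge or an entailment check.

```python
import re

def find_invalid_citations(answer: str, num_documents: int) -> list[int]:
    """Return cited document numbers that fall outside 1..num_documents."""
    cited = {int(n) for n in re.findall(r"\[Document (\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_documents)

invalid = find_invalid_citations(
    "The two drugs interact [Document 1]. Avoid the combination [Document 3].",
    num_documents=2,
)
```

Here `invalid` contains 3, because only two documents were supplied.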

Q3: What is self-consistency and when would you use it vs chain-of-thought?

Answer: Chain-of-thought (CoT) generates one reasoning path at temperature 0 or low temperature. Self-consistency generates N reasoning paths at higher temperature, then takes a majority vote on the final answers.
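
The voting step can be sketched in a few lines. In this sketch `sample_answer` stands in for one LLM call at non-zero temperature that returns only the extracted final answer; it and the canned responses are assumptions for illustration, not a real client.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int = 5) -> tuple[str, float]:
    """Draw n sampled answers and return (majority answer, agreement rate)."""
    answers = [sample_answer() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

# Stub sampler: four reasoning paths agree, one diverges.
canned = iter(["42", "42", "41", "42", "42"])
answer, agreement = self_consistency(lambda: next(canned), n=5)
# answer == "42", agreement == 0.8
```

The agreement rate can be used as a rough confidence signal, for example to escalate to a human when it falls below a chosen threshold.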

Self-consistency advantages:

More robust to a single reasoning path making an error early on
The agreement rate doubles as a rough confidence signal (4 of 5 paths agreeing is more reassuring than a single path)
Accuracy gains on multi-step reasoning tasks are often cited as roughly 5-15%, though they vary by model and task

When to use CoT alone:

Cost is a concern — self-consistency costs N× more
Tasks where one path is highly reliable (simple calculation, single-step lookup)
Latency-sensitive applications (even when run in parallel, the N calls finish only when the slowest one does)

When to use self-consistency:

High-stakes decisions where a wrong answer has significant consequences
Complex clinical calculations (e.g., dosing adjustments with multiple patient parameters)
Logical reasoning tasks where early errors compound
When you're willing to pay the N× cost for higher reliability

Rule of thumb: Use self-consistency (N=5-7 samples) when the cost of a wrong answer significantly exceeds the cost of 5-7 LLM calls. For a clinical dosing decision that might harm a patient, the extra cost is trivially justified.

Q4: How do you evaluate whether a new system prompt version is better than the old one?

Answer: Structured prompt evaluation:

Step 1: Maintain an eval dataset. A collection of (input, expected_output_or_criteria) pairs covering normal cases, edge cases, safety cases, and out-of-scope requests. Minimum 50-100 cases; aim for 200+.

Step 2: Score both versions. Run both prompts against the full eval dataset using consistent scoring (LLM-as-judge with a fixed rubric, or exact-match for structured outputs).

Step 3: Compare by category. A change that improves drug interactions by 10% but regresses safety cases by 15% is not an improvement. Look at category-level results, not just overall.

Step 4: Check for regressions. Any case that passes in v1 but fails in v2 is a regression. List all regressions explicitly.

Step 5: A/B test on live traffic (optional). Route a fraction of live requests to each version. Collect user feedback or implicit signals (did the user ask for clarification? Did they accept the response?).

Decision criteria: Deploy the new prompt only if:

Overall pass rate is equal or higher
Safety category pass rate doesn't decrease
No regressions in critical cases (e.g., known edge cases from past production failures)
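
The decision criteria above can be expressed as a small gate. This is a sketch under assumed data shapes: `CaseResult` and `should_deploy` are hypothetical names, and each eval case is assumed to carry a category label and a flag marking it as critical.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class CaseResult:
    case_id: str
    category: str
    critical: bool
    passed_v1: bool
    passed_v2: bool

def pass_rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; an empty set of cases counts as fully passing."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 1.0

def should_deploy(results: list[CaseResult]) -> tuple[bool, list[str]]:
    """Apply the three deployment criteria and list every regression."""
    regressions = [r.case_id for r in results if r.passed_v1 and not r.passed_v2]
    overall_ok = pass_rate(r.passed_v2 for r in results) >= pass_rate(r.passed_v1 for r in results)
    safety = [r for r in results if r.category == "safety"]
    safety_ok = pass_rate(r.passed_v2 for r in safety) >= pass_rate(r.passed_v1 for r in safety)
    critical_ok = not any(r.critical and r.passed_v1 and not r.passed_v2 for r in results)
    return overall_ok and safety_ok and critical_ok, regressions

results = [
    CaseResult("a1", "accuracy", False, passed_v1=False, passed_v2=True),
    CaseResult("a2", "accuracy", False, passed_v1=False, passed_v2=True),
    CaseResult("s1", "safety", False, passed_v1=True, passed_v2=False),
]
deploy, regressions = should_deploy(results)
```

In this example the overall pass rate improves, but the safety regression on `s1` blocks deployment.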

Q5: What are the pros and cons of very long system prompts?

Answer:

Pros of longer system prompts:

More coverage of edge cases — explicit instructions for more scenarios
Less model discretion — the model follows explicit rules rather than guessing
Easier to audit — all behavior is documented in one place

Cons of longer system prompts:

Attention dilution: as the prompt grows, each instruction competes with more surrounding text, so any single instruction is more likely to be overlooked
Instruction conflicts: more instructions increase the chance of contradictory rules; the model may favor one over the other unpredictably
Maintenance burden: 2000-word prompts become hard to update safely
Cost: every token in the system prompt costs money on every request

Rule of thumb: Instruction adherence tends to degrade once a system prompt grows past roughly 500-700 words, though the threshold varies by model. Beyond that, the model often fails to follow low-priority instructions consistently. Instructions in the middle of a long system prompt tend to be followed less reliably than those at the beginning and end.

Best practice: Keep core behavior instructions under 500 words. Use few-shot examples rather than descriptions for complex format requirements. Move conditional behavior (handle edge case X like Y) into separate prompt variants routed by a classifier, rather than bloating a single prompt.

Q6: System design — build a prompt engineering infrastructure for a clinical AI platform

Scenario: 10 different AI features (drug interaction checker, discharge summaries, patient counseling, dosing calculator, etc.) used by 500 clinical staff. How do you manage prompts at scale?

Answer:

Prompt Registry: Version-controlled YAML/JSON files per feature, stored in Git. Each file contains:

System prompt template with {variable} placeholders
Declared variables and defaults
Required eval score threshold for deployment
Author, version, changelog
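
A registry entry and its rendering step might look like the sketch below. The field names (`template`, `variables`, `min_eval_score`) and the `render_prompt` helper are assumptions for illustration; the entry is shown as a Python dict, as it might look after loading a YAML or JSON file.

```python
# Hypothetical registry entry; a default of None marks a required variable.
entry = {
    "feature": "drug_interaction_checker",
    "version": "1.2.0",
    "min_eval_score": 0.85,
    "template": "You are a clinical pharmacist at {site}. Respond in {language}.",
    "variables": {"site": None, "language": "English"},
}

def render_prompt(entry: dict, variables: dict) -> str:
    """Fill the template, applying declared defaults and rejecting bad input."""
    declared = entry["variables"]
    unknown = set(variables) - set(declared)
    if unknown:
        raise KeyError(f"undeclared variables: {sorted(unknown)}")
    values = {**declared, **variables}
    missing = [name for name, value in values.items() if value is None]
    if missing:
        raise KeyError(f"missing required variables: {missing}")
    # str.format raises KeyError if the template uses an undeclared placeholder.
    return entry["template"].format(**values)

prompt = render_prompt(entry, {"site": "General Hospital"})
```

Failing loudly on unknown or missing variables keeps a typo in a variable name from shipping as a half-rendered prompt.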

Eval Pipeline (CI/CD): Every PR touching a prompt file triggers:

Render prompt with test variables
Run against the feature's eval dataset (100+ cases)
Check against minimum pass rates by category (safety: 100%, accuracy: 85%)
Block merge if any threshold fails

A/B Testing Infrastructure: Traffic splitting at the application layer:

90% gets the current production prompt
10% gets the candidate prompt
Implicit signals: user follow-up rate, correction requests, escalations
Explicit evals: random sample of responses scored by LLM judge
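
The traffic split is commonly made deterministic, so a given user sees the same prompt version throughout a test. The sketch below hashes a user ID into a bucket; `assign_variant` and its parameter names are hypothetical, and the 10% default mirrors the split above.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, candidate_share: float = 0.10) -> str:
    """Map a user to 'candidate' or 'production', stably for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_share else "production"

variant = assign_variant("clinician-123", "interaction-checker-v2")
```

Including the experiment name in the hash means a user who lands in the candidate group for one test is not automatically in it for the next.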

Prompt Versioning and Rollback: Each deployed prompt has a version tag. A/B test winner becomes the new production version. Previous version is kept for 30-day rollback window. Audit log of every version deployed.

Monitoring: Track per-feature, per-prompt-version:

LLM judge score on a daily random sample of 50 responses
Input token count (watch for context window issues)
Error rate (parse failures, validation failures)
Latency percentiles

Feature Flag Integration: Prompts are associated with feature flags. Roll out new prompts to 10% → 25% → 50% → 100% with automatic rollback if error rates spike.

Q7: How do you handle prompts that need to work across multiple LLM providers?

Answer: Provider-agnostic prompt design requires abstraction:

Common issues:

System prompt placement: OpenAI uses a "system" role message; Anthropic uses the system parameter; older models may not have a system role
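Instruction-following calibration: providers respond differently to the same instruction (for example, Claude is often reported to comply more readily with negative instructions, while GPT-4o may need more explicit constraints); treat these as tendencies to verify in your own evals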
Output format: Different models have different verbosity defaults; format instructions must be explicit

Abstraction approach:

Python

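class PromptAdapter:
    """Convert one (system, user) pair into each provider's request shape."""

    def to_openai(self, system: str, user: str) -> list[dict]:
        # OpenAI chat format: the system prompt is the first message.
        return [{"role": "system", "content": system}, {"role": "user", "content": user}]

    def to_anthropic(self, system: str, user: str) -> dict:
        # Anthropic format: the system prompt is a top-level parameter.
        return {"system": system, "messages": [{"role": "user", "content": user}]}

    def to_provider(self, provider: str, system: str, user: str):
        # Dispatch by name; reject anything without a dedicated converter.
        if provider not in ("openai", "anthropic"):
            raise ValueError(f"Unsupported provider: {provider!r}")
        return getattr(self, f"to_{provider}")(system, user)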

Provider-specific calibration: Maintain separate prompt versions for each provider when behavior differs significantly. The same instructions produce different results on GPT-4o vs Claude — test each separately.

Test on all target providers: Your eval suite must run against all providers you support. A prompt with 90% pass rate on GPT-4o may be 75% on Claude 3 Haiku.

Q8: What is the difference between prompt engineering and fine-tuning, and when do you choose each?

Answer:

| | Prompt Engineering | Fine-Tuning |
|---|---|---|
| Speed to test | Minutes | Hours to days |
| Cost | Per-call inference cost only | GPU time for training + ongoing inference |
| Knowledge | Can't teach new facts | Can teach new facts and patterns |
| Format control | Possible with instructions | More reliable with training |
| Domain expertise | Limited to pretraining knowledge | Can specialize on domain data |
| Model updates | Works on new model versions | Must retrain on new base |
| Reversibility | Immediate: just change the prompt | Requires new training run |

Use prompt engineering when:

The task is in the model's pretraining distribution
Format control and behavior constraints are the main need
You need to iterate quickly
The task type changes (general-purpose system)

Use fine-tuning when:

The task requires knowledge not in pretraining (proprietary terminology, internal processes)
Prompt engineering has plateaued — you've exhausted optimization
Consistent style/format is critical and prompts can't achieve it reliably
Cost at scale: a smaller fine-tuned model is cheaper than prompting a large model

Q9: How do you reduce hallucination in LLM outputs?

Answer: Hallucination reduction requires multiple coordinated techniques:

Retrieval grounding: "Answer only using these documents" + citation requirements. A claim the model cannot tie to a provided document must either be supported by retrieving more material or be declared unknown.

Uncertainty acknowledgment instructions: "If you are not certain about a specific fact, say 'I'm not certain — verify with [reference]' rather than guessing. Confident wrong answers are more dangerous than acknowledged uncertainty."

Output validation: Run generated claims through a fact-checking step (another LLM call or structured lookup):

Python

def validate_claims(output: str, knowledge_base: dict) -> list[str]:
    """Check factual claims against a structured knowledge base."""
    ...
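
For a narrow class of claims, a structured lookup can be sketched as follows. This is not the full `validate_claims` above: `validate_dose_claims` is a hypothetical helper that only checks mentions of the form "<drug> <N> mg" against an assumed knowledge base mapping drug names to a reference maximum dose. The drug names and limits are placeholders, not clinical data.

```python
import re

def validate_dose_claims(output: str, knowledge_base: dict[str, float]) -> list[str]:
    """Return one message per '<drug> <N> mg' mention that is unknown or too high."""
    problems = []
    for name, dose in re.findall(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*mg\b", output):
        max_dose = knowledge_base.get(name.lower())
        if max_dose is None:
            problems.append(f"{name}: not in knowledge base")
        elif float(dose) > max_dose:
            problems.append(f"{name}: {dose} mg exceeds reference maximum {max_dose} mg")
    return problems

# Placeholder knowledge base: drug name -> reference maximum dose in mg.
kb = {"drugalpha": 10.0}
problems = validate_dose_claims("Give drugalpha 50 mg daily and drugbeta 5 mg at night.", kb)
```

Free-form claims that do not fit a pattern like this would need the LLM-based checking step instead.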

Constrained generation: For structured outputs, use JSON mode or Pydantic schemas. When the model is constrained to pick from enumerated values, it can't hallucinate arbitrary strings.
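
The effect of enumerated values can be shown with the standard library alone. In practice a Pydantic model would play this role; the sketch below uses `Enum` and a dataclass to stay self-contained, and the `Severity` values and field names are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    NONE = "none"
    MODERATE = "moderate"
    SEVERE = "severe"

@dataclass
class InteractionResult:
    drug_a: str
    drug_b: str
    severity: Severity

def parse_interaction(raw: str) -> InteractionResult:
    """Parse model output; Severity(...) raises ValueError for values outside the enum."""
    data = json.loads(raw)
    return InteractionResult(
        drug_a=str(data["drug_a"]),
        drug_b=str(data["drug_b"]),
        severity=Severity(data["severity"]),
    )

result = parse_interaction('{"drug_a": "warfarin", "drug_b": "aspirin", "severity": "severe"}')
```

An output with `"severity": "catastrophic"` fails at parse time rather than reaching the user.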

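Temperature = 0: Higher temperatures make low-probability tokens more likely to be sampled, which raises the chance of fabricated details. For factual tasks, prefer greedy decoding. This reduces variance but does not by itself prevent hallucination, so combine it with the techniques above.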

Evaluation: Test with known-unknown inputs (questions about drugs, events, or facts that don't exist). A well-calibrated model should refuse to answer or express uncertainty rather than fabricate plausible-sounding information.

Q10: What are the most common production prompt engineering failures and how do you prevent them?

Answer:

1. Prompt regression after model updates
Prevention: Monitor for model version changes (for example, the system_fingerprint field in OpenAI responses); run the eval suite automatically when a change is detected; maintain eval pass rates as a deployment gate.

2. Format degradation over long conversations
Prevention: Reinforce format in the system prompt with explicit examples; validate outputs structurally; consider a format-restoration step for long sessions.

3. Instruction overload
Prevention: Keep system prompts under 500 words; remove instructions that can be replaced by examples; measure instruction compliance in the eval suite.

4. Context window overflow in production
Prevention: Track token counts at runtime; implement context management (sliding window, summarization); alert when requests approach 80% of the context limit.

5. Prompt injection from user-provided documents
Prevention: Wrap documents in delimiters; instruct the model to treat enclosed content as data; validate outputs for injection signatures.

6. Few-shot examples drifting from policy
Prevention: Version control examples with the prompt; validate examples in CI; require examples to pass the same eval criteria as the prompt.

7. Sycophancy bias in clinical or high-stakes settings
Prevention: Include anti-sycophancy instructions; add adversarial test cases (present wrong information confidently and verify the model corrects it); use a critical role framing ("You are a skeptical reviewer").

8. Cost overruns from long contexts
Prevention: Monitor average input token counts per feature; alert on outliers; implement hard token budget limits with truncation logic.
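
The 80% alert from item 4 can be sketched as a runtime check. The token estimate here is a crude assumption (about four characters per token for English text); a real deployment would use the provider's tokenizer. `estimate_tokens` and `check_budget` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)

def check_budget(prompt: str, context_limit: int, alert_ratio: float = 0.8) -> str:
    """Classify a request as 'ok', 'alert' (near the limit), or 'over_limit'."""
    used = estimate_tokens(prompt)
    if used > context_limit:
        return "over_limit"
    if used >= alert_ratio * context_limit:
        return "alert"
    return "ok"

status = check_budget("a" * 3400, context_limit=1000)
```

A request classified as `over_limit` would be routed to the truncation or summarization logic rather than sent as-is.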
