Few-Shot Prompting
Provide examples in the prompt — input/output pairs, how many to use, choosing diverse examples, and chain of thought in few-shot.
Few-shot prompting is the technique of including a small number of input/output examples directly in the prompt, before the actual task input. The model learns the pattern, format, and style from these examples and applies it to the new input.
This is one of the most powerful techniques in the prompt engineer's toolkit. It can transform a model's output quality dramatically — not by changing the model's weights, but by showing it exactly what you want.
The Core Mechanism
When you include examples in the prompt, you are exploiting the model's in-context learning ability. The Transformer architecture can recognize patterns in the sequence of tokens and generalize from them — even within a single forward pass, without gradient updates.
Think of it as: every example you provide is a data point that shapes the model's interpretation of your task.
Basic Few-Shot Format
[Task description]
Input: [example 1 input]
Output: [example 1 output]
Input: [example 2 input]
Output: [example 2 output]
Input: [example 3 input]
Output: [example 3 output]
Input: [actual input]
Output:
The model sees the pattern "Input: X → Output: Y" repeated and completes the final "Output:" accordingly.
Example: Clinical Note Triage Classification
Without few-shot:
Classify the urgency of this clinical note: "Patient complains of mild headache for 2 days."
The model might give a verbose paragraph instead of a clean label.
With few-shot:
Classify clinical notes into urgency tiers: STAT (immediate), URGENT (within 4 hours), ROUTINE (within 48 hours).
Input: "Patient with crushing chest pain radiating to left arm, diaphoretic."
Output: STAT
Input: "Patient has fever 39.2°C and productive cough for 3 days."
Output: URGENT
Input: "Patient requests medication refill for chronic hypertension."
Output: ROUTINE
Input: "Patient complains of mild headache for 2 days, no neurological symptoms."
Output:
The model completes the prompt with:
ROUTINE
The model now outputs exactly the demonstrated format: no explanation, no hedging, just the label.
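Even with a clean few-shot prompt, it is worth checking that the reply really is one of the expected labels before acting on it. The following is a minimal sketch of such a check; `ALLOWED_LABELS` and `parse_label` are hypothetical names introduced here for illustration, and the cleanup rules (case folding, stripping an echoed "Output:" prefix, dropping trailing punctuation) are assumptions about how a reply might deviate.

```python
from typing import Optional

# Assumption: the three tiers from the prompt above are the only valid labels.
ALLOWED_LABELS = ("STAT", "URGENT", "ROUTINE")


def parse_label(raw: str, allowed=ALLOWED_LABELS) -> Optional[str]:
    """Return the allowed label found in the model's reply, or None."""
    cleaned = raw.strip().upper()
    # Tolerate a reply that echoes the "Output:" prefix from the prompt.
    if cleaned.startswith("OUTPUT:"):
        cleaned = cleaned[len("OUTPUT:"):].strip()
    # Tolerate trailing sentence punctuation such as "ROUTINE."
    cleaned = cleaned.rstrip(".!")
    return cleaned if cleaned in allowed else None


print(parse_label(" routine.\n"))
print(parse_label("The note is probably routine"))
```

A `None` result signals that the reply should be retried or routed to a human rather than silently treated as a label.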
How Many Examples to Use
Practical experience points to 3 to 5 examples as a good starting point for most tasks:
- 1 example: Helps with format but provides almost no task learning.
- 3 examples: Usually enough to establish pattern, format, and edge cases.
- 5 examples: Recommended for complex classification or specialized formats.
- 10 or more examples: Diminishing returns; may push useful context out of the window; consider fine-tuning instead.
There are exceptions:
- If your classes/categories number more than 5, try to include at least one example per class.
- For very complex reasoning tasks, more examples help — up to the model's context limit.
- For simple format tasks (JSON extraction), 2 examples are often sufficient.
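The trade-off between example count and context space can be made explicit in code. The sketch below keeps adding examples until either a maximum count or a token budget is reached. The names `estimate_tokens` and `fit_examples_to_budget` are hypothetical, and the four-characters-per-token estimate is only a rough rule of thumb assumed for illustration; a real tokenizer would give exact counts.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_examples_to_budget(examples, max_examples=5, token_budget=500):
    """Keep examples, in order, until the count or token limit is reached.

    examples: list of (input_text, output_text) pairs.
    Returns (kept_examples, estimated_tokens_used).
    """
    kept, used = [], 0
    for inp, out in examples[:max_examples]:
        cost = estimate_tokens(f"Input: {inp}\nOutput: {out}\n")
        if used + cost > token_budget:
            break  # the next example would exceed the budget
        kept.append((inp, out))
        used += cost
    return kept, used


pool = [("Medication refill request, stable patient", "ROUTINE")] * 10
kept, used = fit_examples_to_budget(pool, max_examples=5, token_budget=500)
print(len(kept), used)
```

Ordering the pool so that the most informative examples come first means a tight budget drops the least important ones.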
Choosing Diverse, Representative Examples
The quality of your examples matters far more than the quantity. Bad examples actively hurt performance.
Principles for good example selection:
1. Cover the space of variation
If your inputs vary in length, formality, domain, or complexity, your examples should reflect that variation. Don't use 3 examples that are all short, informal, and simple.
2. Include edge cases
If you expect some inputs to be ambiguous or borderline, include an example that shows how to handle ambiguity:
Input: "Patient reports fatigue and mild shortness of breath when climbing stairs."
Output: URGENT # Note: dyspnea warrants prompt evaluation even when mild
3. Balance across classes
For classification tasks, aim for roughly equal representation across categories. If you include 4 ROUTINE examples and 1 STAT example, the model will be biased toward ROUTINE.
4. Use real data when possible
Examples from your actual domain (anonymized if needed) are almost always better than examples you wrote yourself. They capture authentic phrasing and edge cases.
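Some of these principles can be checked mechanically before a prompt is used. The sketch below audits a candidate example set for class balance and for variation in input length. `audit_examples` and its `max_ratio` threshold are hypothetical, chosen here for illustration; what counts as "too imbalanced" is a judgment call for your task.

```python
from collections import Counter


def audit_examples(examples, max_ratio=2.0):
    """Report class balance and input-length spread for an example set.

    examples: non-empty list of (input_text, label) pairs.
    max_ratio: assumed threshold; the set is flagged when the most common
    class has more than max_ratio times the examples of the rarest class.
    """
    counts = Counter(label for _, label in examples)
    word_counts = [len(text.split()) for text, _ in examples]
    most, least = max(counts.values()), min(counts.values())
    return {
        "class_counts": dict(counts),
        "imbalanced": most / least > max_ratio,
        "min_words": min(word_counts),
        "max_words": max(word_counts),
    }


candidate_set = [
    ("Medication refill request", "ROUTINE"),
    ("Annual physical exam scheduling", "ROUTINE"),
    ("Follow-up for chronic low back pain", "ROUTINE"),
    ("Patient wants flu vaccine", "ROUTINE"),
    ("Crushing chest pain, diaphoretic", "STAT"),
]
print(audit_examples(candidate_set))
```

A narrow gap between `min_words` and `max_words` is a hint that the examples do not cover the length variation of real inputs.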
Python: Dynamic Few-Shot with Example Selection
import random
from dataclasses import dataclass

import openai

# The client reads OPENAI_API_KEY from the environment.
client = openai.OpenAI()


@dataclass
class Example:
    input_text: str
    output_text: str
    category: str  # for balanced selection


EXAMPLE_POOL = [
    Example("Crushing chest pain, diaphoretic, arm radiation", "STAT", "STAT"),
    Example("Altered mental status, confusion, new onset", "STAT", "STAT"),
    Example("Anaphylaxis, throat swelling after bee sting", "STAT", "STAT"),
    Example("Fever 39.5C, rigors, suspected sepsis", "URGENT", "URGENT"),
    Example("Worsening dyspnea, O2 sat 91% on room air", "URGENT", "URGENT"),
    Example("New onset atrial fibrillation, hemodynamically stable", "URGENT", "URGENT"),
    Example("Medication refill request, stable patient", "ROUTINE", "ROUTINE"),
    Example("Annual physical exam scheduling", "ROUTINE", "ROUTINE"),
    Example("Follow-up for chronic low back pain", "ROUTINE", "ROUTINE"),
]


def select_balanced_examples(pool: list[Example], n_per_class: int = 1) -> list[Example]:
    """Select balanced examples across categories."""
    categories = sorted({e.category for e in pool})
    selected = []
    for cat in categories:
        cat_examples = [e for e in pool if e.category == cat]
        selected.extend(random.sample(cat_examples, min(n_per_class, len(cat_examples))))
    # Shuffle so the classes do not always appear in the same order.
    random.shuffle(selected)
    return selected


def build_few_shot_prompt(examples: list[Example], task_input: str) -> str:
    """Render the task description, the examples, and the new input."""
    lines = [
        "Classify clinical notes into: STAT (immediate), URGENT (within 4 hours), ROUTINE (within 48 hours).\n"
    ]
    for ex in examples:
        lines.append(f"Input: {ex.input_text}")
        lines.append(f"Output: {ex.output_text}\n")
    lines.append(f"Input: {task_input}")
    lines.append("Output:")
    return "\n".join(lines)


def classify_clinical_note(note: str) -> str:
    """Classify one note using freshly sampled, class-balanced examples."""
    examples = select_balanced_examples(EXAMPLE_POOL, n_per_class=1)
    prompt = build_few_shot_prompt(examples, note)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=10,  # a label needs only a few tokens
    )
    # content can be None, so fall back to an empty string before stripping.
    return (response.choices[0].message.content or "").strip()


# Test
test_notes = [
    "Patient with sudden severe headache, worst of life, neck stiffness.",
    "Mild rash on forearm for 3 days, no systemic symptoms.",
    "Chest tightness and palpitations for 2 hours, hemodynamically stable.",
]
for note in test_notes:
    label = classify_clinical_note(note)
    print(f"[{label}] {note}")

Chain of Thought in Few-Shot Examples
Standard few-shot examples show input → output directly. But for reasoning-heavy tasks, you can include the intermediate reasoning steps in your examples. This is called few-shot chain of thought.
Classify the medication interaction risk. Show your reasoning, then give the final risk level.
Input: Patient on warfarin (INR 2.5) prescribed ciprofloxacin for a UTI.
Reasoning: Ciprofloxacin inhibits CYP1A2 and has some CYP2C9 interaction.
Warfarin is primarily metabolized by CYP2C9. Fluoroquinolones can significantly
increase INR. This is a clinically significant interaction requiring INR monitoring.
Output: HIGH — Monitor INR within 2-3 days of starting ciprofloxacin.
Input: Patient on lisinopril 10mg daily prescribed ibuprofen for knee pain.
Reasoning: NSAIDs reduce renal prostaglandins, causing sodium/water retention.
This counteracts ACE inhibitor's antihypertensive effect and can reduce GFR.
Combined use increases risk of acute kidney injury, especially in elderly patients.
Output: MODERATE — Prefer acetaminophen; if NSAID required, monitor renal function and BP.
Input: Patient on metformin prescribed contrast dye for CT scan.
Reasoning:
The model will generate reasoning in the same format before giving its answer. This often improves accuracy on multi-step tasks such as complex medical, legal, and mathematical questions.
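Because the completion now contains reasoning followed by an "Output:" line, downstream code has to separate the two. The sketch below shows one way to do that, assuming the model follows the "Reasoning ... Output: LEVEL - advice" shape of the examples. `split_cot_completion` and `extract_risk_level` are hypothetical helper names, and the sample completion is placeholder text, not real model output.

```python
import re
from typing import Optional, Tuple


def split_cot_completion(completion: str) -> Tuple[str, Optional[str]]:
    """Split '<reasoning> Output: <answer>' into (reasoning, answer).

    Uses the last "Output:" marker, so a marker quoted inside the
    reasoning does not end the reasoning early. Returns
    (full_text, None) when no marker is present.
    """
    reasoning, marker, answer = completion.rpartition("Output:")
    if not marker:
        return completion.strip(), None
    return reasoning.strip(), answer.strip()


def extract_risk_level(answer: Optional[str]) -> Optional[str]:
    """Return the leading upper-case word of the answer (e.g. 'HIGH')."""
    if not answer:
        return None
    match = re.match(r"[A-Z]+", answer)
    return match.group(0) if match else None


# Placeholder completion, written by hand for illustration.
sample = "First reasoning step. Second reasoning step.\nOutput: MODERATE - placeholder recommendation"
reasoning, answer = split_cot_completion(sample)
print(extract_risk_level(answer))
```

Keeping the reasoning text alongside the parsed level makes it possible to log or review why a given risk level was assigned.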
Formatting Few-Shot Examples
Your delimiter choice affects parsing reliability. Common patterns:
Pattern 1: Label prefix
Input: text here
Output: label here
Pattern 2: XML-style delimiters (more robust)
<example>
<input>text here</input>
<output>label here</output>
</example>
Pattern 3: Markdown headers
### Example 1
**Input:** text here
**Output:** label here
Pattern 4: Q&A (for conversational tasks)
Q: question here
A: answer here
The XML-style pattern is usually the most reliable for complex prompts because it is unambiguous: the model can clearly identify where each example starts and ends.
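Rendering examples in the XML-style pattern is straightforward to automate. The sketch below is one possible implementation; `format_examples_xml` is a hypothetical name, and escaping `<`, `>`, and `&` in the example text is a design choice assumed here so that clinical shorthand such as "O2 sat < 91%" cannot be mistaken for a tag.

```python
from xml.sax.saxutils import escape


def format_examples_xml(examples, task_input):
    """Render (input, output) pairs as <example> blocks plus the new input.

    The prompt ends with the new <input> element; the task description
    and any instruction about the expected <output> are left to the caller.
    """
    blocks = []
    for inp, out in examples:
        blocks.append(
            "<example>\n"
            f"<input>{escape(inp)}</input>\n"
            f"<output>{escape(out)}</output>\n"
            "</example>"
        )
    blocks.append(f"<input>{escape(task_input)}</input>")
    return "\n".join(blocks)


print(format_examples_xml(
    [("Worsening dyspnea, O2 sat < 91% on room air", "URGENT")],
    "Mild rash on forearm for 3 days",
))
```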
Few-Shot for Structured Output
Few-shot is especially powerful for teaching the model a complex output schema:
import json

import openai

# The client reads OPENAI_API_KEY from the environment.
client = openai.OpenAI()

# Task description plus three worked examples that demonstrate the schema.
DRUG_EXTRACTION_EXAMPLES = """
Extract medication information from clinical notes.
Input: "Start metformin 500mg twice daily with meals. Titrate to 1000mg BID over 4 weeks."
Output: {"medications": [{"name": "metformin", "dose_mg": 500, "frequency": "BID", "route": "oral", "instructions": "with meals", "titration": "increase to 1000mg BID over 4 weeks"}]}
Input: "Continue lisinopril 10mg daily. Add amlodipine 5mg daily for BP control."
Output: {"medications": [{"name": "lisinopril", "dose_mg": 10, "frequency": "daily", "route": "oral", "instructions": null, "titration": null}, {"name": "amlodipine", "dose_mg": 5, "frequency": "daily", "route": "oral", "instructions": "for BP control", "titration": null}]}
Input: "Prescribe amoxicillin 875mg PO BID for 7 days for sinusitis."
Output: {"medications": [{"name": "amoxicillin", "dose_mg": 875, "frequency": "BID", "route": "oral", "instructions": "7 days for sinusitis", "titration": null}]}
"""


def extract_medications(note: str) -> dict:
    """Extract medications from a note as a dict matching the example schema."""
    prompt = DRUG_EXTRACTION_EXAMPLES + f'\nInput: "{note}"\nOutput:'
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a clinical NLP system. Always return valid JSON matching the schema shown in examples.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        # JSON mode: the reply is constrained to a single JSON object.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


note = "Add empagliflozin 10mg daily to regimen. Decrease glipizide to 5mg daily."
result = extract_medications(note)
print(json.dumps(result, indent=2))

Measuring Few-Shot Quality
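For structured output such as the medication JSON from the previous section, a first quality check is whether each response matches the schema the examples demonstrate. The sketch below is one way to do that. `REQUIRED_KEYS` and `validate_extraction` are hypothetical names, and the key set is copied from the example outputs rather than from any formal schema.

```python
# Assumption: these are the keys shown in the few-shot example outputs.
REQUIRED_KEYS = {"name", "dose_mg", "frequency", "route", "instructions", "titration"}


def validate_extraction(result) -> list:
    """Return a list of schema problems; an empty list means none were found."""
    if not isinstance(result, dict):
        return ["top-level value is not an object"]
    meds = result.get("medications")
    if not isinstance(meds, list):
        return ["'medications' is missing or not a list"]
    problems = []
    for i, med in enumerate(meds):
        if not isinstance(med, dict):
            problems.append(f"medications[{i}] is not an object")
            continue
        missing = REQUIRED_KEYS - med.keys()
        if missing:
            problems.append(f"medications[{i}] missing keys: {sorted(missing)}")
        dose = med.get("dose_mg")
        if dose is not None and not isinstance(dose, (int, float)):
            problems.append(f"medications[{i}].dose_mg is not a number")
    return problems


# Hand-written sample with a string dose and several missing keys.
sample = {"medications": [{"name": "metformin", "dose_mg": "500mg"}]}
for problem in validate_extraction(sample):
    print(problem)
```

Counting how many held-out notes produce an empty problem list gives a schema-compliance rate that can be tracked alongside accuracy.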
Always evaluate few-shot prompts on a held-out test set before deploying.
from typing import Callable

from sklearn.metrics import classification_report


def evaluate_classifier(
    test_cases: list[tuple[str, str]],
    classifier_fn: Callable[[str], str],
) -> None:
    """Evaluate a prompt-based classifier on labeled test cases."""
    y_true, y_pred = [], []
    for input_text, true_label in test_cases:
        pred_label = classifier_fn(input_text)
        # Normalize case and whitespace so "routine " still matches "ROUTINE".
        y_true.append(true_label.strip().upper())
        y_pred.append(pred_label.strip().upper())
        print(f"Expected: {true_label:8} | Got: {pred_label:8} | Input: {input_text[:50]}")
    print("\n--- Classification Report ---")
    # zero_division=0 avoids warnings for classes that are never predicted.
    print(classification_report(y_true, y_pred, zero_division=0))


# Test data
test_data = [
    ("Sudden vision loss in one eye, no pain", "STAT"),
    ("Patient wants flu vaccine", "ROUTINE"),
    ("Fever 38.8C, dysuria, flank pain; pyelonephritis suspected", "URGENT"),
    ("Routine cholesterol check requested", "ROUTINE"),
    ("New onset seizure, post-ictal", "STAT"),
]
evaluate_classifier(test_data, classify_clinical_note)

Summary
Few-shot prompting is one of the highest-leverage techniques for improving output quality without changing the model. Remember:
- Use 3 to 5 examples for most tasks
- Balance examples across categories
- Include edge cases that reflect real ambiguity
- For reasoning tasks, include the reasoning steps in your examples (few-shot CoT)
- Use XML or labeled delimiters for reliable parsing
- Always evaluate on held-out data before deploying
The next lesson covers chain-of-thought prompting: the technique of making the model show its work, which can substantially improve accuracy on multi-step reasoning tasks.