Learnixo
AI Systems · Intermediate

Few-Shot Prompting

Provide examples in the prompt — input/output pairs, how many to use, choosing diverse examples, and chain of thought in few-shot.

Asma Hafeez Khan · May 15, 2026 · 8 min read
Tags: Prompt Engineering, LLM, Few-Shot, In-Context Learning, Examples

Few-shot prompting is the technique of including a small number of input/output examples directly in the prompt, before the actual task input. The model learns the pattern, format, and style from these examples and applies it to the new input.

This is one of the most powerful techniques in the prompt engineer's toolkit. It can transform a model's output quality dramatically — not by changing the model's weights, but by showing it exactly what you want.

The Core Mechanism

When you include examples in the prompt, you are exploiting the model's in-context learning ability. The model picks up patterns from the tokens in its context and generalizes from them at inference time, without any gradient updates or changes to its weights.

Think of it as: every example you provide is a data point that shapes the model's interpretation of your task.

Basic Few-Shot Format

TEXT
[Task description]

Input: [example 1 input]
Output: [example 1 output]

Input: [example 2 input]
Output: [example 2 output]

Input: [example 3 input]
Output: [example 3 output]

Input: [actual input]
Output:

The model sees the pattern "Input: X → Output: Y" repeated and completes the final "Output:" accordingly.

Example: Clinical Note Triage Classification

Without few-shot:

TEXT
Classify the urgency of this clinical note: "Patient complains of mild headache for 2 days."

The model might give a verbose paragraph instead of a clean label.

With few-shot:

TEXT
Classify clinical notes into urgency tiers: STAT (immediate), URGENT (within 4 hours), ROUTINE (within 48 hours).

Input: "Patient with crushing chest pain radiating to left arm, diaphoretic."
Output: STAT

Input: "Patient has fever 39.2°C and productive cough for 3 days."
Output: URGENT

Input: "Patient requests medication refill for chronic hypertension."
Output: ROUTINE

Input: "Patient complains of mild headache for 2 days, no neurological symptoms."
Output:

The model completes the prompt with: ROUTINE

With the examples in place, the model typically returns just the label in the expected format, with no explanation or hedging.
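Even with few-shot examples, a model can occasionally return extra words or different casing. The following is a minimal sketch of a guard that maps a raw response onto one of the three tiers. The helper name `normalize_label` and the choice to return `None` for unrecognized text are assumptions made for illustration, not part of the original pipeline.

```python
import re
from typing import Optional

ALLOWED_LABELS = ("STAT", "URGENT", "ROUTINE")


def normalize_label(raw: str) -> Optional[str]:
    """Return the first allowed label found in the raw model text, or None."""
    # Uppercase, then split on anything that is not a letter, so that
    # "Output: routine." becomes ["OUTPUT", "ROUTINE"].
    tokens = [t for t in re.split(r"[^A-Z]+", raw.upper()) if t]
    for token in tokens:
        if token in ALLOWED_LABELS:
            return token
    return None


print(normalize_label(" routine.\n"))
print(normalize_label("Output: STAT"))
print(normalize_label("I am not sure"))
```

Returning `None` instead of guessing lets the caller decide what to do with an unparseable response, for example retrying or routing the note to a human.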

How Many Examples to Use

Published results and practical experience both suggest that 3 to 5 examples is a good starting point for most tasks:

  • 1 example: Helps with format but provides almost no task learning.
  • 3 examples: Usually enough to establish pattern, format, and edge cases.
  • 5 examples: Recommended for complex classification or specialized formats.
  • 10 or more examples: Diminishing returns; may push useful context out of the window; consider fine-tuning instead.

There are exceptions:

  • If your classes/categories number more than 5, try to include at least one example per class.
  • For very complex reasoning tasks, more examples help — up to the model's context limit.
  • For simple format tasks (JSON extraction), 2 examples are often sufficient.
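To make the trade-off between example count and context space concrete, here is a rough sketch that keeps adding examples until an assumed token budget is reached. The 4-characters-per-token ratio is a crude heuristic rather than a real tokenizer, and `estimate_tokens` and `fit_examples_to_budget` are hypothetical helpers; substitute your model's tokenizer for real counts.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: assume roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def fit_examples_to_budget(
    examples: list[tuple[str, str]], max_tokens: int
) -> list[tuple[str, str]]:
    """Keep examples in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for inp, out in examples:
        cost = estimate_tokens(f"Input: {inp}\nOutput: {out}\n")
        if used + cost > max_tokens:
            break
        kept.append((inp, out))
        used += cost
    return kept


examples = [
    ("Crushing chest pain, diaphoretic", "STAT"),
    ("Fever 39.2C and productive cough for 3 days", "URGENT"),
    ("Medication refill for chronic hypertension", "ROUTINE"),
]
kept = fit_examples_to_budget(examples, max_tokens=30)
print(f"Kept {len(kept)} of {len(examples)} examples")
```

Because the loop stops at the first example that does not fit, the order of the list matters: put the examples you most want to keep first.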

Choosing Diverse, Representative Examples

The quality of your examples matters far more than the quantity. Bad examples actively hurt performance.

Principles for good example selection:

1. Cover the space of variation

If your inputs vary in length, formality, domain, or complexity, your examples should reflect that variation. Don't use 3 examples that are all short, informal, and simple.

2. Include edge cases

If you expect some inputs to be ambiguous or borderline, include an example that shows how to handle ambiguity:

TEXT
Input: "Patient reports fatigue and mild shortness of breath when climbing stairs."
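Output: URGENT

The rationale here (dyspnea warrants prompt evaluation even when mild) belongs in your own notes rather than in the prompt. An inline comment after the label can teach the model to append comments to its own outputs.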

3. Balance across classes

For classification tasks, aim for roughly equal representation across categories. If you include 4 ROUTINE examples and 1 STAT example, the model is likely to be biased toward ROUTINE.

4. Use real data when possible

Examples from your actual domain (anonymized if needed) are almost always better than examples you wrote yourself. They capture authentic phrasing and edge cases.
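The balance and variation principles above can be checked mechanically before a prompt ships. This is a minimal sketch, assuming examples are stored as `(input, label)` tuples. The helper `summarize_examples` is hypothetical, and the "more than twice the rarest class" threshold is an arbitrary choice for illustration.

```python
from collections import Counter


def summarize_examples(examples: list[tuple[str, str]]) -> dict:
    """Report label counts and the spread of input lengths (in words)."""
    label_counts = Counter(label for _, label in examples)
    lengths = [len(text.split()) for text, _ in examples]
    return {
        "label_counts": dict(label_counts),
        "min_words": min(lengths),
        "max_words": max(lengths),
        # Flag the set if any label appears more than twice as often as the rarest.
        "imbalanced": max(label_counts.values()) > 2 * min(label_counts.values()),
    }


examples = [
    ("Medication refill request", "ROUTINE"),
    ("Annual physical exam scheduling", "ROUTINE"),
    ("Follow-up for chronic low back pain", "ROUTINE"),
    ("Flu vaccine request", "ROUTINE"),
    ("Crushing chest pain, diaphoretic", "STAT"),
]
print(summarize_examples(examples))
```

The sample set mirrors the 4 ROUTINE to 1 STAT case described above, so the sketch flags it as imbalanced.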

Python: Dynamic Few-Shot with Example Selection

Python
import openai
from dataclasses import dataclass
import random

client = openai.OpenAI()

@dataclass
class Example:
    input_text: str
    output_text: str
    category: str  # for balanced selection


EXAMPLE_POOL = [
    Example("Crushing chest pain, diaphoretic, arm radiation", "STAT", "STAT"),
    Example("Altered mental status, confusion, new onset", "STAT", "STAT"),
    Example("Anaphylaxis, throat swelling after bee sting", "STAT", "STAT"),
    Example("Fever 39.5C, rigors, suspected sepsis", "URGENT", "URGENT"),
    Example("Worsening dyspnea, O2 sat 91% on room air", "URGENT", "URGENT"),
    Example("New onset atrial fibrillation, hemodynamically stable", "URGENT", "URGENT"),
    Example("Medication refill request, stable patient", "ROUTINE", "ROUTINE"),
    Example("Annual physical exam scheduling", "ROUTINE", "ROUTINE"),
    Example("Follow-up for chronic low back pain", "ROUTINE", "ROUTINE"),
]


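def select_balanced_examples(pool: list[Example], n_per_class: int = 1) -> list[Example]:
    """Select up to n_per_class random examples from each category."""
    # Sort the categories so iteration order does not depend on set ordering.
    categories = sorted({e.category for e in pool})
    selected = []
    for cat in categories:
        cat_examples = [e for e in pool if e.category == cat]
        # Cap the sample size in case a category has fewer than n_per_class examples.
        selected.extend(random.sample(cat_examples, min(n_per_class, len(cat_examples))))
    # Shuffle so the examples are not grouped by class in the prompt.
    random.shuffle(selected)
    return selected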


def build_few_shot_prompt(examples: list[Example], task_input: str) -> str:
    lines = [
        "Classify clinical notes into: STAT (immediate), URGENT (within 4 hours), ROUTINE (within 48 hours).\n"
    ]
    for ex in examples:
        lines.append(f"Input: {ex.input_text}")
        lines.append(f"Output: {ex.output_text}\n")
    lines.append(f"Input: {task_input}")
    lines.append("Output:")
    return "\n".join(lines)


def classify_clinical_note(note: str) -> str:
    examples = select_balanced_examples(EXAMPLE_POOL, n_per_class=1)
    prompt = build_few_shot_prompt(examples, note)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=10,
    )
    # content can be None (for example, on a refusal), so fall back to an empty string.
    return (response.choices[0].message.content or "").strip()


# Test
test_notes = [
    "Patient with sudden severe headache, worst of life, neck stiffness.",
    "Mild rash on forearm for 3 days, no systemic symptoms.",
    "Chest tightness and palpitations for 2 hours, hemodynamically stable.",
]
for note in test_notes:
    label = classify_clinical_note(note)
    print(f"[{label}] {note}")
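Random selection is one option. Another common variant is to pick the pool examples most similar to the incoming note. The sketch below is not the method used in the code above: it swaps in word-overlap (Jaccard) similarity as a simple stand-in for embedding-based retrieval, and `tokenize`, `jaccard`, and `select_similar_examples` are hypothetical helpers.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_similar_examples(
    pool: list[tuple[str, str]], query: str, k: int = 2
) -> list[tuple[str, str]]:
    """Return the k pool examples with the highest word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(tokenize(ex[0]), query_tokens),
        reverse=True,
    )
    return ranked[:k]


pool = [
    ("Crushing chest pain, diaphoretic, arm radiation", "STAT"),
    ("Fever 39.5C, rigors, suspected sepsis", "URGENT"),
    ("Medication refill request, stable patient", "ROUTINE"),
]
print(select_similar_examples(pool, "Chest pain and sweating", k=1))
```

Selecting purely by similarity can undo class balance, since the nearest examples may all share one label. One possible compromise is to take the most similar example within each class.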

Chain of Thought in Few-Shot Examples

Standard few-shot examples show input → output directly. But for reasoning-heavy tasks, you can include the intermediate reasoning steps in your examples. This is called few-shot chain of thought.

TEXT
Classify the medication interaction risk. Show your reasoning, then give the final risk level.

Input: Patient on warfarin (INR 2.5) prescribed ciprofloxacin for a UTI.
Reasoning: Ciprofloxacin inhibits CYP1A2, which helps clear R-warfarin; the more potent
S-warfarin is metabolized primarily by CYP2C9. Fluoroquinolones can significantly
increase INR. This is a clinically significant interaction requiring INR monitoring.
Output: HIGH - Monitor INR within 2-3 days of starting ciprofloxacin.

Input: Patient on lisinopril 10mg daily prescribed ibuprofen for knee pain.
Reasoning: NSAIDs reduce renal prostaglandins, causing sodium/water retention.
This counteracts ACE inhibitor's antihypertensive effect and can reduce GFR.
Combined use increases risk of acute kidney injury, especially in elderly patients.
Output: MODERATE - Prefer acetaminophen; if NSAID required, monitor renal function and BP.

Input: Patient on metformin prescribed contrast dye for CT scan.
Reasoning:

The model will generate reasoning in the same format before giving its answer. This often improves accuracy on tasks that require multi-step reasoning, such as complex medical, legal, and mathematical problems.
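Because a chain-of-thought response mixes reasoning with the answer, downstream code usually needs to pull the final label out. This is a minimal sketch, assuming the model follows the `Reasoning: ... Output: LEVEL` layout of the examples. The helper `parse_cot_response` is hypothetical, and the `LOW` level is an assumption (only HIGH and MODERATE appear in the examples). The sample text is a placeholder, not real model output.

```python
import re
from typing import Optional

RISK_LEVELS = ("HIGH", "MODERATE", "LOW")  # LOW is assumed for illustration


def parse_cot_response(text: str) -> tuple[str, Optional[str]]:
    """Split a response into (reasoning, risk level); level is None if absent."""
    # Use the last line that starts with "Output:" so that the word "Output"
    # appearing inside the reasoning does not confuse the parser.
    matches = list(re.finditer(r"^Output:\s*(\w+)", text, flags=re.MULTILINE))
    if not matches:
        return text.strip(), None
    last = matches[-1]
    level = last.group(1).upper()
    reasoning = text[: last.start()].strip()
    # Drop a leading "Reasoning:" tag if the model echoed it.
    reasoning = re.sub(r"^Reasoning:\s*", "", reasoning)
    return reasoning, level if level in RISK_LEVELS else None


sample = (
    "Reasoning: (model reasoning text here)\n"
    "Output: MODERATE - (recommendation here)"
)
reasoning, level = parse_cot_response(sample)
print(level)
print(reasoning)
```

Keeping the reasoning alongside the label is useful for auditing: a reviewer can see why a given risk level was assigned.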

Formatting Few-Shot Examples

Your delimiter choice affects parsing reliability. Common patterns:

Pattern 1: Label prefix

TEXT
Input: text here
Output: label here

Pattern 2: XML-style delimiters (more robust)

TEXT
<example>
<input>text here</input>
<output>label here</output>
</example>

Pattern 3: Markdown headers

TEXT
### Example 1
**Input:** text here
**Output:** label here

Pattern 4: Q&A (for conversational tasks)

TEXT
Q: question here
A: answer here

XML-style delimiters tend to be the most reliable choice for complex prompts because they are unambiguous: the model can clearly identify where each example starts and ends.
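As an illustration of Pattern 2, the sketch below renders examples with XML-style delimiters. It assumes examples are `(input, output)` tuples and uses the standard library's `xml.sax.saxutils.escape` so that characters such as `<` in a note cannot be mistaken for tags. `render_xml_examples` is a hypothetical helper, and ending the prompt on an open `<output>` tag is one possible convention among several.

```python
from xml.sax.saxutils import escape


def render_xml_examples(examples: list[tuple[str, str]], task_input: str) -> str:
    """Render examples and the final input using XML-style delimiters."""
    parts = []
    for inp, out in examples:
        # escape() replaces &, <, and > so note text cannot break the tags.
        parts.append(
            "<example>\n"
            f"<input>{escape(inp)}</input>\n"
            f"<output>{escape(out)}</output>\n"
            "</example>"
        )
    # Leave the final <output> tag open for the model to complete.
    parts.append(f"<input>{escape(task_input)}</input>\n<output>")
    return "\n".join(parts)


prompt = render_xml_examples(
    [("Worsening dyspnea, O2 sat < 92% on room air", "URGENT")],
    "Mild rash on forearm for 3 days",
)
print(prompt)
```

With this layout you would likely also want to stop generation at `</output>`, for example through a stop sequence if your API supports one.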

Few-Shot for Structured Output

Few-shot is especially powerful for teaching the model a complex output schema:

Python
import openai
import json

client = openai.OpenAI()

DRUG_EXTRACTION_EXAMPLES = """
Extract medication information from clinical notes.

Input: "Start metformin 500mg twice daily with meals. Titrate to 1000mg BID over 4 weeks."
Output: {"medications": [{"name": "metformin", "dose_mg": 500, "frequency": "BID", "route": "oral", "instructions": "with meals", "titration": "increase to 1000mg BID over 4 weeks"}]}

Input: "Continue lisinopril 10mg daily. Add amlodipine 5mg daily for BP control."
Output: {"medications": [{"name": "lisinopril", "dose_mg": 10, "frequency": "daily", "route": "oral", "instructions": null, "titration": null}, {"name": "amlodipine", "dose_mg": 5, "frequency": "daily", "route": "oral", "instructions": "for BP control", "titration": null}]}

Input: "Prescribe amoxicillin 875mg PO BID for 7 days for sinusitis."
Output: {"medications": [{"name": "amoxicillin", "dose_mg": 875, "frequency": "BID", "route": "oral", "instructions": "7 days for sinusitis", "titration": null}]}
"""

def extract_medications(note: str) -> dict:
    prompt = DRUG_EXTRACTION_EXAMPLES + f'\nInput: "{note}"\nOutput:'

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a clinical NLP system. Always return valid JSON matching the schema shown in examples.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


note = "Add empagliflozin 10mg daily to regimen. Decrease glipizide to 5mg daily."
result = extract_medications(note)
print(json.dumps(result, indent=2))
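JSON mode guarantees syntactically valid JSON, not that the keys match the schema shown in the examples. A minimal sketch of a schema check follows, assuming the six keys used in the examples above are all required. `validate_medications` is a hypothetical helper, not part of any library.

```python
import json

REQUIRED_KEYS = {"name", "dose_mg", "frequency", "route", "instructions", "titration"}


def validate_medications(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    meds = payload.get("medications")
    if not isinstance(meds, list):
        return ["'medications' must be a list"]
    problems = []
    for i, med in enumerate(meds):
        if not isinstance(med, dict):
            problems.append(f"medications[{i}] is not an object")
            continue
        missing = REQUIRED_KEYS - set(med)
        if missing:
            problems.append(f"medications[{i}] missing keys: {sorted(missing)}")
        if not isinstance(med.get("dose_mg"), (int, float)):
            problems.append(f"medications[{i}].dose_mg is not a number")
    return problems


good = json.loads(
    '{"medications": [{"name": "metformin", "dose_mg": 500, "frequency": "BID", '
    '"route": "oral", "instructions": "with meals", "titration": null}]}'
)
bad = json.loads('{"medications": [{"name": "metformin", "dose_mg": "500mg"}]}')
print(validate_medications(good))
print(validate_medications(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that went wrong with a single response.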

Measuring Few-Shot Quality

Always evaluate few-shot prompts on a held-out test set before deploying.

Python
from typing import Callable

from sklearn.metrics import classification_report

def evaluate_classifier(
    test_cases: list[tuple[str, str]],
    classifier_fn: Callable[[str], str],
) -> None:
    """Evaluate a prompt-based classifier on labeled test cases."""
    y_true, y_pred = [], []
    for input_text, true_label in test_cases:
        # Normalize both sides so "stat " and "STAT" count as the same label.
        pred_label = classifier_fn(input_text).strip().upper()
        y_true.append(true_label.strip().upper())
        y_pred.append(pred_label)
        print(f"Expected: {true_label:8} | Got: {pred_label:8} | Input: {input_text[:50]}")

    print("\n--- Classification Report ---")
    # zero_division=0 avoids warnings when a class is never predicted.
    print(classification_report(y_true, y_pred, zero_division=0))


# Test data
test_data = [
    ("Sudden vision loss in one eye, no pain", "STAT"),
    ("Patient wants flu vaccine", "ROUTINE"),
    ("Fever 38.8C, dysuria, flank pain — pyelonephritis suspected", "URGENT"),
    ("Routine cholesterol check requested", "ROUTINE"),
    ("New onset seizure, post-ictal", "STAT"),
]

evaluate_classifier(test_data, classify_clinical_note)
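A test set is only held out if none of its inputs also appear as few-shot examples. The sketch below checks for that kind of leakage, assuming both sets are plain lists of strings. `normalize` and `find_leaks` are hypothetical helpers, and the comparison only catches exact matches after lowercasing and whitespace cleanup, not paraphrases.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def find_leaks(example_inputs: list[str], test_inputs: list[str]) -> list[str]:
    """Return test inputs that also appear (after normalization) in the examples."""
    seen = {normalize(t) for t in example_inputs}
    return [t for t in test_inputs if normalize(t) in seen]


example_inputs = [
    "Annual physical exam scheduling",
    "Fever 39.5C, rigors, suspected sepsis",
]
test_inputs = [
    "Patient wants flu vaccine",
    "annual physical  exam scheduling",
]
print(find_leaks(example_inputs, test_inputs))
```

Any input this reports should be removed from either the example pool or the test set before you trust the evaluation numbers.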

Summary

Few-shot prompting is one of the highest-leverage techniques for improving output quality without changing the model. Remember:

  • Use 3 to 5 examples for most tasks
  • Balance examples across categories
  • Include edge cases that reflect real ambiguity
  • For reasoning tasks, include the reasoning steps in your examples (few-shot CoT)
  • Use XML or labeled delimiters for reliable parsing
  • Always evaluate on held-out data before deploying

The next lesson covers chain-of-thought prompting: the technique of making the model show its work, which often improves accuracy on multi-step reasoning tasks.
