Self-Reflection Pattern

LLMs make mistakes. They hallucinate facts, miss edge cases, and sometimes produce subtly wrong answers that sound confident. For low-stakes applications, you might accept this. For high-stakes domains — medical information, legal advice, financial analysis, code that will run in production — you need a better answer.

The Self-Reflection pattern addresses this by adding a critique step: after the agent generates an answer, it evaluates that answer against a set of criteria, identifies weaknesses, and then refines the answer based on the critique. This loop repeats until the answer meets a quality threshold or hits a maximum iteration count.

The Generator-Critic-Refiner Loop

User Query
    │
    ▼
Generator LLM ──► Initial Answer
                        │
                        ▼
                  Critic LLM ──► Critique + Score
                        │
                   Score ≥ threshold?
                   Yes ──────────────► Return Answer
                   No
                        │
                        ▼
                  Refiner LLM ──► Improved Answer
                        │
                        ▼
                  (back to Critic)
                  ... max 3 iterations

The generator runs once. After that, each iteration costs one critic call plus, if the score falls short of the threshold, one refiner call. The critic and refiner can use cheaper models than the generator.
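
To make the call budget concrete, here is a small sketch that counts calls for a loop shaped like the diagram. It assumes the generator runs once, every pass runs the critic, and every pass except the last is followed by a refiner call. The function names and the per-call timings are hypothetical, chosen only for illustration.

```python
def reflection_calls(critic_passes: int) -> dict[str, int]:
    """Count LLM calls for a run that stops after `critic_passes` critiques.

    Assumed loop shape: the generator runs once, every pass runs the critic,
    and every pass except the last one is followed by a refiner call.
    """
    if critic_passes < 1:
        raise ValueError("critic_passes must be at least 1")
    calls = {"generator": 1, "critic": critic_passes, "refiner": critic_passes - 1}
    calls["total"] = sum(calls.values())
    return calls


def estimated_latency_s(
    critic_passes: int, generator_s: float, critic_s: float, refiner_s: float
) -> float:
    """Sequential latency estimate; the per-call timings are caller-supplied guesses."""
    calls = reflection_calls(critic_passes)
    return (
        calls["generator"] * generator_s
        + calls["critic"] * critic_s
        + calls["refiner"] * refiner_s
    )


print(reflection_calls(1))  # threshold met on the first critique: 2 calls in total
print(reflection_calls(3))  # all three passes used: 6 calls in total
print(estimated_latency_s(3, generator_s=4.0, critic_s=2.0, refiner_s=3.0))  # 16.0
```

Under these assumptions a run never costs more than twice the number of critic passes, which gives a quick upper bound when deciding whether the pattern fits a latency budget.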

Why This Works

Self-critique leverages a key asymmetry in LLM capability: it is easier to evaluate an answer than to generate a perfect one on the first try. A model that cannot generate a perfect response cold might nonetheless correctly identify problems in a flawed response when presented with it.

This is why code review exists in software engineering. You may not write bug-free code, but with a review pass you catch most issues. Self-reflection is the same idea applied to LLM outputs.

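Two lines of research build on this idea. Constitutional AI (Anthropic, 2022) uses model-written critiques and revisions to make responses less harmful, and Self-Refine (Madaan et al., 2023) reports gains from iterative feedback and refinement across a range of tasks, including code and dialogue, although the size of the gain varies by task.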

When to Apply Self-Reflection

Use it when:

Factual accuracy is critical (medical, scientific)
Legal or financial consequences exist
Code will run in production without human review
Output format must be exact (JSON schema, structured reports)
Creative quality matters and you can afford the latency

Skip it when:

Latency budget is tight (each iteration adds one or more LLM calls)
The task is low-stakes (casual chat, simple lookups)
The generator model is already producing high-quality output
You are doing exploratory research where "good enough" is fine

Critic Prompt Design

The critic prompt is the most important part of this pattern. A vague critic produces vague critique that does not help the refiner. A well-designed critic prompt specifies exact evaluation criteria:

Python

def build_critic_prompt(question: str, answer: str, domain: str) -> str:
    criteria = {
        "medical": [
            "Are all dosage figures accurate and within safe ranges?",
            "Are drug interactions mentioned where relevant?",
            "Is the advice appropriately cautious? Does it recommend consulting a doctor?",
            "Is any claim made that cannot be verified?",
            "Is the answer free from fabricated drug names or studies?",
        ],
        "code": [
            "Does the code have any syntax errors?",
            "Are there potential runtime errors (null pointer, index out of bounds)?",
            "Does the code handle edge cases (empty input, negative numbers, etc.)?",
            "Is the logic correct for the stated problem?",
            "Are there security vulnerabilities (SQL injection, eval on user input)?",
        ],
        "legal": [
            "Does the answer avoid giving specific legal advice?",
            "Are jurisdictional differences mentioned where relevant?",
            "Is the answer free from definitive statements about case outcomes?",
            "Does it recommend consulting a qualified lawyer?",
        ],
    }

    domain_criteria = criteria.get(domain, ["Is the answer accurate?", "Is it complete?"])
    criteria_list = "\n".join(f"- {c}" for c in domain_criteria)

    return f"""You are a rigorous critic evaluating an AI-generated answer.

Question: {question}

Answer to evaluate:
{answer}

Evaluate the answer against these criteria:
{criteria_list}

Provide your evaluation in this exact JSON format:
{{
  "score": <integer 1-10>,
  "issues": ["issue 1", "issue 2", ...],
  "strengths": ["strength 1", "strength 2", ...],
  "suggestion": "One concrete suggestion for the most important improvement"
}}

Score 10 means the answer is excellent with no issues.
Score 7-9 means good but improvable.
Score 4-6 means significant issues that should be fixed.
Score 1-3 means major problems — the answer may be harmful or wrong.

Return ONLY valid JSON.
"""

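The prompt asks for a strict JSON shape, so it helps to check the reply against that shape before acting on the score. The following is a minimal sketch rather than part of the article's implementation: `ParsedCritique` and `parse_critique` are hypothetical names, and rejecting a bad reply (instead of substituting defaults) is one possible design choice.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ParsedCritique:
    score: int
    issues: list[str] = field(default_factory=list)
    strengths: list[str] = field(default_factory=list)
    suggestion: str = ""


def parse_critique(raw: str) -> ParsedCritique:
    """Validate the critic's JSON reply; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"critic did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("critic reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")

    def string_list(key: str) -> list[str]:
        value = data.get(key, [])
        if not isinstance(value, list):
            raise ValueError(f"{key} must be a list")
        return [str(item) for item in value]

    return ParsedCritique(
        score=score,
        issues=string_list("issues"),
        strengths=string_list("strengths"),
        suggestion=str(data.get("suggestion", "")),
    )


reply = '{"score": 6, "issues": ["No mention of liver disease"], "strengths": [], "suggestion": "Add risk factors"}'
parsed = parse_critique(reply)
print(parsed.score, parsed.issues)
```

A caller can treat a `ValueError` as a reason to retry the critic call or to fall back to the unrefined answer.
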
Full Implementation

Python

import json
import openai
from dataclasses import dataclass
from typing import Optional

client = openai.OpenAI()


@dataclass
class CritiqueResult:
    score: int
    issues: list[str]
    strengths: list[str]
    suggestion: str


@dataclass
class ReflectionResult:
    final_answer: str
    iterations: int
    final_score: int
    history: list[dict]


def generate(question: str, domain: str, context: Optional[str] = None) -> str:
    """Generate an initial answer to the question."""
    system_prompts = {
        "medical": (
            "You are a medical information assistant. Provide accurate, "
            "well-sourced medical information. Always recommend consulting "
            "a qualified healthcare provider for personal medical decisions."
        ),
        "code": (
            "You are an expert software engineer. Write clean, correct, "
            "well-documented code. Include error handling and edge cases."
        ),
        "legal": (
            "You are a legal information assistant. Provide general legal "
            "information only. Always recommend consulting a qualified lawyer "
            "for specific legal advice."
        ),
    }

    system = system_prompts.get(domain, "You are a helpful assistant.")
    user_message = question
    if context:
        user_message = f"Context from previous critique:\n{context}\n\nQuestion: {question}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content


def critique(question: str, answer: str, domain: str) -> CritiqueResult:
    """Evaluate the answer and return a structured critique."""
    prompt = build_critic_prompt(question, answer, domain)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    data = json.loads(response.choices[0].message.content)
    return CritiqueResult(
        score=data.get("score", 5),
        issues=data.get("issues", []),
        strengths=data.get("strengths", []),
        suggestion=data.get("suggestion", ""),
    )


def refine(question: str, current_answer: str, critique_result: CritiqueResult) -> str:
    """Improve the answer based on the critique."""
    issues_text = "\n".join(f"- {issue}" for issue in critique_result.issues)
    refine_prompt = f"""You previously provided this answer to a question:

Question: {question}

Your previous answer:
{current_answer}

A critic identified these issues:
{issues_text}

The critic's main suggestion: {critique_result.suggestion}

Please provide an improved answer that addresses all identified issues.
Keep what was good and fix what was wrong. Be accurate, complete, and clear.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": refine_prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content


def run_self_reflection(
    question: str,
    domain: str = "medical",
    quality_threshold: int = 8,
    max_iterations: int = 3,
) -> ReflectionResult:
    """
    Run the generator-critic-refiner loop.

    Args:
        question: The user's question
        domain: Domain for domain-specific critique criteria
        quality_threshold: Stop when score reaches this (1-10)
        max_iterations: Maximum number of critique passes (at most max_iterations - 1 refinements)

    Returns:
        ReflectionResult with the final answer and metadata
    """
    history = []

    print(f"Question: {question}")
    print(f"Domain: {domain}, Threshold: {quality_threshold}/10, Max iterations: {max_iterations}\n")

    # Step 1: Generate initial answer
    print("=== GENERATING INITIAL ANSWER ===")
    current_answer = generate(question, domain)
    print(f"Initial answer (first 200 chars): {current_answer[:200]}...\n")

    for iteration in range(max_iterations):
        print(f"=== CRITIQUE ITERATION {iteration + 1} ===")

        # Step 2: Critique
        critique_result = critique(question, current_answer, domain)
        print(f"Score: {critique_result.score}/10")
        print(f"Issues: {critique_result.issues}")
        print(f"Suggestion: {critique_result.suggestion}\n")

        history.append({
            "iteration": iteration + 1,
            "answer_preview": current_answer[:200],
            "score": critique_result.score,
            "issues": critique_result.issues,
        })

        # Step 3: Check if quality threshold is met
        if critique_result.score >= quality_threshold:
            print(f"Quality threshold reached ({critique_result.score} >= {quality_threshold}). Done.")
            return ReflectionResult(
                final_answer=current_answer,
                iterations=iteration + 1,
                final_score=critique_result.score,
                history=history,
            )

        # Step 4: Refine if we have iterations left
        if iteration < max_iterations - 1:
            print(f"=== REFINING (iteration {iteration + 1}) ===")
            current_answer = refine(question, current_answer, critique_result)
            print(f"Refined answer (first 200 chars): {current_answer[:200]}...\n")
        else:
            print("Max iterations reached. Returning the latest answer.")

    if not history:
        # max_iterations < 1: the loop never ran, so score the initial answer once.
        critique_result = critique(question, current_answer, domain)
        history.append({
            "iteration": 0,
            "answer_preview": current_answer[:200],
            "score": critique_result.score,
            "issues": critique_result.issues,
        })

    # The last loop pass critiqued current_answer without refining it, so reuse
    # that critique rather than calling the critic again on the same text.
    return ReflectionResult(
        final_answer=current_answer,
        iterations=max_iterations,
        final_score=critique_result.score,
        history=history,
    )


if __name__ == "__main__":
    result = run_self_reflection(
        question=(
            "What is the maximum safe dose of acetaminophen (paracetamol) "
            "for a healthy adult, and what are the signs of overdose?"
        ),
        domain="medical",
        quality_threshold=8,
        max_iterations=3,
    )

    print("\n=== FINAL RESULT ===")
    print(f"Iterations: {result.iterations}")
    print(f"Final score: {result.final_score}/10")
    print(f"\nFinal Answer:\n{result.final_answer}")

    print("\n=== ITERATION HISTORY ===")
    for h in result.history:
        print(f"Iteration {h['iteration']}: score={h['score']}, issues={h['issues']}")
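
The loop above returns the most recent answer, whatever its score. A refinement pass can occasionally lower the score, so one variation is to remember the highest-scoring draft instead. The sketch below is an assumption layered on the pattern, not part of the implementation above: `reflect_keep_best` and its parameters are hypothetical names, and the critic and refiner are passed in as plain functions so the example runs without an API key.

```python
from typing import Callable, Optional


def reflect_keep_best(
    initial_answer: str,
    critic: Callable[[str], int],
    refiner: Callable[[str], str],
    threshold: int = 8,
    max_iterations: int = 3,
) -> tuple[str, int]:
    """Run the critique/refine loop and return the best (answer, score) seen."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    current = initial_answer
    best_answer = current
    best_score: Optional[int] = None

    for iteration in range(max_iterations):
        score = critic(current)
        # Remember the strongest draft so a later regression cannot replace it.
        if best_score is None or score > best_score:
            best_answer, best_score = current, score
        # Stop on success, or when no passes remain for another critique.
        if score >= threshold or iteration == max_iterations - 1:
            break
        current = refiner(current)

    assert best_score is not None  # the loop ran at least once
    return best_answer, best_score


# Stub critic and refiner: the second draft scores highest, the third regresses.
draft_scores = {"draft A": 6, "draft B": 7, "draft C": 5}
rewrites = {"draft A": "draft B", "draft B": "draft C"}
print(reflect_keep_best("draft A", draft_scores.__getitem__, rewrites.__getitem__))
# → ('draft B', 7)
```

Injecting the critic and refiner as callables also makes the loop easy to unit test, since the model calls can be replaced with deterministic stubs.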

Domain-Specific Critic Templates

The power of this pattern scales with the quality of the critic prompt. Here is a code-focused critic:

Python

CODE_CRITIC_TEMPLATE = """You are a senior software engineer reviewing this code.

Original task: {task}

Code to review:
{code}

Check for:
1. Correctness: Does it solve the stated task?
2. Edge cases: Empty input, None values, zero, negative numbers
3. Error handling: Are exceptions caught appropriately?
4. Security: eval on user input, SQL injection, path traversal
5. Performance: O(n^2) where O(n) would work, unnecessary API calls
6. Readability: Clear variable names, appropriate comments

Return JSON:
{{
  "score": <1-10>,
  "issues": ["specific issue 1", ...],
  "strengths": ["what is good", ...],
  "suggestion": "most important fix"
}}
"""

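A template written this way is presumably meant to be filled with `str.format`, which is why the braces around the JSON example are doubled. A minimal sketch with a shortened, hypothetical template (`MINI_CRITIC_TEMPLATE` and `fill_template` are illustrative names, not part of the article's code):

```python
MINI_CRITIC_TEMPLATE = """Original task: {task}

Code to review:
{code}

Return JSON:
{{
  "score": <1-10>,
  "issues": ["specific issue 1", ...]
}}
"""


def fill_template(template: str, task: str, code: str) -> str:
    """Substitute the task and code into a str.format-style critic template."""
    return template.format(task=task, code=code)


prompt = fill_template(
    MINI_CRITIC_TEMPLATE,
    task="Return an empty mapping.",
    code="def empty():\n    return {}",
)
print(prompt)
```

The doubled braces come out as single literal braces in the prompt, and braces inside the submitted code are left alone because `str.format` does not re-parse substituted values.
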
Tracking Improvement

Log score progression to measure the value of self-reflection in your system:

Python

def analyze_improvement(history: list[dict]) -> dict:
    """Compute improvement statistics from a reflection run."""
    scores = [h["score"] for h in history]
    return {
        "initial_score": scores[0],
        "final_score": scores[-1],
        "improvement": scores[-1] - scores[0],
        "iterations_needed": len(scores),
        "score_progression": scores,
    }

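Track this across many runs. If self-reflection consistently improves scores by 2 or more points, the latency cost is probably worth it. If it rarely changes scores, your generator is already strong enough and you can skip the critic loop. Keep in mind that these scores come from the critic itself, so spot-check a sample of runs by hand before trusting the trend.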
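
One way to apply that rule of thumb is to aggregate the first-to-last score gain over a batch of runs. The sketch below is illustrative: `summarize_runs` is a hypothetical helper, it takes each run's score progression as a plain list of integers, and the 2-point default simply mirrors the figure mentioned above.

```python
from statistics import mean


def summarize_runs(score_histories: list[list[int]], min_gain: int = 2) -> dict:
    """Aggregate first-to-last score gains across many reflection runs."""
    if not score_histories or any(not scores for scores in score_histories):
        raise ValueError("need at least one run, and every run needs a score")
    gains = [scores[-1] - scores[0] for scores in score_histories]
    return {
        "runs": len(gains),
        "mean_gain": mean(gains),
        # Share of runs where reflection bought at least `min_gain` points.
        "share_with_min_gain": sum(g >= min_gain for g in gains) / len(gains),
        # Share of runs where the final score ended below the initial one.
        "share_regressed": sum(g < 0 for g in gains) / len(gains),
    }


# Made-up score progressions for four runs, purely for illustration.
example = [[5, 7, 8], [8], [6, 6, 5], [4, 8]]
print(summarize_runs(example))
```

A high share of regressed runs is a sign that the refiner is undoing good work and that the critic prompt or the stopping rule needs another look.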

Summary

Self-Reflection adds a critic pass after generation to catch errors before the answer reaches the user
The critic prompt should specify exact domain-appropriate criteria — vague critiques produce vague fixes
The loop stops when the score meets a quality threshold OR max iterations is reached
Use cheaper models for critique and refinement — only generation needs the strongest model
Track score progression to measure whether self-reflection is actually helping
This pattern is essential for medical, legal, and financial answers and for code that will run in production

Next: Tool Use in Agentic Systems — how agents interact with the world through structured tool registries.
