Learnixo

AI Safety & Guardrails · Lesson 4 of 15

Mitigation: Grounding, Self-Check, Citation

Overview

Hallucination mitigation operates at four layers:

  1. Prompt-level: steer the model toward accurate, uncertainty-aware responses
  2. Retrieval-augmented: ground answers in source documents with citations
  3. Post-processing: run the output through a fact checker after generation
  4. Confidence scoring: attach a confidence estimate to each answer

These layers are complementary. High-stakes production systems typically combine all four; lower-stakes systems can use a subset.


Layer 1: Prompt Techniques

Chain-of-Thought (CoT)

Asking the model to reason step by step before giving a final answer can reduce logical hallucinations. The model's "scratchpad" exposes its reasoning, which makes errors more visible and easier to verify.

Python
from anthropic import Anthropic

client = Anthropic()

def ask_with_cot(question: str) -> dict:
    """
    Structured chain-of-thought prompt that forces explicit reasoning
    before committing to an answer.
    """
    cot_prompt = f"""
Answer the following question using this exact structure:

QUESTION: {question}

STEP 1 - Identify what is being asked:
[Write one sentence about what the question requires]

STEP 2 - List what I know with confidence:
[Bullet each fact you are confident about, with a note if you are uncertain]

STEP 3 - Identify what I am uncertain about:
[Explicitly list any gaps in your knowledge]

STEP 4 - Reason through the answer:
[Work through the logic step by step]

STEP 5 - Final answer:
[State your answer clearly, qualifying anything uncertain]

CONFIDENCE: [high / medium / low] — [one sentence explaining why]
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=800,
        messages=[{"role": "user", "content": cot_prompt}]
    )
    
    return {
        "question": question,
        "cot_response": response.content[0].text,
        "technique": "chain-of-thought"
    }

result = ask_with_cot("What is the recommended daily protein intake for adults?")
print(result["cot_response"])
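
A structured response is only useful downstream if the final answer and the confidence line can be pulled out of it. The following is a minimal sketch of a parser for the STEP / CONFIDENCE template above. The helper name `parse_cot_response` is hypothetical, and the sketch assumes the model follows the template; when it does not, both fields come back as `None`.

```python
import re

def parse_cot_response(text: str) -> dict:
    """Extract the final answer and self-reported confidence from a response
    that follows the STEP 1..5 / CONFIDENCE template.

    Hypothetical helper: the model may not follow the template exactly,
    in which case the corresponding field is None.
    """
    # Everything after the "STEP 5 ...:" header, up to the CONFIDENCE line
    answer_match = re.search(
        r"STEP 5[^\n:]*:\s*(.*?)(?=\n\s*CONFIDENCE:|\Z)", text, flags=re.DOTALL
    )
    # The first of high / medium / low after "CONFIDENCE:"
    confidence_match = re.search(
        r"CONFIDENCE:\s*\[?\s*(high|medium|low)\b", text, flags=re.IGNORECASE
    )
    return {
        "final_answer": answer_match.group(1).strip() if answer_match else None,
        "confidence": confidence_match.group(1).lower() if confidence_match else None,
    }

# Hand-written sample response, not real model output
sample = """STEP 4 - Reason through the answer:
The release date is well documented.

STEP 5 - Final answer:
Python 3.0 was released in December 2008.

CONFIDENCE: high - the release date is widely documented
"""
parsed = parse_cot_response(sample)
print(parsed["final_answer"])
print(parsed["confidence"])
```

Parsing the final answer separately also lets later checks run on the answer alone instead of the whole scratchpad.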

Self-Consistency

Run the same prompt multiple times with non-zero temperature. If the model gives consistent answers, confidence is higher. If answers vary, flag for review.

Python
import anthropic
from collections import Counter

client = anthropic.Anthropic()

def self_consistency_check(
    question: str,
    num_samples: int = 5,
    temperature: float = 0.7
) -> dict:
    """
    Generate multiple answers at non-zero temperature.
    Majority vote on the final answer increases reliability.
    """
    answers = []
    
    for _ in range(num_samples):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"Answer in one sentence: {question}"
            }]
        )
        answers.append(response.content[0].text.strip())
    
    # Count frequency of each unique answer (simplified: a real implementation
    # would use semantic clustering, not exact string matching)
    frequency = Counter(answers)
    most_common = frequency.most_common(1)[0]
    agreement_rate = most_common[1] / num_samples
    
    return {
        "question": question,
        "all_answers": answers,
        "consensus_answer": most_common[0],
        "agreement_rate": agreement_rate,
        "confidence": "high" if agreement_rate >= 0.8 else (
            "medium" if agreement_rate >= 0.6 else "low"
        ),
        "flag_for_review": agreement_rate < 0.6
    }

result = self_consistency_check("In what year did Python 3.0 release?", num_samples=5)
print(f"Consensus: {result['consensus_answer']}")
print(f"Agreement: {result['agreement_rate']:.0%}")
print(f"Flag for review: {result['flag_for_review']}")
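
Exact string matching rarely finds agreement between free-text sentences, even when they state the same fact. The sketch below shows one lightweight alternative to full semantic clustering: normalize each answer to a comparable key before voting. The helpers `normalize_answer` and `vote` are hypothetical, and the year-extraction rule is an assumption that only suits questions whose answer is a year.

```python
import re
from collections import Counter

def normalize_answer(answer: str) -> str:
    """Reduce a free-text answer to a comparable key.

    Assumption: if the text contains a four-digit year (19xx or 20xx),
    that year is the answer. Otherwise fall back to lowercased text
    with punctuation removed.
    """
    year = re.search(r"\b(?:19|20)\d{2}\b", answer)
    if year:
        return year.group(0)
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def vote(answers: list[str]) -> tuple[str, float]:
    """Majority vote over normalized answers; returns (key, agreement rate)."""
    keys = [normalize_answer(a) for a in answers]
    key, count = Counter(keys).most_common(1)[0]
    return key, count / len(answers)

# Hand-written samples standing in for five model responses
samples = [
    "Python 3.0 was released in 2008.",
    "It was released on December 3, 2008.",
    "Python 3.0 came out in 2008",
    "The release year was 2009.",
    "2008",
]
consensus, agreement = vote(samples)
print(consensus, f"{agreement:.0%}")  # 2008 80%
```

With exact matching, these five samples would all count as different answers and the question would be flagged for review.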

Explicit Uncertainty Instructions

Instructing the model to mark uncertain claims explicitly can reduce confident-sounding hallucinations. The markers are self-reported, so treat them as a signal for review, not as proof of accuracy.

Python
UNCERTAINTY_SYSTEM_PROMPT = """
You are a helpful assistant with one strict rule:

When you are uncertain about a fact, you MUST use one of these markers:
- [UNCERTAIN]: I believe this is correct but am not fully confident
- [VERIFY]: Please verify this claim from a primary source
- [UNKNOWN]: I do not have reliable information about this
- [ESTIMATED]: This is an approximation, not a precise figure

Never state something you are uncertain about without the appropriate marker.
If a question is outside your knowledge, say so directly rather than guessing.
"""

def ask_with_uncertainty(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=UNCERTAINTY_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

answer = ask_with_uncertainty("What is the population of Oslo, Norway?")
print(answer)
# "Oslo has a population of approximately 700,000 people [ESTIMATED — 
#  this figure is from my training data and may not reflect current numbers.
#  Please verify with Statistics Norway (ssb.no) for the current figure.]"
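
The markers become actionable once downstream code detects them. Below is a minimal sketch that scans an answer for the four markers defined in the system prompt. The function name `find_uncertainty_markers` is hypothetical, and the routing rule (send `[VERIFY]` and `[UNKNOWN]` to review) is an assumption for illustration, not something the prompt enforces.

```python
import re

MARKERS = ("UNCERTAIN", "VERIFY", "UNKNOWN", "ESTIMATED")
# Matches "[ESTIMATED]" as well as "[ESTIMATED - explanation ...]"
MARKER_PATTERN = re.compile(r"\[(" + "|".join(MARKERS) + r")\b")

def find_uncertainty_markers(answer: str) -> dict:
    """Return the markers present in an answer and a simple review flag.

    Assumed policy (hypothetical): VERIFY and UNKNOWN trigger review,
    UNCERTAIN and ESTIMATED are passed through with the marker visible.
    """
    found = MARKER_PATTERN.findall(answer)
    return {
        "markers": sorted(set(found)),
        "needs_review": any(m in ("VERIFY", "UNKNOWN") for m in found),
    }

# Hand-written sample answer, not real model output
report = find_uncertainty_markers(
    "Oslo has roughly 700,000 residents [ESTIMATED]. "
    "[VERIFY] the current figure with Statistics Norway."
)
print(report)
```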

Layer 2: Retrieval-Augmented Techniques

Always Cite Sources

A strong retrieval constraint is to require the model to quote the exact passage it used.

Python
def rag_with_mandatory_quotes(question: str, retrieved_chunks: list[dict]) -> dict:
    """
    Forces the model to include verbatim quotes from source documents.
    Verbatim quotes are verifiable — paraphrases can drift.
    """
    context = "\n\n".join([
        f"[CHUNK {i+1} | Source: {c['source']} | Page: {c.get('page', 'N/A')}]\n{c['text']}"
        for i, c in enumerate(retrieved_chunks)
    ])
    
    system = """
You are a document Q&A assistant. Strict rules:
1. You MUST include at least one verbatim quote from the provided chunks.
   Format quotes as: > "exact text here" [Chunk N, Source: filename]
2. Only answer from the provided document chunks.
3. If chunks don't contain the answer, say: "This is not covered in the provided documents."
4. Do not paraphrase if you can quote directly.
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=system,
        messages=[{
            "role": "user",
            "content": f"Document chunks:\n{context}\n\nQuestion: {question}"
        }]
    )
    
    return {
        "answer": response.content[0].text,
        "sources_provided": [c["source"] for c in retrieved_chunks]
    }
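
Requiring quotes only helps if someone checks them. As a sketch of that check, the function below confirms that each quoted passage in the answer occurs verbatim in one of the retrieved chunks. The name `verify_quotes` is hypothetical, and the sketch assumes the model uses straight double quotes as the system prompt requests. The policy text in the example is made up.

```python
import re

def verify_quotes(answer: str, retrieved_chunks: list[dict]) -> dict:
    """Check that every quoted passage in the answer occurs verbatim in a chunk.

    Hypothetical helper. Whitespace is collapsed on both sides so that line
    breaks in the source do not cause false mismatches.
    """
    def squash(text: str) -> str:
        return " ".join(text.split())

    quotes = re.findall(r'"([^"]+)"', answer)
    chunk_texts = [squash(c["text"]) for c in retrieved_chunks]
    unverified = [q for q in quotes if not any(squash(q) in t for t in chunk_texts)]
    return {
        "quote_count": len(quotes),
        "unverified_quotes": unverified,
        # An answer with no quotes at all also fails the check
        "all_quotes_verified": bool(quotes) and not unverified,
    }

# Made-up chunk and answers for illustration
chunks = [{
    "source": "policy.pdf",
    "text": "Refunds are issued within 14 days\nof the return being received.",
}]
good = '> "Refunds are issued within 14 days of the return being received." [Chunk 1, Source: policy.pdf]'
bad = 'The policy says "Refunds are issued within 30 days".'

print(verify_quotes(good, chunks)["all_quotes_verified"])
print(verify_quotes(bad, chunks)["unverified_quotes"])
```

A failed check is a cheap signal that the model paraphrased or invented a passage while presenting it as a quotation.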

Return Excerpts Alongside Answers

Show users the source excerpt alongside the generated answer so they can verify.

Python
def rag_with_excerpts(question: str, store) -> dict:
    """
    Returns both the answer AND the raw retrieved excerpts.
    Users or downstream systems can verify the answer against the excerpt.
    """
    chunks = store.search(question, top_k=2)
    answer_result = rag_with_mandatory_quotes(question, chunks)
    
    return {
        "answer": answer_result["answer"],
        "supporting_excerpts": [
            {
                "source": c["source"],
                "excerpt": c["text"][:300] + "..." if len(c["text"]) > 300 else c["text"],
                "relevance_score": round(c["score"], 3)
            }
            for c in chunks
        ]
    }
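
The `store` argument above is not defined in this lesson. To make the interface concrete, here is a toy in-memory stand-in that returns chunks in the shape `rag_with_excerpts` expects (`source`, `text`, `score`). The class name `KeywordStore` and its word-overlap scoring are assumptions for illustration; a real system would use a vector store with embedding similarity.

```python
import re

class KeywordStore:
    """Toy stand-in for a vector store (hypothetical).

    Exposes search(question, top_k). Scores are the fraction of question
    words that appear in the document, not embedding similarities.
    """

    def __init__(self, documents: list[dict]):
        # Each document: {"source": str, "text": str}
        self.documents = documents

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search(self, question: str, top_k: int = 2) -> list[dict]:
        query = self._tokens(question)
        scored = []
        for doc in self.documents:
            overlap = len(query & self._tokens(doc["text"]))
            score = overlap / len(query) if query else 0.0
            scored.append({**doc, "score": score})
        # Highest-scoring documents first
        scored.sort(key=lambda d: d["score"], reverse=True)
        return scored[:top_k]

# Made-up documents for illustration
store = KeywordStore([
    {"source": "warranty.txt", "text": "The warranty lasts two years."},
    {"source": "shipping.txt", "text": "Shipping takes five days."},
])
hits = store.search("How long does the warranty last?", top_k=1)
print(hits[0]["source"], round(hits[0]["score"], 3))
```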

Layer 3: Post-Processing — NLI-Based Fact Checking

Natural Language Inference (NLI) classifies the relationship between two pieces of text:

  • Entailment: premise logically implies hypothesis
  • Contradiction: premise contradicts hypothesis
  • Neutral: no clear logical relationship

In a RAG pipeline, use NLI to check whether the model's answer is entailed by the retrieved context. If the model's output contains claims that contradict or are neutral to the context, flag them.

Python
from transformers import pipeline
import re

# Load a cross-encoder NLI model (better than a bi-encoder for this task)
nli_pipeline = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-small",
    device=-1  # CPU; use 0 for GPU
)

def split_into_sentences(text: str) -> list[str]:
    """Basic sentence splitter."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if len(s.strip()) > 10]

def nli_faithfulness_check(
    model_answer: str,
    retrieved_context: str,
    entailment_threshold: float = 0.5
) -> dict:
    """
    Check each sentence in the model's answer against the retrieved context.
    Flag sentences that are not entailed by the context.
    
    Returns a faithfulness report with per-sentence verdicts.
    """
    sentences = split_into_sentences(model_answer)
    results = []
    
    for sentence in sentences:
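        # NLI: does the context (premise) entail this sentence (hypothesis)?
        # Pass the pair as text/text_pair so the tokenizer builds the pair
        # encoding itself and truncation trims the long context rather than
        # cutting off the sentence under test.
        prediction = nli_pipeline(
            [{"text": retrieved_context, "text_pair": sentence}],
            truncation=True,
            max_length=512,
        )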
        
        label = prediction[0]["label"].upper()
        score = prediction[0]["score"]
        
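        # A sentence counts as faithful only when the top label is a
        # sufficiently confident entailment; everything else is flagged.
        if label == "ENTAILMENT" and score >= entailment_threshold:
            verdict = "FAITHFUL"
        elif label == "CONTRADICTION":
            verdict = "CONTRADICTS_CONTEXT"
        else:
            verdict = "NOT_IN_CONTEXT"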
        
        results.append({
            "sentence": sentence,
            "nli_label": label,
            "nli_score": round(score, 3),
            "verdict": verdict
        })
    
    flagged = [r for r in results if r["verdict"] != "FAITHFUL"]
    overall_faithful = len(flagged) == 0
    
    return {
        "overall_faithful": overall_faithful,
        "faithfulness_score": 1.0 - (len(flagged) / max(len(sentences), 1)),
        "flagged_count": len(flagged),
        "sentence_results": results,
        "action": "PASS" if overall_faithful else "REVIEW_FLAGGED_SENTENCES"
    }


def rag_with_nli_guard(
    question: str,
    retrieved_chunks: list[dict],
    faithfulness_threshold: float = 0.8
) -> dict:
    """
    Full pipeline: retrieve → generate → NLI check → return or flag.
    """
    context = " ".join([c["text"] for c in retrieved_chunks])
    
    # Generate answer
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system="Answer only from the provided context.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    answer = response.content[0].text
    
    # NLI check
    faithfulness = nli_faithfulness_check(answer, context)
    
    if faithfulness["faithfulness_score"] < faithfulness_threshold:
        return {
            "answer": answer,
            "safe_to_show": False,
            "faithfulness": faithfulness,
            "recommendation": "Flag for human review before showing to user"
        }
    
    return {
        "answer": answer,
        "safe_to_show": True,
        "faithfulness": faithfulness
    }
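
`rag_with_nli_guard` joins all chunks into one string, and the NLI call truncates its input to 512 tokens, so evidence late in a long context can be cut off before the model sees it. One possible workaround, sketched below, is to score each sentence against each chunk separately and keep the best match. The function `best_entailment` and its injected `entail_score` callable are hypothetical. The `toy_overlap_score` scorer is a word-overlap stand-in so the example runs without a model; it is not an NLI model.

```python
from typing import Callable

def best_entailment(
    sentence: str,
    chunks: list[str],
    entail_score: Callable[[str, str], float],
) -> dict:
    """Score a sentence against each chunk separately and keep the best match.

    entail_score(premise, hypothesis) is a hypothetical callable; in a real
    pipeline it would wrap the NLI model and return the entailment probability.
    """
    scores = [entail_score(chunk, sentence) for chunk in chunks]
    if not scores:
        return {"best_chunk_index": None, "score": 0.0}
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"best_chunk_index": best, "score": scores[best]}

def toy_overlap_score(premise: str, hypothesis: str) -> float:
    """Demonstration only: fraction of hypothesis words found in the premise.
    This is NOT entailment; swap in a real NLI scorer in practice."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)

# Made-up chunks for illustration
chunk_texts = ["aspirin is absorbed quickly", "the warranty lasts two years"]
match = best_entailment("the warranty lasts two years", chunk_texts, toy_overlap_score)
print(match)
```

The trade-off is one NLI call per sentence per chunk, which multiplies latency, and a claim that needs evidence from two chunks at once will not be fully supported by either.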

Layer 4: Confidence Scoring

Attach a confidence estimate to the model's output. Self-reported confidence is a heuristic rather than a calibrated probability, but it still lets downstream systems decide whether to show the answer, request human review, or escalate.

Python
import json

def generate_with_confidence(question: str, context: str) -> dict:
    """
    Ask the model to self-report confidence AND provide supporting evidence,
    then map the reported label to a numeric score for downstream routing.
    """
    system = """
You are a careful analyst. For every answer, respond with this JSON structure:
{
  "answer": "your answer",
  "confidence": "high|medium|low",
  "confidence_reasons": {
    "supporting": ["reason 1", "reason 2"],
    "against": ["uncertainty 1", "uncertainty 2"]
  },
  "would_recommend_verification": true/false,
  "verification_source": "where to verify this (URL, document name, etc.)"
}
Base confidence on: source quality, recency, specificity, and your certainty.
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=system,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        lines = raw.split("\n")
        raw = "\n".join(lines[1:-1])
    
    try:
        result = json.loads(raw)
        # Map confidence to numeric score for downstream use
        confidence_map = {"high": 0.9, "medium": 0.6, "low": 0.3}
        result["confidence_score"] = confidence_map.get(result.get("confidence", "low"), 0.3)
        result["show_to_user"] = result["confidence_score"] >= 0.6
        return result
    except json.JSONDecodeError:
        return {
            "answer": raw,
            "confidence": "unknown",
            "confidence_score": 0.0,
            "show_to_user": False,
            "parse_error": True
        }

# Usage
result = generate_with_confidence(
    question="What is the half-life of aspirin in the human body?",
    context="Aspirin (acetylsalicylic acid) has a short half-life of approximately 15-20 minutes..."
)
print(f"Confidence: {result.get('confidence')} ({result.get('confidence_score')})")
print(f"Show to user: {result.get('show_to_user')}")
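
The numeric scores above (0.9, 0.6, 0.3) are fixed conventions, not measured probabilities. One way to see how well the labels track reality is to compare them with observed accuracy on a labeled evaluation set. The sketch below assumes such a set exists; the function name `accuracy_by_confidence` and the record format are hypothetical, and the records are made up.

```python
from collections import defaultdict

def accuracy_by_confidence(records: list[dict]) -> dict:
    """Observed accuracy per self-reported confidence label.

    Hypothetical record format:
    {"confidence": "high" | "medium" | "low", "correct": bool}
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["confidence"]] += 1
        hits[record["confidence"]] += int(record["correct"])
    return {label: hits[label] / totals[label] for label in totals}

# Made-up evaluation records for illustration
records = [
    {"confidence": "high", "correct": True},
    {"confidence": "high", "correct": True},
    {"confidence": "high", "correct": True},
    {"confidence": "high", "correct": False},
    {"confidence": "low", "correct": True},
    {"confidence": "low", "correct": False},
]
observed = accuracy_by_confidence(records)
print(observed)  # {'high': 0.75, 'low': 0.5}
```

If answers labeled "high" are right only 75% of the time, mapping "high" to 0.9 overstates confidence, and the show-to-user threshold should be set from the observed numbers instead.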

Putting It All Together: A Four-Layer Pipeline

Python
class HallucinationMitigationPipeline:
    """
    Combines all four mitigation layers into a single pipeline.
    Relies on the helpers defined earlier in this lesson: ask_with_cot,
    generate_with_confidence, and nli_faithfulness_check.
    """
    
    def __init__(self, vector_store, use_nli: bool = True):
        self.store = vector_store
        self.use_nli = use_nli
    
    def run(self, question: str) -> dict:
        # Layer 2 (retrieval): fetch context and drop low-relevance chunks
        chunks = self.store.search(question, top_k=3)
        MIN_RELEVANCE = 0.3
        relevant = [c for c in chunks if c["score"] >= MIN_RELEVANCE]
        
        if not relevant:
            return {
                "answer": "Insufficient information in knowledge base.",
                "confidence_score": 0.0,
                "passed_all_checks": False,
                "failure_reason": "no_relevant_context"
            }
        
        context = " ".join([c["text"] for c in relevant])
        
        # Layer 1 (prompting): chain-of-thought draft grounded in the context.
        # This is a separate generation, kept as an audit trail of reasoning.
        cot_response = ask_with_cot(
            f"Using only this context:\n{context}\n\nAnswer: {question}"
        )
        
        # Layer 4 (confidence): structured answer with self-reported confidence
        confidence = generate_with_confidence(question, context)
        answer = confidence.get("answer", "")
        
        # Layer 3 (post-processing): NLI check on the answer that is returned
        if self.use_nli:
            faithfulness = nli_faithfulness_check(answer, context)
            if not faithfulness["overall_faithful"]:
                return {
                    "answer": answer,
                    "confidence_score": 0.2,
                    "passed_all_checks": False,
                    "failure_reason": "nli_faithfulness_check_failed",
                    "faithfulness_report": faithfulness
                }
        
        return {
            "answer": answer,
            "cot_reasoning": cot_response["cot_response"],
            "confidence_score": confidence.get("confidence_score", 0.0),
            "confidence_reasons": confidence.get("confidence_reasons", {}),
            "passed_all_checks": True,
            # Chunks carry "source" at the top level, as in the RAG examples
            "sources": [c["source"] for c in relevant],
            "show_to_user": confidence.get("show_to_user", False)
        }

Summary

| Technique | What It Prevents | When to Use |
|---|---|---|
| Chain-of-thought | Logical hallucinations | Complex reasoning tasks |
| Self-consistency | High-variance factual errors | High-stakes factual Q&A |
| Uncertainty markers | Overconfident wrong claims | Any production Q&A system |
| RAG with citations | Knowledge-gap hallucinations | Domain-specific knowledge systems |
| NLI faithfulness check | Drift from retrieved context | Medical, legal, finance RAG |
| Confidence scoring | Showing uncertain answers to users | Any user-facing AI system |

Each layer adds latency and cost. Choose based on the stakes of the domain.