Self-Consistency: Majority Voting for Reasoning

The Core Idea

Standard chain-of-thought prompting produces one reasoning path and one answer, so a single mistake anywhere in that path can produce a wrong answer.

Self-consistency (Wang et al., 2022) samples N reasoning paths for the same prompt (at temperature > 0 to introduce diversity), then takes a majority vote over the final answers. The intuition: if several separately sampled reasoning chains arrive at the same answer, that answer is more likely to be correct, even if some individual chains contain errors.

Path 1: [reasoning] → Answer: 2.5mg warfarin
Path 2: [reasoning] → Answer: 2.5mg warfarin
Path 3: [reasoning] → Answer: 5.0mg warfarin  ← Minority
Path 4: [reasoning] → Answer: 2.5mg warfarin
Path 5: [reasoning] → Answer: 2.5mg warfarin

Majority vote: 2.5mg warfarin (4 out of 5)
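
The tally above can be reproduced in a few lines with `collections.Counter`. This is a minimal sketch using the five illustrative answers from the example; the variable names are placeholders chosen here, not part of any library.

```python
from collections import Counter

# Final answers from the five illustrative paths above
answers = [
    "2.5mg warfarin",
    "2.5mg warfarin",
    "5.0mg warfarin",
    "2.5mg warfarin",
    "2.5mg warfarin",
]

votes = Counter(answers)
winner, count = votes.most_common(1)[0]  # most frequent answer and its vote count
agreement = count / len(answers)         # share of paths that agree with the winner

print(f"Majority vote: {winner} ({count} out of {len(answers)}, {agreement:.0%})")
```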

Implementation

Python

from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def sample_reasoning_paths(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
    cot_prefix: str = "Let's think step by step.",
) -> list[str]:
    """Generate N diverse chain-of-thought responses."""
    paths = []

    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": f"{question}\n\n{cot_prefix}",
                }
            ],
            temperature=temperature,
        )
        paths.append(response.choices[0].message.content)

    return paths

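def extract_final_answer(reasoning_path: str) -> str:
    """Extract the final answer from a chain-of-thought response."""
    # Capture the text after an answer marker, up to a sentence-ending period
    # or the end of the line. A period between digits (as in "2.5mg") is not
    # treated as a sentence end, so decimal doses survive intact.
    answer_end = r"(?=\.(?:\s|$)|\n|$)"
    patterns = [
        r"(?:the answer is|final answer:|therefore the answer is|so the answer is)\s*(.+?)" + answer_end,
        r"(?:answer:|result:|conclusion:)\s*(.+?)" + answer_end,
    ]

    # Lowercase once so answers that differ only in casing get the same vote
    lower_path = reasoning_path.lower()
    for pattern in patterns:
        match = re.search(pattern, lower_path)
        if match:
            return match.group(1).strip()

    # Fall back: use the last sentence, splitting only on punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", lower_path) if s.strip()]
    return sentences[-1].rstrip(".") if sentences else lower_path.strip()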

def self_consistency(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> dict:
    """Run self-consistency and return the majority answer with vote counts."""
    paths = sample_reasoning_paths(question, n_samples, temperature)

    answers = []
    for path in paths:
        answer = extract_final_answer(path)
        answers.append(answer)

    # Count votes
    vote_counts = Counter(answers)
    most_common = vote_counts.most_common()
    majority_answer = most_common[0][0]
    majority_count = most_common[0][1]

    return {
        "majority_answer": majority_answer,
        "confidence": majority_count / n_samples,
        "vote_distribution": dict(vote_counts),
        "reasoning_paths": paths,
    }

# Example: complex dosing question
question = """
A patient with atrial fibrillation on warfarin 5mg daily has an INR of 1.6 (target 2.0-3.0).
They have eGFR of 45 mL/min. What warfarin dose adjustment would you recommend?
"""

result = self_consistency(question, n_samples=5, temperature=0.7)
print(f"Majority answer: {result['majority_answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Votes: {result['vote_distribution']}")
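
The vote share reported as "confidence" above does not reveal ties or narrow margins: with a 2-2-1 split, `Counter.most_common()` simply returns whichever tied answer it saw first. The sketch below is one way to surface that. The helper name `summarize_votes` and its return keys are illustrative assumptions, not part of the code above.

```python
from collections import Counter


def summarize_votes(answers: list[str]) -> dict:
    """Tally answers and report how decisive the vote was."""
    if not answers:
        raise ValueError("need at least one answer to vote on")

    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    return {
        "majority_answer": top_answer,
        "vote_share": top_count / len(answers),
        # Difference in vote share between first and second place
        "margin": (top_count - runner_up_count) / len(answers),
        # On a tie, the "winner" is only the tied answer that appeared first
        "is_tie": top_count == runner_up_count,
    }


summary = summarize_votes(["6mg", "6mg", "7mg", "7mg", "5mg"])
print(summary)
```

A caller might treat a tie or a small margin as a signal to draw more samples or escalate to a human reviewer rather than accept the top answer.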

When Self-Consistency Helps

Self-consistency is most effective for questions with well-defined correct answers that can be expressed consistently:

Python

# GOOD candidates for self-consistency
good_cases = [
    "Calculate the CrCl for a 70kg 65-year-old woman with creatinine 1.4 mg/dL.",  # Numerical
    "Which CYP enzyme primarily metabolizes warfarin's S-enantiomer?",               # Factual
    "A patient needs a 30% dose reduction of metformin. Starting dose is 2000mg. New dose?",  # Math
]

# POOR candidates for self-consistency (subjective, open-ended)
poor_cases = [
    "Write a patient education leaflet about warfarin.",     # Many valid outputs
    "Explain the history of anticoagulation therapy.",       # No single right answer
    "Summarize this clinical note.",                         # Paraphrase task
]

Empirically, self-consistency provides the most benefit for:

Multi-step arithmetic and calculation
Complex logical reasoning
Factual questions with definitive answers

Aggregating Semantic Equivalents

Simple string matching misses semantically equivalent answers. Use an LLM to cluster similar answers:

Python

def aggregate_with_semantic_clustering(
    question: str,
    answers: list[str],
) -> dict:
    """Cluster semantically equivalent answers and vote."""
    if len(set(answers)) == 1:
        return {"winner": answers[0], "confidence": 1.0}

    # Ask the model to group equivalent answers
    grouping_prompt = f"""Question: {question}

These are candidate answers:
{chr(10).join(f"{i+1}. {ans}" for i, ans in enumerate(answers))}

Group these answers by semantic equivalence: answers that say the same thing should be in the same group.
Return a JSON object with a single "groups" key:
{{
  "groups": [
    {{"group_label": "increase dose by 20%", "answer_numbers": [1, 3, 4]}},
    {{"group_label": "increase dose to 6mg", "answer_numbers": [2, 5]}}
  ]
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": grouping_prompt}],
        # JSON mode returns a JSON object, so the list of groups is wrapped in one
        response_format={"type": "json_object"},
        temperature=0,
    )

    import json
    groups = json.loads(response.choices[0].message.content)["groups"]

    # Find largest group
    largest_group = max(groups, key=lambda g: len(g.get("answer_numbers", [])))
    confidence = len(largest_group["answer_numbers"]) / len(answers)

    return {
        "winner": largest_group["group_label"],
        "confidence": confidence,
        "groups": groups,
    }

# Example
answers = [
    "Increase warfarin to 6mg daily",
    "I recommend raising the warfarin dose to 6 mg each day",
    "Consider increasing to 7mg",
    "The warfarin dose should be increased to 6mg/day",
    "Increase to 6mg daily",
]

result = aggregate_with_semantic_clustering(question, answers)
print(f"Winner: {result['winner']} (confidence: {result['confidence']:.0%})")
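
Before spending an extra model call on clustering, a cheap rule-based pass can merge answers that differ only in wording, casing, or unit spacing. The sketch below assumes each answer contains a single dose such as "6mg" or "6 mg". The names `normalize_dose`, `DOSE_PATTERN`, and `sample_answers` are hypothetical, and anything the pattern cannot parse would still need the LLM-based grouping.

```python
import re
from collections import Counter

# Matches a number followed by a mass unit, e.g. "6mg", "6 mg", "2.5 mg"
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|g)\b", re.IGNORECASE)


def normalize_dose(answer: str) -> str:
    """Reduce a free-text answer to a canonical '<number> <unit>' key."""
    match = DOSE_PATTERN.search(answer)
    if match is None:
        # No recognizable dose: fall back to a case- and whitespace-normalized string
        return " ".join(answer.lower().split())
    value = float(match.group(1))
    unit = match.group(2).lower()
    return f"{value:g} {unit}"  # ":g" drops a trailing ".0", so 6.0 becomes "6"


sample_answers = [
    "Increase warfarin to 6mg daily",
    "I recommend raising the warfarin dose to 6 mg each day",
    "Consider increasing to 7mg",
    "The warfarin dose should be increased to 6mg/day",
    "Increase to 6mg daily",
]

normalized_votes = Counter(normalize_dose(a) for a in sample_answers)
print(normalized_votes.most_common())
```

This pass deliberately ignores everything except the first dose it finds, so it would wrongly merge "6mg daily" with "6mg twice daily". It suits tasks where the dose alone is the answer.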

Self-Consistency with Medical Calculations

Self-consistency is especially useful for multi-step calculations where a single error in one path derails the answer:

Python

def verify_calculation(
    problem: str,
    n_samples: int = 7,  # Odd count avoids ties when only two distinct answers appear
) -> dict:
    """Use self-consistency to verify a clinical calculation."""

    # Use structured prompting for calculation tasks
    prompt = f"""Solve this clinical calculation step by step, showing all work:

{problem}

End with "FINAL ANSWER: [value with units]" on its own line."""

    paths = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5,  # Lower temperature for calculations — less diversity needed
        )
        paths.append(response.choices[0].message.content)

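    # Extract final answers; paths that omit the marker are skipped
    final_answers = []
    for path in paths:
        match = re.search(r"FINAL ANSWER:\s*(.+)", path, re.IGNORECASE)
        if match:
            final_answers.append(match.group(1).strip())

    # Guard against the case where no path followed the output format
    if not final_answers:
        return {"answer": None, "confidence": 0.0, "vote_distribution": {}, "all_paths": paths}

    vote_counts = Counter(final_answers)
    majority_answer, majority_count = vote_counts.most_common(1)[0]
    # Confidence is the vote share among paths that produced a parseable answer
    confidence = majority_count / len(final_answers)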

    return {
        "answer": majority_answer,
        "confidence": confidence,
        "vote_distribution": dict(vote_counts),
        "all_paths": paths,
    }

result = verify_calculation("""
Calculate the creatinine clearance for:
- 68-year-old male
- Weight: 75 kg
- Serum creatinine: 1.4 mg/dL
Using the Cockcroft-Gault formula.
""", n_samples=7)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
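
For formula-based problems like this one, a deterministic implementation gives a reference value to compare against the voted answer. The sketch below implements the standard Cockcroft-Gault equation, CrCl = (140 − age) × weight / (72 × serum creatinine), multiplied by 0.85 for female patients. The function name `cockcroft_gault` and its parameter names are chosen here for illustration.

```python
def cockcroft_gault(
    age_years: float,
    weight_kg: float,
    serum_creatinine_mg_dl: float,
    female: bool = False,
) -> float:
    """Estimate creatinine clearance (mL/min) with the Cockcroft-Gault equation."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")

    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The equation applies a 0.85 correction factor for female patients
    return crcl * 0.85 if female else crcl


# Same patient as the example above: 68-year-old male, 75 kg, creatinine 1.4 mg/dL
reference = cockcroft_gault(age_years=68, weight_kg=75, serum_creatinine_mg_dl=1.4)
print(f"Reference CrCl: {reference:.1f} mL/min")
```

If the majority answer disagrees with the reference value beyond rounding, that is a sign the sampled paths share a systematic error, which voting alone cannot detect.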

Cost vs Accuracy Tradeoff

Python

def self_consistency_cost_analysis(n_samples: int, model: str = "gpt-4o") -> dict:
    """Estimate cost of self-consistency vs single inference."""
    # Rough estimates (check current pricing)
    cost_per_1k_input = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}
    cost_per_1k_output = {"gpt-4o": 0.010, "gpt-4o-mini": 0.0006}

    avg_input_tokens = 500   # Prompt + question
    avg_output_tokens = 300  # Reasoning + answer

    single_cost = (
        (avg_input_tokens / 1000 * cost_per_1k_input[model]) +
        (avg_output_tokens / 1000 * cost_per_1k_output[model])
    )

    return {
        "single_inference_cost": single_cost,
        "self_consistency_cost": single_cost * n_samples,
        "cost_multiplier": n_samples,
        "accuracy_gain_typical": "5-15% on complex reasoning tasks",
    }

cost_analysis = self_consistency_cost_analysis(5, "gpt-4o")
print(cost_analysis)
# With 5 samples, self-consistency costs 5× a single call: reserve it for high-stakes decisions

Rule of thumb: Use self-consistency (5–7 samples) when the cost of a wrong answer significantly exceeds the cost of extra inference, as in clinical calculations, legal reasoning, or financial decisions.

For most conversational use cases, single inference with good chain-of-thought is sufficient.
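
To see why a handful of samples is often enough, consider an idealized model: each path is independently correct with probability p, and every wrong path gives the same wrong answer. The chance that a strict majority of N paths is correct is then a binomial tail sum. This is a rough illustration under those stated assumptions, not a prediction: samples drawn from one model are correlated, so real gains are usually smaller. The function name `majority_correct_probability` is made up for this sketch.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """P(strict majority of n independent votes is correct) in a two-answer model."""
    threshold = n // 2 + 1  # smallest number of correct votes that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


# Cost grows linearly with N, while the modeled accuracy gain flattens out
for n in (1, 3, 5, 7, 15):
    print(f"N={n:>2}: cost x{n}, P(majority correct) = {majority_correct_probability(0.7, n):.3f}")
```

In this model the gain per extra sample shrinks as N grows, and when p is below 0.5 voting makes the result worse, so self-consistency amplifies a model that is usually right rather than rescuing one that is usually wrong.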
