Self-Consistency: Majority Voting for Reasoning
Sample multiple reasoning paths and select the most consistent answer. Self-consistency improves accuracy on complex reasoning tasks without requiring human labels.
The Core Idea
Chain-of-thought prompting generates one reasoning path and one answer. The answer may be wrong if the single reasoning path makes a mistake.
Self-consistency (Wang et al., 2022) generates N independent reasoning paths (at temperature > 0 to introduce diversity), then takes a majority vote on the final answers. The intuition: if multiple independent reasoning chains all arrive at the same answer, that answer is more likely correct, even if some individual chains contain errors.
Path 1: [reasoning] → Answer: 2.5mg warfarin
Path 2: [reasoning] → Answer: 2.5mg warfarin
Path 3: [reasoning] → Answer: 5.0mg warfarin ← Minority
Path 4: [reasoning] → Answer: 2.5mg warfarin
Path 5: [reasoning] → Answer: 2.5mg warfarin
Majority vote: 2.5mg warfarin (4 out of 5)
Implementation
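The voting step on its own needs no API call. As a minimal sketch, assuming the final answers have already been extracted as strings, the helper below counts votes for the five illustrative paths. The name `majority_vote` is hypothetical and not part of the code that follows. It also reports ties, which a plain `most_common()` lookup would resolve silently in favour of whichever answer appeared first.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> dict:
    """Return the most frequent answer, its vote share, and whether it was tied."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    ranked = counts.most_common()
    top_answer, top_votes = ranked[0]
    # A tie means at least one other answer received the same number of votes.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    return {
        "answer": top_answer,
        "share": top_votes / len(answers),
        "tied": tied,
    }


# The five illustrative paths: four agree, one dissents.
votes = ["2.5mg warfarin"] * 2 + ["5.0mg warfarin"] + ["2.5mg warfarin"] * 2
vote_result = majority_vote(votes)
print(vote_result)
```

A tied result is a reasonable signal to draw more samples or escalate rather than accept the first-listed answer.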
from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def sample_reasoning_paths(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
    cot_prefix: str = "Let's think step by step.",
) -> list[str]:
    """Generate N diverse chain-of-thought responses."""
    paths = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": f"{question}\n\n{cot_prefix}",
                }
            ],
            temperature=temperature,
        )
        paths.append(response.choices[0].message.content)
    return paths

def extract_final_answer(reasoning_path: str) -> str:
    """Extract the final answer from a chain-of-thought response."""
    # Look for explicit "the answer is" / "answer:" style markers. The capture
    # stops at a newline or at a period followed by whitespace or end of text,
    # so decimals such as "2.5mg" are not cut off at the decimal point.
    patterns = [
        r"(?:the answer is|final answer:|therefore the answer is|so the answer is)\s*(.+?)(?=\.(?:\s|$)|\n|$)",
        r"(?:answer:|result:|conclusion:)\s*(.+?)(?=\.(?:\s|$)|\n|$)",
    ]
    lower_path = reasoning_path.lower()
    for pattern in patterns:
        match = re.search(pattern, lower_path)
        if match:
            return match.group(1).strip()
    # Fall back: use the last sentence, splitting only on sentence-ending
    # punctuation followed by whitespace so decimals stay intact
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", reasoning_path) if s.strip()]
    return sentences[-1] if sentences else reasoning_path.strip()

def self_consistency(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> dict:
    """Run self-consistency and return the majority answer with vote counts."""
    paths = sample_reasoning_paths(question, n_samples, temperature)
    answers = []
    for path in paths:
        answer = extract_final_answer(path)
        answers.append(answer)
    # Count votes
    vote_counts = Counter(answers)
    most_common = vote_counts.most_common()
    majority_answer = most_common[0][0]
    majority_count = most_common[0][1]
    return {
        "majority_answer": majority_answer,
        "confidence": majority_count / n_samples,
        "vote_distribution": dict(vote_counts),
        "reasoning_paths": paths,
    }

# Example: complex dosing question
question = """
A patient with atrial fibrillation on warfarin 5mg daily has an INR of 1.6 (target 2.0-3.0).
They have eGFR of 45 mL/min. What warfarin dose adjustment would you recommend?
"""
result = self_consistency(question, n_samples=5, temperature=0.7)
print(f"Majority answer: {result['majority_answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Votes: {result['vote_distribution']}")
When Self-Consistency Helps
Self-consistency is most effective for questions with well-defined correct answers that can be expressed consistently:
# GOOD candidates for self-consistency
good_cases = [
    "Calculate the CrCl for a 70kg 65-year-old woman with creatinine 1.4 mg/dL.",  # Numerical
    "Which CYP enzyme primarily metabolizes warfarin's S-enantiomer?",  # Factual
    "A patient needs a 30% dose reduction of metformin. Starting dose is 2000mg. New dose?",  # Math
]
# POOR candidates for self-consistency (subjective, open-ended)
poor_cases = [
    "Write a patient education leaflet about warfarin.",  # Many valid outputs
    "Explain the history of anticoagulation therapy.",  # No single right answer
    "Summarize this clinical note.",  # Paraphrase task
]
Empirically, self-consistency provides the most benefit for:
- Multi-step arithmetic and calculation
- Complex logical reasoning
- Factual questions with definitive answers
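Voting on these task types only works if equivalent answers are written identically. As a rough sketch, a cheap normalization pass before counting can merge trivial surface variants such as spacing, case, and a trailing period. The function name `normalize_answer` and the short unit list are illustrative assumptions, not an exhaustive rule set.

```python
import re
from collections import Counter


def normalize_answer(answer: str) -> str:
    """Canonicalise surface variations so equivalent strings vote together."""
    text = answer.strip().lower()
    text = text.rstrip(".")                # drop a trailing period
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    # Remove the space between a number and a unit: "2.5 mg" -> "2.5mg".
    # The unit list is an assumption; extend it for your own domain.
    text = re.sub(r"(\d)\s+(mg|mcg|g|ml|kg)\b", r"\1\2", text)
    return text


raw_answers = ["2.5mg warfarin", "2.5 mg warfarin.", "2.5 MG  Warfarin", "5mg warfarin"]
normalized_votes = Counter(normalize_answer(a) for a in raw_answers)
print(normalized_votes.most_common())
```

Without normalization, the first three strings would each receive one vote. With it, they are counted as a single answer with three votes.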
Aggregating Semantic Equivalents
Simple string matching misses semantically equivalent answers. Use an LLM to cluster similar answers:
import json

def aggregate_with_semantic_clustering(
    question: str,
    answers: list[str],
) -> dict:
    """Cluster semantically equivalent answers and vote."""
    if len(set(answers)) == 1:
        return {"winner": answers[0], "confidence": 1.0}
    # Ask the model to group equivalent answers
    numbered_answers = "\n".join(f"{i+1}. {ans}" for i, ans in enumerate(answers))
    grouping_prompt = f"""Question: {question}
These are candidate answers:
{numbered_answers}
Group these answers by semantic equivalence: answers that say the same thing should be in the same group.
Return a JSON object with a "groups" key holding the list of groups:
{{"groups": [
{{"group_label": "increase dose by 20%", "answer_numbers": [1, 3, 4]}},
{{"group_label": "increase dose to 6mg", "answer_numbers": [2, 5]}}
]}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": grouping_prompt}],
        # JSON mode is designed to return a top-level object, so the list of
        # groups is wrapped in a "groups" key rather than returned bare
        response_format={"type": "json_object"},
        temperature=0,
    )
    groups = json.loads(response.choices[0].message.content)["groups"]
    # Find largest group
    largest_group = max(groups, key=lambda g: len(g.get("answer_numbers", [])))
    confidence = len(largest_group["answer_numbers"]) / len(answers)
    return {
        "winner": largest_group["group_label"],
        "confidence": confidence,
        "groups": groups,
    }

# Example
answers = [
"Increase warfarin to 6mg daily",
"I recommend raising the warfarin dose to 6 mg each day",
"Consider increasing to 7mg",
"The warfarin dose should be increased to 6mg/day",
"Increase to 6mg daily",
]
result = aggregate_with_semantic_clustering(question, answers)
print(f"Winner: {result['winner']} (confidence: {result['confidence']:.0%})")
Self-Consistency with Medical Calculations
Self-consistency is especially useful for multi-step calculations where a single error in one path derails the answer:
def verify_calculation(
    problem: str,
    n_samples: int = 7,  # Odd number avoids a two-way tie
) -> dict:
    """Use self-consistency to verify a clinical calculation."""
    # Use structured prompting for calculation tasks
    prompt = f"""Solve this clinical calculation step by step, showing all work:
{problem}
End with "FINAL ANSWER: [value with units]" on its own line."""
    paths = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5,  # Lower temperature for calculations: less diversity needed
        )
        paths.append(response.choices[0].message.content)
    # Extract final answers
    final_answers = []
    for path in paths:
        match = re.search(r"FINAL ANSWER:\s*(.+)", path)
        if match:
            final_answers.append(match.group(1).strip())
    if not final_answers:
        # No path followed the "FINAL ANSWER:" format, so there is nothing to vote on
        return {"answer": None, "confidence": 0.0, "vote_distribution": {}, "all_paths": paths}
    vote_counts = Counter(final_answers)
    majority_answer, majority_count = vote_counts.most_common(1)[0]
    # Confidence is the share of parseable answers, not of all samples
    confidence = majority_count / len(final_answers)
    return {
        "answer": majority_answer,
        "confidence": confidence,
        "vote_distribution": dict(vote_counts),
        "all_paths": paths,
    }

result = verify_calculation("""
Calculate the creatinine clearance for:
- 68-year-old male
- Weight: 75 kg
- Serum creatinine: 1.4 mg/dL
Using the Cockcroft-Gault formula.
""", n_samples=7)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
Cost vs Accuracy Tradeoff
def self_consistency_cost_analysis(n_samples: int, model: str = "gpt-4o") -> dict:
    """Estimate cost of self-consistency vs single inference."""
    # Rough estimates (check current pricing)
    cost_per_1k_input = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}
    cost_per_1k_output = {"gpt-4o": 0.010, "gpt-4o-mini": 0.0006}
    avg_input_tokens = 500  # Prompt + question
    avg_output_tokens = 300  # Reasoning + answer
    single_cost = (
        (avg_input_tokens / 1000 * cost_per_1k_input[model]) +
        (avg_output_tokens / 1000 * cost_per_1k_output[model])
    )
    return {
        "single_inference_cost": single_cost,
        "self_consistency_cost": single_cost * n_samples,
        "cost_multiplier": n_samples,
        "accuracy_gain_typical": "5-15% on complex reasoning tasks",
    }

print(cost_analysis := self_consistency_cost_analysis(5, "gpt-4o"))
# Self-consistency with 5 samples costs 5× as much as a single call; use judiciously for high-stakes decisions
Rule of thumb: Use self-consistency (5-7 samples) when the cost of a wrong answer significantly exceeds the cost of extra inference: clinical calculations, legal reasoning, financial decisions.
For most conversational use cases, single inference with good chain-of-thought is sufficient.
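To make the multiplier concrete, the sketch below reuses the rough gpt-4o prices and token counts from the estimate above. Those figures are placeholders (check current pricing), and `cost_for_samples` is a hypothetical helper name. It prints the estimated per-question cost for a few sample counts.

```python
# Placeholder figures copied from the rough estimate above; check current pricing.
INPUT_PRICE_PER_1K = 0.0025   # USD per 1k input tokens (gpt-4o estimate)
OUTPUT_PRICE_PER_1K = 0.010   # USD per 1k output tokens (gpt-4o estimate)
AVG_INPUT_TOKENS = 500        # Prompt + question
AVG_OUTPUT_TOKENS = 300       # Reasoning + answer


def cost_for_samples(n_samples: int) -> float:
    """Estimated USD cost of answering one question with n_samples paths."""
    single = (
        AVG_INPUT_TOKENS / 1000 * INPUT_PRICE_PER_1K
        + AVG_OUTPUT_TOKENS / 1000 * OUTPUT_PRICE_PER_1K
    )
    return single * n_samples


for n in (1, 3, 5, 7):
    print(f"{n} sample(s): ${cost_for_samples(n):.5f} per question")
```

Under these assumptions a single call costs roughly $0.004 per question and five samples roughly $0.021. That is small for an occasional high-stakes calculation but adds up quickly at conversational volume.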