Prompt Engineering Mastery · Lesson 7 of 24
Tree-of-Thought for Complex Reasoning
Why Tree of Thought?
Standard prompting generates one reasoning path and commits to it. Chain-of-thought improves reasoning by making that path explicit — but it still follows a single trajectory. When the first reasoning step is wrong, chain-of-thought confidently produces a wrong answer.
Tree of Thought (ToT) explores multiple reasoning paths in parallel and evaluates which branches look most promising. It's particularly effective for:
- Problems with multiple valid intermediate steps
- Tasks where backtracking is needed
- Creative problems requiring exploration before commitment
Basic ToT Structure
Problem
├── Path A: Start with mechanism
│   ├── A1: Focus on enzyme inhibition → Promising (continue)
│   └── A2: Focus on pharmacokinetics → Dead end (prune)
├── Path B: Start with clinical effect
│   ├── B1: Bleeding risk → Promising (continue)
│   └── B2: Drug interactions → Promising (continue)
└── Path C: Start with patient factors
    └── C1: Renal function → Less relevant for this question (prune)

Implementation: Manual ToT Orchestration
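Before any LLM calls are involved, it can help to see the tree above as plain data. The following is a minimal sketch, not part of the lesson's implementation: `ThoughtNode` and `open_paths` are hypothetical names, and the `promising` flag stands in for the continue/prune labels in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class ThoughtNode:
    """One node in a thought tree (hypothetical helper for illustration)."""
    label: str
    promising: bool = True  # False means the branch has been pruned
    children: list["ThoughtNode"] = field(default_factory=list)


def open_paths(node: ThoughtNode, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Return every root-to-leaf path that passes only through promising nodes."""
    if not node.promising:
        return []
    path = prefix + (node.label,)
    if not node.children:
        return [path]
    paths: list[tuple[str, ...]] = []
    for child in node.children:
        paths.extend(open_paths(child, path))
    return paths


# The same tree as in the diagram, with pruned branches marked
tree = ThoughtNode("Problem", children=[
    ThoughtNode("A: mechanism", children=[
        ThoughtNode("A1: enzyme inhibition"),
        ThoughtNode("A2: pharmacokinetics", promising=False),
    ]),
    ThoughtNode("B: clinical effect", children=[
        ThoughtNode("B1: bleeding risk"),
        ThoughtNode("B2: drug interactions"),
    ]),
    ThoughtNode("C: patient factors", children=[
        ThoughtNode("C1: renal function", promising=False),
    ]),
])

for path in open_paths(tree):
    print(" -> ".join(path))
```

Three paths survive here (A1, B1, B2). Path C disappears entirely because its only child was pruned.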
import re

from openai import OpenAI

client = OpenAI()


def generate_thoughts(
    problem: str,
    n_thoughts: int = 3,
    previous_thoughts: str = "",
) -> list[str]:
    """Generate N candidate next reasoning steps."""
    context = f"Problem: {problem}"
    if previous_thoughts:
        context += f"\n\nReasoning so far:\n{previous_thoughts}"

    prompt = f"""{context}
Generate {n_thoughts} different approaches or next steps for solving this problem.
Each approach should explore a different direction.
Number them 1 through {n_thoughts}.
Be concise: one paragraph per approach."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # Higher temperature for diverse thoughts
    )
    raw = response.choices[0].message.content or ""

    # Parse the numbered list. Matching markers only at the start of a line
    # avoids splitting on numbers inside the text (e.g. "INR 2.4" or "10.").
    markers = list(re.finditer(r"^[ \t]*\d+[.)][ \t]+", raw, flags=re.MULTILINE))
    thoughts = []
    for idx, marker in enumerate(markers):
        end = markers[idx + 1].start() if idx + 1 < len(markers) else len(raw)
        thoughts.append(raw[marker.end():end].strip())
    return thoughts[:n_thoughts]


def evaluate_thought(
    problem: str,
    thought_path: str,
) -> tuple[float, str]:
    """Score a reasoning path from 0-10 and explain why."""
    prompt = f"""Problem: {problem}
Reasoning path so far:
{thought_path}
Evaluate this reasoning approach:
1. Is it on track to solve the problem? (0-10)
2. Are there logical errors?
3. What's missing?
Respond with:
SCORE: [0-10]
ASSESSMENT: [one paragraph]"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    content = response.choices[0].message.content or ""

    # Tolerate replies such as "SCORE: 8/10" or "**SCORE:** 7.5". Fall back to
    # 0.0 when no score is found so a malformed reply cannot crash the search.
    match = re.search(r"SCORE:\W*(\d+(?:\.\d+)?)", content)
    score = float(match.group(1)) if match else 0.0
    assessment = content.split("ASSESSMENT:")[-1].strip()
    return score, assessment


def tree_of_thought(
    problem: str,
    depth: int = 2,
    branching_factor: int = 3,
    beam_width: int = 2,
) -> str:
    """
    BFS-style Tree of Thought with beam search.
    Keeps top beam_width paths at each level.
    """
    # Initialize with a single empty thought path
    current_beams = [{"path": "", "score": 5.0}]

    for level in range(depth):
        print(f"\n--- Level {level + 1} ---")
        all_candidates = []

        for beam in current_beams:
            # Generate new thoughts branching from this beam
            new_thoughts = generate_thoughts(
                problem,
                n_thoughts=branching_factor,
                previous_thoughts=beam["path"],
            )
            for thought in new_thoughts:
                # Label every step, including the first, so the evaluator
                # always sees a consistently formatted path
                step = f"[Step {level + 1}] {thought}"
                new_path = f"{beam['path']}\n{step}" if beam["path"] else step
                score, assessment = evaluate_thought(problem, new_path)
                print(f"Score {score:.1f}: {thought[:80]}...")
                all_candidates.append({
                    "path": new_path,
                    "score": score,
                    "assessment": assessment,
                })

        # If no thoughts could be parsed at this level, keep the previous beams
        if not all_candidates:
            break

        # Keep top beam_width candidates
        all_candidates.sort(key=lambda x: x["score"], reverse=True)
        current_beams = all_candidates[:beam_width]

    # Generate final answer from best path
    best_path = current_beams[0]["path"]
    final_prompt = f"""Problem: {problem}
After exploring multiple reasoning paths, the most promising approach is:
{best_path}
Based on this reasoning, provide a clear, complete final answer."""

    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0,
    )
    return final_response.choices[0].message.content


# Example
problem = """
A 68-year-old patient on warfarin starts a 10-day course of clarithromycin for pneumonia.
Their last INR was 2.4 (target range 2.0-3.0). How should their anticoagulation be managed?
"""
answer = tree_of_thought(problem, depth=2, branching_factor=3, beam_width=2)
print("\n=== FINAL ANSWER ===")
print(answer)

Simplified ToT: Single-Round Multi-Path
For simpler cases, prompt the model to generate and evaluate its own paths in one call:
def simple_tot_prompt(problem: str) -> str:
    """Single-call ToT: model explores paths and selects best."""
    return f"""Problem: {problem}
Think through this step-by-step using the following process:
1. Generate three different approaches:
Approach A: [describe a different way to tackle this]
Approach B: [describe another angle]
Approach C: [describe a third perspective]
2. Evaluate each approach:
Approach A: Score [1-10], because [reason]
Approach B: Score [1-10], because [reason]
Approach C: Score [1-10], because [reason]
3. Select the best approach and develop a full answer:
Best approach: [letter]
Full answer: [complete answer using the chosen approach]"""


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": simple_tot_prompt(problem)}],
    temperature=0.3,
)
print(response.choices[0].message.content)

When ToT Outperforms Chain-of-Thought
| Task type | CoT | ToT |
|---|---|---|
| Arithmetic | Excellent | Overkill |
| Single-path logic | Good | Overkill |
| Multi-step clinical reasoning | Good | Better |
| Drug interaction analysis | Good | Better |
| Creative problem solving | Adequate | Better |
| Puzzle solving (24 game) | Poor | Good |
| Treatment planning | Adequate | Better |
ToT significantly outperforms CoT on tasks where:
- Multiple valid intermediate approaches exist
- The best approach isn't obvious upfront
- Mistakes early in reasoning compound, as in medical decision-making
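Cost consideration: the beam-search implementation above, run with 3 branches, 2 levels, and a beam width of 2, makes 13 LLM calls (3 for generation, 9 for evaluation, 1 for the final answer) versus 1 for CoT. Use it selectively for high-stakes, complex problems.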
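The exact call count depends on how the search is orchestrated. As a rough sketch, assume one generation call per surviving path per level, one evaluation call per candidate thought, and one final synthesis call, which is the pattern used by the beam-search loop earlier in this lesson. Under those assumptions the total can be estimated as follows. `count_llm_calls` is a hypothetical helper introduced here for illustration.

```python
def count_llm_calls(depth: int, branching_factor: int, beam_width: int) -> dict[str, int]:
    """Estimate API calls for a beam-search ToT loop.

    Assumes one generation call per beam per level, one evaluation call per
    candidate thought, and one final synthesis call.
    """
    beams = 1  # the search starts from a single empty path
    generation = 0
    evaluation = 0
    for _ in range(depth):
        generation += beams                    # one generate call per surviving beam
        candidates = beams * branching_factor  # assumes every call yields all thoughts
        evaluation += candidates               # one evaluate call per candidate
        beams = min(beam_width, candidates)    # pruning keeps at most beam_width paths
    return {
        "generation": generation,
        "evaluation": evaluation,
        "final": 1,
        "total": generation + evaluation + 1,
    }


print(count_llm_calls(depth=2, branching_factor=3, beam_width=2))
# {'generation': 3, 'evaluation': 9, 'final': 1, 'total': 13}
```

Under these assumptions, evaluation calls outnumber generation calls by the branching factor, so widening the tree is where most of the extra cost comes from.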
ToT for Drug Interaction Analysis
def analyze_drug_interactions_tot(patient_medications: list[str], new_drug: str) -> str:
    """Use ToT to systematically analyze drug interactions for complex polypharmacy."""
    problem = f"""
Patient is on the following medications: {', '.join(patient_medications)}
A new drug {new_drug} is being considered.
Analyze all clinically relevant interactions and provide recommendations.
"""
    # Generate three analytical paths:
    # Path 1: Pharmacokinetic interactions (enzyme inhibition/induction, protein binding)
    # Path 2: Pharmacodynamic interactions (additive/synergistic/antagonistic effects)
    # Path 3: Risk stratification by severity and clinical significance
    prompt = f"""{problem}
Analyze this systematically using THREE different lenses:
PHARMACOKINETIC LENS:
Consider metabolism (CYP enzymes), protein binding, renal/hepatic clearance.
What PK interactions exist between {new_drug} and each current medication?
PHARMACODYNAMIC LENS:
Consider mechanisms of action, physiological effects.
Where do mechanisms overlap or antagonize?
CLINICAL RISK LENS:
Considering the patient's medications as a group, rank interactions by severity.
Which interactions require immediate action? Which require monitoring?
SYNTHESIS:
Based on all three analyses, provide:
1. The 2-3 highest priority interactions requiring action
2. Recommended management for each
3. Overall recommendation (proceed/avoid/dose-adjust/monitor)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
    )
    return response.choices[0].message.content

Key Principles
Diverse thought generation: Use temperature > 0.7 when generating branches — you want genuinely different approaches, not variations of the same idea.
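Even at a high temperature, a model can return branches that are near-copies of one another, and evaluating them wastes calls. One possible guard, sketched here as an assumption and not as part of the lesson's implementation, is to filter candidates by surface similarity before scoring them. `drop_near_duplicates` and the 0.8 threshold are hypothetical choices. The comparison uses `difflib.SequenceMatcher` from the standard library, which measures character overlap and not meaning.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(thoughts: list[str], max_similarity: float = 0.8) -> list[str]:
    """Keep a thought only if it is not too similar to one already kept."""
    kept: list[str] = []
    for thought in thoughts:
        is_duplicate = any(
            SequenceMatcher(None, thought.lower(), other.lower()).ratio() > max_similarity
            for other in kept
        )
        if not is_duplicate:
            kept.append(thought)
    return kept


thoughts = [
    "Check CYP3A4 inhibition by clarithromycin.",
    "check CYP3A4 inhibition by clarithromycin.",  # differs from the first only in case
    "Assess bleeding risk from the current INR.",
]
kept = drop_near_duplicates(thoughts)
print(len(kept))  # 2
```

Because this is a purely lexical check, two branches that express the same idea in different words would both survive. Treat it as a cheap first filter at most.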
Honest evaluation: The evaluation prompt must ask the model to critically assess, not just validate. Include "What's missing?" and "Are there errors?" to force critical thinking.
Beam width vs depth: For most practical problems, 2 branches × 2 levels is sufficient. Increasing branching factor beyond 4 adds cost without proportional benefit.
Know when not to use it: ToT multiplies LLM cost, often by 5–10× or more depending on branching factor, depth, and beam width. For simple factual questions or well-structured tasks, standard prompting or CoT is more cost-effective.