Tree of Thought Prompting

Why Tree of Thought?

Standard prompting generates one reasoning path and commits to it. Chain-of-thought improves reasoning by making that path explicit — but it still follows a single trajectory. When the first reasoning step is wrong, chain-of-thought confidently produces a wrong answer.

Tree of Thought (ToT) explores multiple reasoning paths in parallel and evaluates which branches look most promising. It's particularly effective for:

Problems with multiple valid intermediate steps
Tasks where backtracking is needed
Creative problems requiring exploration before commitment

Basic ToT Structure

Problem
├── Path A: Start with mechanism
│   ├── A1: Focus on enzyme inhibition → Promising (continue)
│   └── A2: Focus on pharmacokinetics → Dead end (prune)
├── Path B: Start with clinical effect
│   ├── B1: Bleeding risk → Promising (continue)
│   └── B2: Drug interactions → Promising (continue)
└── Path C: Start with patient factors
    └── C1: Renal function → Less relevant for this question (prune)
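
The diagram above can also be held as a small tree in code. The sketch below is illustrative only: the `ThoughtNode` class, its `pruned` flag, and the `live_leaves` helper are hypothetical names chosen for this example, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class ThoughtNode:
    """One node in the thought tree (hypothetical structure for illustration)."""
    label: str
    pruned: bool = False
    children: list["ThoughtNode"] = field(default_factory=list)


def live_leaves(node: ThoughtNode, prefix: str = "") -> list[str]:
    """Return the paths to every leaf that was not pruned."""
    path = f"{prefix} > {node.label}" if prefix else node.label
    if node.pruned:
        return []
    if not node.children:
        return [path]
    leaves: list[str] = []
    for child in node.children:
        leaves.extend(live_leaves(child, path))
    return leaves


# Mirror of the diagram: A2 and C1 are pruned, the rest continue.
tree = ThoughtNode("Problem", children=[
    ThoughtNode("Path A", children=[
        ThoughtNode("A1: enzyme inhibition"),
        ThoughtNode("A2: pharmacokinetics", pruned=True),
    ]),
    ThoughtNode("Path B", children=[
        ThoughtNode("B1: bleeding risk"),
        ThoughtNode("B2: drug interactions"),
    ]),
    ThoughtNode("Path C", children=[
        ThoughtNode("C1: renal function", pruned=True),
    ]),
])

for leaf in live_leaves(tree):
    print(leaf)
```

Only the three branches marked "Promising" survive; Path C contributes nothing because its only child was pruned.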

Implementation: Manual ToT Orchestration

Python

from openai import OpenAI

client = OpenAI()

def generate_thoughts(
    problem: str,
    n_thoughts: int = 3,
    previous_thoughts: str = "",
) -> list[str]:
    """Generate N candidate next reasoning steps."""
    context = f"Problem: {problem}"
    if previous_thoughts:
        context += f"\n\nReasoning so far:\n{previous_thoughts}"

    prompt = f"""{context}

Generate {n_thoughts} different approaches or next steps for solving this problem.
Each approach should explore a different direction.
Number them 1 through {n_thoughts}.
Be concise — one paragraph per approach."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # Higher temperature for diverse thoughts
    )

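    raw = response.choices[0].message.content or ""
    # Parse the numbered list line by line: a line starting with "N." opens
    # thought N, and any other line continues the current thought. This avoids
    # treating numbers mid-sentence (e.g. "INR 2.4") as list markers.
    thoughts: list[str] = []
    for line in raw.splitlines():
        marker = f"{len(thoughts) + 1}."
        if len(thoughts) < n_thoughts and line.strip().startswith(marker):
            thoughts.append(line.strip())
        elif thoughts:
            thoughts[-1] += "\n" + line
    return [t.strip() for t in thoughts]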

def evaluate_thought(
    problem: str,
    thought_path: str,
) -> tuple[float, str]:
    """Score a reasoning path from 0-10 and explain why."""
    prompt = f"""Problem: {problem}

Reasoning path so far:
{thought_path}

Evaluate this reasoning approach:
1. Is it on track to solve the problem? (0-10)
2. Are there logical errors?
3. What's missing?

Respond with:
SCORE: [0-10]
ASSESSMENT: [one paragraph]"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

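    content = response.choices[0].message.content or ""
    score = 0.0  # Fallback if the model returns no parseable SCORE line
    for line in content.splitlines():
        if line.strip().upper().startswith("SCORE:"):
            # Accept "7", "7.5", or "7/10"
            value = line.split(":", 1)[1].split("/")[0].strip()
            try:
                score = float(value)
            except ValueError:
                pass
            break
    assessment = content.split("ASSESSMENT:")[-1].strip()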

    return score, assessment

def tree_of_thought(
    problem: str,
    depth: int = 2,
    branching_factor: int = 3,
    beam_width: int = 2,
) -> str:
    """
    BFS-style Tree of Thought with beam search.
    Keeps top beam_width paths at each level.
    """
    # Initialize with empty thought paths
    current_beams = [{"path": "", "score": 5.0}]

    for level in range(depth):
        print(f"\n--- Level {level + 1} ---")
        all_candidates = []

        for beam in current_beams:
            # Generate new thoughts branching from this beam
            new_thoughts = generate_thoughts(
                problem,
                n_thoughts=branching_factor,
                previous_thoughts=beam["path"],
            )

            for thought in new_thoughts:
                # Label every step, including the first, so paths read consistently
                step = f"[Step {level + 1}] {thought}"
                new_path = f"{beam['path']}\n{step}" if beam["path"] else step
                score, assessment = evaluate_thought(problem, new_path)
                print(f"Score {score:.1f}: {thought[:80]}...")
                all_candidates.append({
                    "path": new_path,
                    "score": score,
                    "assessment": assessment,
                })

        # Stop early if no thoughts could be parsed at this level
        if not all_candidates:
            break

        # Keep top beam_width candidates
        all_candidates.sort(key=lambda x: x["score"], reverse=True)
        current_beams = all_candidates[:beam_width]

    # Generate final answer from best path
    best_path = current_beams[0]["path"]

    final_prompt = f"""Problem: {problem}

After exploring multiple reasoning paths, the most promising approach is:

{best_path}

Based on this reasoning, provide a clear, complete final answer."""

    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0,
    )
    return final_response.choices[0].message.content

# Example
problem = """
A 68-year-old patient on warfarin starts a 10-day course of clarithromycin for pneumonia.
Their last INR was 2.4 (target range 2.0-3.0). How should their anticoagulation be managed?
"""

answer = tree_of_thought(problem, depth=2, branching_factor=3, beam_width=2)
print("\n=== FINAL ANSWER ===")
print(answer)
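
Because every run of `tree_of_thought` makes live API calls, it can help to check the beam-search control flow offline first. The sketch below reimplements only the expand, score, and keep-top-`beam_width` loop with injected functions. `beam_search`, `fake_generate`, and `fake_score` are hypothetical stand-ins rather than part of the code above, and the scores are made up for illustration.

```python
from typing import Callable


def beam_search(
    generate: Callable[[str], list[str]],
    score: Callable[[str], float],
    depth: int = 2,
    beam_width: int = 2,
) -> list[tuple[float, str]]:
    """Expand each kept path, score every candidate, keep the top beam_width."""
    beams: list[tuple[float, str]] = [(0.0, "")]
    for _ in range(depth):
        candidates: list[tuple[float, str]] = []
        for _, path in beams:
            for thought in generate(path):
                new_path = f"{path} -> {thought}" if path else thought
                candidates.append((score(new_path), new_path))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


def fake_generate(path: str) -> list[str]:
    """Stand-in for an LLM call: always proposes the same three branches."""
    return ["A", "B", "C"]


def fake_score(path: str) -> float:
    """Stand-in for an LLM evaluator: made-up scores that favour 'B' steps."""
    weights = {"A": 1.0, "B": 3.0, "C": 2.0}
    return sum(weights[step] for step in path.split(" -> "))


for path_score, path in beam_search(fake_generate, fake_score):
    print(f"{path_score:.1f}  {path}")
```

Swapping the two fakes for real generation and evaluation calls gives the same loop as the full implementation, which makes pruning behaviour easy to test without spending tokens.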

Simplified ToT: Single-Round Multi-Path

For simpler cases, prompt the model to generate and evaluate its own paths in one call:

Python

def simple_tot_prompt(problem: str) -> str:
    """Single-call ToT: model explores paths and selects best."""
    return f"""Problem: {problem}

Think through this step-by-step using the following process:

1. Generate three different approaches:
   Approach A: [describe a different way to tackle this]
   Approach B: [describe another angle]
   Approach C: [describe a third perspective]

2. Evaluate each approach:
   Approach A: Score [1-10], because [reason]
   Approach B: Score [1-10], because [reason]
   Approach C: Score [1-10], because [reason]

3. Select the best approach and develop a full answer:
   Best approach: [letter]
   Full answer: [complete answer using the chosen approach]"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": simple_tot_prompt(problem)}],
    temperature=0.3,
)
print(response.choices[0].message.content)
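
The single-call output is free text, so downstream code usually needs to pull out the chosen approach. A minimal sketch, assuming the model follows the `Best approach:` and `Full answer:` labels from the prompt above. `parse_simple_tot` is a hypothetical helper, and real responses may deviate from the template, in which case the missing field comes back as `None`.

```python
import re
from typing import Optional


def parse_simple_tot(text: str) -> tuple[Optional[str], Optional[str]]:
    """Extract the chosen approach letter and the final answer, if present."""
    # Tolerates "B", "**B**", or "Approach B" after the label.
    letter_match = re.search(
        r"Best approach:\s*\W*(?:Approach\s+)?([ABC])\b", text
    )
    # Everything after "Full answer:" is treated as the answer.
    answer_match = re.search(r"Full answer:\s*(.+)", text, re.DOTALL)
    letter = letter_match.group(1) if letter_match else None
    answer = answer_match.group(1).strip() if answer_match else None
    return letter, answer


# Placeholder text shaped like the requested template, not real model output.
sample = """3. Select the best approach and develop a full answer:
   Best approach: B
   Full answer: <answer text would appear here>"""

print(parse_simple_tot(sample))
```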

When ToT Outperforms Chain-of-Thought

| Task type | CoT | ToT |
|---|---|---|
| Arithmetic | Excellent | Overkill |
| Single-path logic | Good | Overkill |
| Multi-step clinical reasoning | Good | Better |
| Drug interaction analysis | Good | Better |
| Creative problem solving | Adequate | Better |
| Puzzle solving (24 game) | Poor | Good |
| Treatment planning | Adequate | Better |

ToT tends to outperform CoT on tasks where:

Multiple valid intermediate approaches exist
The best approach isn't obvious upfront
Mistakes early in reasoning compound, as in medical decision-making

Cost consideration: the implementation above with depth 2, branching factor 3, and beam width 2 makes 13 LLM calls (3 generation, 9 evaluation, 1 final answer) versus 1 for CoT. Use it selectively for high-stakes, complex problems.
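
Counting every request, including the evaluation calls, gives the real cost of a run. The sketch below does this for the `tree_of_thought` loop defined earlier, assuming one generation call per kept beam, one evaluation call per candidate thought, one final answer call, and that every generation call returns exactly `branching_factor` thoughts. `count_llm_calls` is a hypothetical helper.

```python
def count_llm_calls(depth: int, branching_factor: int, beam_width: int) -> int:
    """Count API calls made by one BFS beam-search ToT run."""
    calls = 0
    beams = 1  # the search starts from a single empty path
    for _ in range(depth):
        calls += beams                        # one generate call per beam
        candidates = beams * branching_factor
        calls += candidates                   # one evaluate call per candidate
        beams = min(beam_width, candidates)   # survivors carried to next level
    return calls + 1                          # plus the final answer call


print(count_llm_calls(depth=2, branching_factor=3, beam_width=2))  # 13
print(count_llm_calls(depth=3, branching_factor=3, beam_width=2))  # 21
```

Each extra level adds `beam_width * (1 + branching_factor)` calls once the beam is full, so depth is the cheaper knob to turn and branching factor the more expensive one.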

ToT for Drug Interaction Analysis

Python

def analyze_drug_interactions_tot(patient_medications: list[str], new_drug: str) -> str:
    """Use ToT to systematically analyze drug interactions for complex polypharmacy."""

    problem = f"""
Patient is on the following medications: {', '.join(patient_medications)}

A new drug {new_drug} is being considered.

Analyze all clinically relevant interactions and provide recommendations.
"""

    # A single call covers three analytical lenses, then a synthesis:
    # Lens 1: Pharmacokinetic interactions (enzyme inhibition/induction, protein binding)
    # Lens 2: Pharmacodynamic interactions (additive/synergistic/antagonistic effects)
    # Lens 3: Risk stratification by severity and clinical significance

    prompt = f"""{problem}

Analyze this systematically using THREE different lenses:

PHARMACOKINETIC LENS:
Consider metabolism (CYP enzymes), protein binding, renal/hepatic clearance.
What PK interactions exist between {new_drug} and each current medication?

PHARMACODYNAMIC LENS:
Consider mechanisms of action, physiological effects.
Where do mechanisms overlap or antagonize?

CLINICAL RISK LENS:
Considering the patient's medications as a group, rank interactions by severity.
Which interactions require immediate action? Which require monitoring?

SYNTHESIS:
Based on all three analyses, provide:
1. The 2-3 highest priority interactions requiring action
2. Recommended management for each
3. Overall recommendation (proceed/avoid/dose-adjust/monitor)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
    )
    return response.choices[0].message.content

Key Principles

Diverse thought generation: Use temperature > 0.7 when generating branches — you want genuinely different approaches, not variations of the same idea.
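
One rough way to check that branches really differ is to measure word overlap between them. The sketch below assumes Jaccard similarity on lowercase word sets is an acceptable proxy for "same idea". The `jaccard` and `drop_near_duplicates` helpers and the 0.6 threshold are illustrative assumptions, not part of the implementation above.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts: 0.0 is disjoint, 1.0 is identical."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def drop_near_duplicates(thoughts: list[str], threshold: float = 0.6) -> list[str]:
    """Keep a thought only if it is not too similar to one already kept."""
    kept: list[str] = []
    for thought in thoughts:
        if all(jaccard(thought, k) < threshold for k in kept):
            kept.append(thought)
    return kept


# Hand-written example thoughts; the second is a near copy of the first.
thoughts = [
    "Check CYP3A4 inhibition by the antibiotic",
    "Check CYP3A4 inhibition by the new antibiotic",
    "Assess bleeding risk from the current INR",
]
print(drop_near_duplicates(thoughts))
```

If a filter like this routinely removes branches, that suggests the generation temperature or prompt is not producing enough variety.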

Honest evaluation: The evaluation prompt must ask the model to critically assess, not just validate. Include "What's missing?" and "Are there errors?" to force critical thinking.

Branching factor vs depth: For most practical problems, 2 branches × 2 levels is sufficient. Increasing the branching factor beyond 4 adds cost without proportional benefit.

Know when not to use it: ToT multiplies LLM cost; the beam-search example above makes 13 calls where CoT makes one. For simple factual questions or well-structured tasks, standard prompting or CoT is more cost-effective.
