Prompt Engineering Mastery · Lesson 5 of 24
Chain-of-Thought: Let's Think Step by Step
Chain of Thought Prompting
Chain of Thought (CoT) prompting is the technique of instructing a language model to produce explicit intermediate reasoning steps before arriving at a final answer. Instead of jumping directly to the answer, the model "thinks out loud" — and this dramatically improves accuracy on tasks requiring multi-step reasoning.
The foundational insight: language models generate tokens sequentially. Each token is conditioned on all previous tokens. When you force the model to write out its reasoning, the reasoning tokens become context for the answer token — effectively giving the model more compute budget to work with.
The Breakthrough Paper
In 2022, Wei et al. at Google Brain published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." They showed that simply adding reasoning chains to few-shot examples caused models to dramatically improve on math word problems, commonsense reasoning, and symbolic reasoning benchmarks.
The key result: in their experiments, CoT gains emerged only at roughly 100B parameters. On smaller models it did not help and sometimes hurt, because those models tended to produce fluent but illogical reasoning chains. On GPT-4-class models, CoT is generally beneficial for reasoning tasks.
Zero-Shot Chain of Thought
The simplest CoT technique requires no examples. You just append a trigger phrase to your prompt:
- "Let's think step by step."
- "Think through this carefully before answering."
- "Reason through each step before giving your final answer."
Without CoT:
A hospital has 3 wards. Ward A has 12 patients on drug X.
Ward B has twice as many patients on drug X as Ward A.
Ward C has 5 fewer patients on drug X than Ward B.
How many patients total are on drug X?

Model output: 29 (often wrong due to arithmetic errors)
With zero-shot CoT:
A hospital has 3 wards. Ward A has 12 patients on drug X.
Ward B has twice as many patients on drug X as Ward A.
Ward C has 5 fewer patients on drug X than Ward B.
How many patients total are on drug X?
Let's think step by step.

Model output:
Step 1: Ward A has 12 patients.
Step 2: Ward B has twice as many as Ward A: 12 × 2 = 24 patients.
Step 3: Ward C has 5 fewer than Ward B: 24 - 5 = 19 patients.
Step 4: Total = 12 + 24 + 19 = 55 patients.
Answer: 55 patients are on drug X.

The reasoning makes each arithmetic step explicit, and the model conditions each step on the previous one, which sharply reduces errors.
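Zero-shot CoT is plain string concatenation, so the prompt side can be sketched without calling any model. The helper name `with_cot_trigger` and its `trigger` parameter are hypothetical, not part of the lesson. The second half re-derives the ward total in plain Python, which gives a reference value to compare a model's answer against.

```python
DEFAULT_TRIGGER = "Let's think step by step."


def with_cot_trigger(question: str, trigger: str = DEFAULT_TRIGGER) -> str:
    """Append a zero-shot CoT trigger phrase on its own line (hypothetical helper)."""
    return f"{question.strip()}\n{trigger}"


question = (
    "A hospital has 3 wards. Ward A has 12 patients on drug X.\n"
    "Ward B has twice as many patients on drug X as Ward A.\n"
    "Ward C has 5 fewer patients on drug X than Ward B.\n"
    "How many patients total are on drug X?"
)
prompt = with_cot_trigger(question)

# Independent check of the arithmetic the model is expected to reproduce.
ward_a = 12
ward_b = 2 * ward_a      # twice as many as Ward A
ward_c = ward_b - 5      # 5 fewer than Ward B
total = ward_a + ward_b + ward_c

print(prompt)
print("Expected total:", total)  # 55
```

Keeping the trigger as a parameter makes it easy to compare the alternative phrasings listed earlier on the same question.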
Extracting the Final Answer
When using CoT, the model produces a reasoning chain plus a final answer. You often need to extract just the answer for downstream processing.
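The parsing step can be tried on its own, without an API call. This is a minimal sketch, assuming the model was told to put "Final Answer: ..." on its own line. The helper name `extract_final_answer` is hypothetical. Taking the last match rather than the first, and returning `None` when the marker is missing, are design choices of this sketch.

```python
import re
from typing import Optional

# Matches a line of the form "Final Answer: <something>" (case-insensitive).
FINAL_ANSWER_RE = re.compile(
    r"^\s*Final Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE
)


def extract_final_answer(text: str) -> Optional[str]:
    """Return the text after the last 'Final Answer:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(text)
    return matches[-1] if matches else None


sample = (
    "Step 1: Ward A has 12 patients.\n"
    "Step 2: Ward B has 12 x 2 = 24 patients.\n"
    "Step 3: Ward C has 24 - 5 = 19 patients.\n"
    "Final Answer: 55\n"
)

print(extract_final_answer(sample))            # 55
print(extract_final_answer("no marker here"))  # None
```

Returning `None` forces the caller to decide what to do when the model ignores the format, instead of silently treating an arbitrary last line as the answer.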
import openai
import re

client = openai.OpenAI()


def cot_reasoning(question: str) -> tuple[str, str]:
    """
    Returns (full_reasoning, extracted_answer).
    Instructs model to end with 'Final Answer: X' for easy extraction.
    """
    prompt = f"""{question}
Let's think step by step. At the end, write "Final Answer: [your answer]" on its own line."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=600,
    )
    full_text = response.choices[0].message.content.strip()
    # Extract final answer; fall back to the last line if the marker is missing
    match = re.search(r"Final Answer:\s*(.+)", full_text, re.IGNORECASE)
    answer = match.group(1).strip() if match else full_text.split("\n")[-1]
    return full_text, answer


question = """
A patient weighs 82 kg. The recommended dose of vancomycin is 25 mg/kg every 8 hours.
The available vials contain 500 mg in 10 mL.
How many mL should be drawn up for each dose?
"""
reasoning, answer = cot_reasoning(question)
print("=== REASONING ===")
print(reasoning)
print("\n=== FINAL ANSWER ===")
print(answer)

Few-Shot Chain of Thought
In few-shot CoT, your examples include the full reasoning chain — not just the input and final answer. This teaches the model both the task and the reasoning style.
Classify the drug interaction risk and explain your reasoning.
Input: Warfarin + aspirin 325mg
Reasoning: Warfarin is an anticoagulant that inhibits vitamin K-dependent clotting factors.
Aspirin irreversibly inhibits platelet COX-1, reducing platelet aggregation.
Together they have additive bleeding risk. High-dose aspirin also displaces warfarin
from plasma proteins, potentially increasing free warfarin levels.
This combination significantly increases major bleeding risk.
Risk Level: HIGH
Recommendation: Avoid combination unless benefit clearly outweighs risk; prefer low-dose aspirin 81mg if required.
Input: Metformin + lisinopril
Reasoning: Metformin reduces hepatic glucose production and improves insulin sensitivity.
Lisinopril is an ACE inhibitor used for hypertension and nephroprotection.
No direct pharmacokinetic or pharmacodynamic interaction exists between them.
Both are commonly co-prescribed in diabetic patients with hypertension.
Risk Level: LOW
Recommendation: No significant interaction. Monitor renal function as both can affect kidneys.
Input: Methotrexate + trimethoprim
Reasoning:

The model will generate reasoning in the established format, then produce a risk level and recommendation. This approach tends to be more accurate than asking for the answer directly, because the format pushes the model to write out each step, and each reasoning step conditions the next.
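Because the examples fix the output format, a completion can be split back into its three fields with a couple of regular expressions. This is a minimal sketch: `parse_assessment` is a hypothetical helper, and the sample text is copied from the warfarin example above rather than taken from real model output.

```python
import re


def parse_assessment(completion: str) -> dict:
    """Split 'Reasoning / Risk Level / Recommendation' text into a dict."""
    fields = {"reasoning": "", "risk_level": "", "recommendation": ""}
    flags = re.IGNORECASE | re.MULTILINE
    risk = re.search(r"^Risk Level:\s*(\w+)", completion, flags)
    rec = re.search(r"^Recommendation:\s*(.+)", completion, flags)

    # Everything before the 'Risk Level:' line is treated as reasoning.
    reasoning_end = risk.start() if risk else len(completion)
    reasoning = completion[:reasoning_end]
    reasoning = re.sub(r"^\s*Reasoning:\s*", "", reasoning, flags=re.IGNORECASE)
    fields["reasoning"] = " ".join(reasoning.split())  # collapse line breaks

    if risk:
        fields["risk_level"] = risk.group(1).upper()
    if rec:
        fields["recommendation"] = rec.group(1).strip()
    return fields


sample = (
    "Reasoning: Warfarin is an anticoagulant that inhibits vitamin K-dependent clotting factors.\n"
    "Together they have additive bleeding risk.\n"
    "Risk Level: HIGH\n"
    "Recommendation: Avoid combination unless benefit clearly outweighs risk; "
    "prefer low-dose aspirin 81mg if required.\n"
)

parsed = parse_assessment(sample)
print(parsed["risk_level"])
print(parsed["recommendation"])
```

Empty strings for missing fields let downstream code detect a completion that drifted from the format.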
Few-Shot CoT with Python Code
import openai

client = openai.OpenAI()

DRUG_INTERACTION_EXAMPLES = """
Classify the drug interaction risk and explain your reasoning.
Input: Warfarin + aspirin 325mg
Reasoning: Warfarin inhibits vitamin K-dependent clotting factors. Aspirin irreversibly inhibits platelet COX-1. Together: additive bleeding risk. High-dose aspirin may also increase free warfarin levels via protein displacement. Combination significantly increases major hemorrhage risk.
Risk Level: HIGH
Recommendation: Avoid; if required use aspirin 81mg; monitor INR closely.
Input: Metformin + lisinopril
Reasoning: No direct PK or PD interaction. Both used routinely together in T2DM with hypertension. No dose adjustments needed. Monitor renal function as both can affect GFR.
Risk Level: LOW
Recommendation: No significant interaction; safe to co-prescribe.
Input: Simvastatin + clarithromycin
Reasoning: Clarithromycin is a strong CYP3A4 inhibitor. Simvastatin is a CYP3A4 substrate. Co-administration can increase simvastatin plasma levels 5-10x, dramatically increasing myopathy and rhabdomyolysis risk.
Risk Level: HIGH
Recommendation: Contraindicated. Suspend simvastatin during clarithromycin course; use pravastatin (not CYP3A4 dependent) if statin needed.
"""


def assess_drug_interaction(drug1: str, drug2: str) -> str:
    # Append the new case in the same format, ending on "Reasoning:" so the
    # model continues with its reasoning chain.
    prompt = DRUG_INTERACTION_EXAMPLES + f"Input: {drug1} + {drug2}\nReasoning:"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a clinical pharmacist. Follow the exact reasoning format shown in the examples.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.1,
        max_tokens=400,
    )
    return "Reasoning:" + response.choices[0].message.content


print(assess_drug_interaction("Clopidogrel", "omeprazole"))

Why Chain of Thought Works
There are two complementary explanations:
1. Compute budget theory
Transformers process tokens in parallel within a layer, but generate tokens sequentially (autoregressive). Each generated token represents a "reasoning step" the model can use. Without CoT, the model has only the input context to derive an answer. With CoT, it generates intermediate tokens that are then available as context for the final answer token.
This is why CoT helps on tasks that require multiple sequential steps (math, logic) but provides less benefit on tasks that require a single lookup (factual recall, translation).
2. Pattern matching to training data
The internet contains enormous amounts of worked-problem solutions — textbooks, math forums, legal memos, scientific papers. These all show step-by-step reasoning. When you trigger CoT, you activate these patterns. The model is essentially retrieving and instantiating a "how to solve this type of problem" template from its weights.
When CoT Helps vs. Hurts
| Scenario | Use CoT? | Why |
|---|---|---|
| Multi-step arithmetic | Yes | Each step builds on the last |
| Logical deduction (if A then B...) | Yes | Avoids jumping to wrong conclusions |
| Medical/legal reasoning with criteria | Yes | Forces explicit checklist traversal |
| Creative writing | No | CoT breaks flow; the "reasoning" is unnecessary overhead |
| Simple classification | No | Adds tokens without improving accuracy |
| Factual lookup | No | The model either knows the fact or not; CoT doesn't help |
| Very short context window | No | CoT consumes tokens you may not have |
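The table can be encoded as a small lookup for pipelines where the task type is already known. This is only a sketch: the task-type labels, the function name `should_use_cot`, and the 400-token overhead threshold are all assumptions chosen for illustration, not values from the lesson.

```python
# Hypothetical task-type labels mirroring the rows of the table above.
COT_RECOMMENDED = {
    "multi_step_arithmetic": True,
    "logical_deduction": True,
    "criteria_based_reasoning": True,
    "creative_writing": False,
    "simple_classification": False,
    "factual_lookup": False,
}


def should_use_cot(task_type: str, tokens_available: int, cot_overhead: int = 400) -> bool:
    """Decide whether to add a CoT trigger.

    cot_overhead is an assumed token budget for the reasoning chain; when the
    remaining context is smaller than that, CoT is skipped regardless of task.
    """
    if tokens_available < cot_overhead:
        return False
    # Unknown task types default to no CoT in this sketch.
    return COT_RECOMMENDED.get(task_type, False)


print(should_use_cot("multi_step_arithmetic", tokens_available=2000))  # True
print(should_use_cot("factual_lookup", tokens_available=2000))         # False
print(should_use_cot("multi_step_arithmetic", tokens_available=100))   # False
```

Defaulting unknown task types to `False` keeps cost down; flipping that default is reasonable if accuracy matters more than token spend.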
Controlling CoT Output Format
For production systems, you want the model to produce structured CoT that's easy to parse:
# Reuses `re` and `client` from the earlier examples.
def structured_cot(question: str, criteria: list[str]) -> dict:
    """
    Forces the model to address each criterion explicitly before answering.
    Returns a dict with reasoning per criterion and a final answer.
    """
    criteria_list = "\n".join(f"{i+1}. {c}" for i, c in enumerate(criteria))
    prompt = f"""Answer the question below. Address each criterion explicitly before giving your final answer.
Question: {question}
Criteria to address:
{criteria_list}
Format your response as:
Criterion 1: [your analysis]
Criterion 2: [your analysis]
...
Final Answer: [your answer]"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content
    result = {}
    for i, criterion in enumerate(criteria):
        # Capture everything up to the next criterion, the final answer, or end of text
        pattern = rf"Criterion {i+1}:\s*(.*?)(?=Criterion {i+2}:|Final Answer:|$)"
        match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
        result[f"criterion_{i+1}"] = match.group(1).strip() if match else ""
    final_match = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL | re.IGNORECASE)
    result["final_answer"] = final_match.group(1).strip() if final_match else ""
    return result


# Example: criteria-driven assessment of a clinical question
result = structured_cot(
    question="Should we add an SGLT2 inhibitor to a 72-year-old with HFrEF and eGFR 32?",
    criteria=[
        "Patient eligibility based on label and guidelines",
        "Renal dosing considerations",
        "Benefit-risk balance in this patient profile",
        "Monitoring requirements",
    ]
)
for k, v in result.items():
    print(f"\n[{k}]\n{v}")

Automatic Triggering
For a mixed-task pipeline where some inputs need CoT and some don't, you can add a routing layer:
ROUTING_PROMPT = """Determine if this question requires multi-step reasoning (YES) or can be answered directly (NO).
Question: {question}
Respond with exactly YES or NO."""


def needs_cot(question: str) -> bool:
    # A smaller, cheaper model makes the routing decision
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": ROUTING_PROMPT.format(question=question)}
        ],
        temperature=0.0,
        max_tokens=5,
    )
    return "YES" in response.choices[0].message.content.upper()


def smart_answer(question: str) -> str:
    if needs_cot(question):
        # Reasoning path: reuse cot_reasoning and keep only the extracted answer
        _, answer = cot_reasoning(question)
    else:
        # Direct path: send the question as-is
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
            temperature=0.0,
        )
        answer = response.choices[0].message.content.strip()
    return answer

Summary
Chain of thought prompting is one of the highest-impact techniques in prompt engineering for reasoning tasks. Key takeaways:
- Append "Let's think step by step." for instant zero-shot CoT.
- Use few-shot CoT when you want to control the reasoning style and format.
- Always end with "Final Answer: X" for easy programmatic extraction.
- CoT works by giving the model intermediate tokens to condition on — effectively extending its compute budget.
- Skip CoT for creative writing, simple classification, and factual lookups, where it usually adds cost without improving accuracy.
Next up: Tree of Thought — extending CoT by exploring multiple reasoning branches in parallel and selecting the best.