Agents & Tools Interview Prep · Lesson 8 of 12
Handling Tool Errors and Retries
Why Error Handling Matters in Agent Loops
In a standard API call, an exception propagates up the stack and the caller decides what to do. In an agent loop, an unhandled exception in a tool function breaks the entire conversation. The LLM never sees what went wrong, cannot adapt, and the user gets a crash rather than an explanation.
The correct pattern: tools never raise exceptions to the caller. They catch all failures internally and return a structured error dict that the LLM can read and respond to intelligently.
The Core Pattern: Try/Except in Every Tool
import logging

logger = logging.getLogger(__name__)


def get_patient_record(patient_id: str) -> dict:
    """
    Always returns a dict; never raises.

    On success: {"success": True, "data": {...}}
    On failure: {"success": False, "error": "...", "hint": "..."}
    """
    try:
        # get_db_connection() is assumed to be defined elsewhere in the project.
        conn = get_db_connection()
        record = conn.execute(
            "SELECT * FROM patients WHERE patient_id = %s",
            (patient_id,)
        ).fetchone()
        if record is None:
            return {
                "success": False,
                "error": "Patient not found",
                "patient_id": patient_id,
                "hint": "Verify the patient ID format (P-NNNNN) and try again."
            }
        return {
            "success": True,
            "data": dict(record)
        }
    except ConnectionError as e:
        logger.error("DB connection failed in get_patient_record: %s", e)
        return {
            "success": False,
            "error": "Database temporarily unavailable",
            "patient_id": patient_id,
            "hint": "Try again in a few seconds. If the problem persists, contact IT."
        }
    except Exception as e:
        # Catch-all: log the full traceback, return a short summary to the LLM
        logger.exception("Unexpected error in get_patient_record for %s", patient_id)
        return {
            "success": False,
            "error": "Unexpected error",
            "detail": str(e),
            "patient_id": patient_id
        }

The LLM reads the error dict, understands what went wrong, and responds accordingly, e.g., "I wasn't able to find patient P-99999. Could you double-check the ID?"
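Repeating the same catch-all in every tool is easy to forget. One possible safety net is a decorator that converts any uncaught exception into the same error-dict shape. This is a minimal sketch, not part of the lesson's code: the names never_raises and divide are hypothetical, and the specific except branches inside each tool are still what produce useful hints.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def never_raises(tool_fn: Callable[..., dict]) -> Callable[..., dict]:
    """Hypothetical decorator: turn any uncaught exception into an error dict."""

    @functools.wraps(tool_fn)
    def wrapper(*args: Any, **kwargs: Any) -> dict:
        try:
            return tool_fn(*args, **kwargs)
        except Exception as exc:
            # Full traceback goes to the log; the LLM only gets a short summary.
            logger.exception("Unhandled error in tool %s", tool_fn.__name__)
            return {
                "success": False,
                "error": "Unexpected error",
                "detail": f"{type(exc).__name__}: {exc}",
                "tool": tool_fn.__name__,
            }

    return wrapper


@never_raises
def divide(numerator: float, denominator: float) -> dict:
    """Toy tool used only to demonstrate the wrapper."""
    return {"success": True, "data": numerator / denominator}


print(divide(6, 3))           # {'success': True, 'data': 2.0}
print(divide(1, 0)["error"])  # Unexpected error
```

The decorator is a backstop, not a replacement: it cannot know that a missing row deserves a "verify the ID" hint, so the explicit handling inside each tool still carries the useful detail.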
Error Response Format
Use one error format across all tools so the LLM can recognize and interpret every failure the same way:
from typing import Optional


def make_error(
    error: str,
    hint: Optional[str] = None,
    retry_suggested: bool = False,
    **extra
) -> dict:
    """Standard error response builder."""
    result = {
        "success": False,
        "error": error,
    }
    if hint:
        result["hint"] = hint
    if retry_suggested:
        result["retry_suggested"] = True
    result.update(extra)
    return result


# Usage (each return statement belongs inside a tool function)
return make_error(
    "Drug not found in formulary",
    hint="Try the generic name instead of the brand name.",
    drug_name=drug_name
)

return make_error(
    "External API timeout",
    hint="The service is slow. Retry once.",
    retry_suggested=True
)

When retry_suggested is True, an LLM whose system prompt tells it to honor the flag will usually retry the tool call on its own. That behavior comes from prompting, not from the API, so it is not guaranteed.
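Because a prompted retry depends on the model's cooperation, the agent loop can also enforce a hard cap on repeated calls. The sketch below is one possible approach under stated assumptions: RetryBudget is a hypothetical helper, and treating "same tool name plus same arguments" as one logical call is a design choice, not something the lesson prescribes.

```python
import json
from collections import Counter


class RetryBudget:
    """Hypothetical helper: caps how often one identical tool call may run."""

    def __init__(self, max_attempts: int = 2) -> None:
        self.max_attempts = max_attempts
        self._attempts: Counter = Counter()

    def _key(self, tool_name: str, arguments: dict) -> str:
        # Sort keys so {"a": 1, "b": 2} and {"b": 2, "a": 1} count as one call.
        return f"{tool_name}:{json.dumps(arguments, sort_keys=True, default=str)}"

    def allow(self, tool_name: str, arguments: dict) -> bool:
        """Record one attempt and report whether it is still within budget."""
        key = self._key(tool_name, arguments)
        self._attempts[key] += 1
        return self._attempts[key] <= self.max_attempts


budget = RetryBudget(max_attempts=2)
call = ("fetch_drug", {"drug_name": "metformin"})
print(budget.allow(*call))  # True: first attempt
print(budget.allow(*call))  # True: the single retry
print(budget.allow(*call))  # False: budget exhausted
```

When allow() returns False, the loop can skip the tool and append an error dict without retry_suggested, which tells the LLM to stop retrying and explain the failure to the user.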
Retry Logic Inside the Tool
Some failures are transient and are best retried inside the tool, with a short backoff between attempts, before any error reaches the LLM: network timeouts, temporary database unavailability, and rate limit responses from external APIs.
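The waits involved are easier to reason about once written out. This is a small sketch assuming the formula base_delay * 2 ** (attempt - 1) and the defaults (three attempts, 0.5 s base) used by this lesson's retry code. The function name backoff_delays and the jitter parameter are additions for illustration, not part of the lesson.

```python
import random
from typing import List


def backoff_delays(
    max_retries: int = 3,
    base_delay: float = 0.5,
    jitter: float = 0.0,
) -> List[float]:
    """Return the wait after each failed attempt: base_delay * 2 ** (attempt - 1)."""
    delays = []
    for attempt in range(1, max_retries + 1):
        delay = base_delay * (2 ** (attempt - 1))
        # Optional jitter (an assumption, not in the lesson): add a random
        # extra of up to `jitter` times the delay so clients do not retry in lockstep.
        delay += random.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())                    # [0.5, 1.0, 2.0]
print(sum(backoff_delays(max_retries=5)))  # 15.5
```

The second line shows why max_retries needs a low ceiling: with doubling delays, five attempts already add up to 15.5 seconds of waiting, during which the user sees nothing.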
import logging
import time

import httpx

logger = logging.getLogger(__name__)


def fetch_drug_from_external_api(
    drug_name: str,
    max_retries: int = 3,
    base_delay: float = 0.5
) -> dict:
    """
    Fetches drug information from an external API.
    Retries on timeouts, 429 responses, and 5xx errors with exponential backoff.
    Uses make_error(), the standard error builder defined earlier.
    """
    url = "https://api.rxnorm.nlm.nih.gov/REST/drugs.json"
    for attempt in range(1, max_retries + 1):
        is_last_attempt = attempt == max_retries
        try:
            with httpx.Client(timeout=5.0) as client:
                response = client.get(url, params={"name": drug_name})
            if response.status_code == 200:
                data = response.json()
                return {"success": True, "data": data, "attempts": attempt}
            if response.status_code == 429:
                # Rate limited: back off longer than for other failures
                wait = base_delay * (2 ** attempt)
                logger.warning("Rate limited on attempt %d/%d (backoff %.1fs)",
                               attempt, max_retries, wait)
                if not is_last_attempt:
                    time.sleep(wait)
                continue
            if response.status_code >= 500:
                # Server error: retry
                wait = base_delay * (2 ** (attempt - 1))
                logger.warning("Server error %d on attempt %d/%d (backoff %.1fs)",
                               response.status_code, attempt, max_retries, wait)
                if not is_last_attempt:
                    time.sleep(wait)
                continue
            # Other client errors (400-range): don't retry
            return make_error(
                f"API returned {response.status_code}",
                hint="Check the drug name spelling.",
                drug_name=drug_name
            )
        except httpx.TimeoutException:
            wait = base_delay * (2 ** (attempt - 1))
            logger.warning("Timeout on attempt %d/%d (backoff %.1fs)",
                           attempt, max_retries, wait)
            # No point sleeping after the final attempt
            if not is_last_attempt:
                time.sleep(wait)
            continue
        except httpx.RequestError as e:
            logger.error("Request error: %s", e)
            return make_error("Network error", detail=str(e), drug_name=drug_name)
        except Exception as e:
            # Safety net (e.g. an invalid JSON body) so the tool still never raises
            logger.exception("Unexpected error fetching drug: %s", drug_name)
            return make_error("Unexpected error", detail=str(e), drug_name=drug_name)
    # All retries exhausted
    logger.error("All %d attempts failed for drug: %s", max_retries, drug_name)
    return make_error(
        "External service unavailable after multiple retries",
        hint="Try again later or check the service status.",
        drug_name=drug_name,
        attempts_made=max_retries
    )

Max Retries + Fallback Behavior
Some tools have a fallback — if the primary source fails, try a secondary one.
def get_drug_info_with_fallback(drug_name: str) -> dict:
    """
    Try primary database first, fall back to external API,
    fall back to cached data if both fail.
    """
    # Attempt 1: Internal database (fastest, most reliable)
    primary = query_internal_formulary(drug_name)
    if primary.get("success"):
        return {**primary, "source": "internal_formulary"}
    logger.warning("Internal formulary failed for %s: %s", drug_name, primary.get("error"))

    # Attempt 2: External API
    secondary = fetch_drug_from_external_api(drug_name)
    if secondary.get("success"):
        return {**secondary, "source": "external_api"}
    logger.warning("External API also failed for %s: %s", drug_name, secondary.get("error"))

    # Attempt 3: Cached data (may be stale)
    cache = get_from_cache(f"drug:{drug_name.lower()}")
    if cache:
        return {
            "success": True,
            "data": cache,
            "source": "cache",
            "warning": "Data may be up to 24 hours old"
        }

    # All sources exhausted
    return make_error(
        "Drug information unavailable from all sources",
        hint="Try the generic name or contact the pharmacy team directly.",
        drug_name=drug_name,
        sources_tried=["internal_formulary", "external_api", "cache"]
    )

Handling Errors in the Agent Loop
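The agent loop must append error results just like success results. The LLM needs to see the error to respond appropriately, and the Chat Completions API rejects the next request if any tool_call_id in the assistant message has no matching tool message.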
import json

import openai

client = openai.OpenAI()


def run_resilient_agent(user_message: str, tools: list, tool_map: dict) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a clinical assistant. When a tool returns an error with "
                "'retry_suggested: true', retry the tool call once. "
                "When a tool returns an error, explain the situation to the user "
                "clearly and suggest next steps based on the 'hint' field."
            )
        },
        {"role": "user", "content": user_message}
    ]
    # Hard cap on iterations so a confused model cannot loop forever
    for _ in range(8):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        messages.append(msg)
        for tc in msg.tool_calls:
            fn_name = tc.function.name
            try:
                fn_args = json.loads(tc.function.arguments)
            except json.JSONDecodeError as e:
                # The LLM returned malformed JSON arguments (uncommon, but it happens)
                result = {
                    "success": False,
                    "error": "Malformed tool arguments",
                    "detail": str(e),
                    "raw_arguments": tc.function.arguments
                }
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": json.dumps(result)
                })
                continue
            if fn_name not in tool_map:
                result = {
                    "success": False,
                    "error": f"Tool '{fn_name}' not available",
                    "available_tools": list(tool_map.keys())
                }
            else:
                try:
                    # Tool bodies return error dicts, but wrong or missing argument
                    # names raise TypeError at call time, before the body runs
                    result = tool_map[fn_name](**fn_args)
                except Exception as e:
                    result = {
                        "success": False,
                        "error": "Tool call failed",
                        "detail": str(e)
                    }
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result, default=str)
            })
    return "Unable to complete request: too many iterations."

Testing Error Paths
Error handling only works if you test it. Use mocking to simulate failures:
from unittest.mock import MagicMock, patch

import httpx

# Module paths are inferred from the patch targets used below.
from tools.drug import fetch_drug_from_external_api
from tools.patient import get_patient_record


def test_get_patient_record_db_failure():
    """Tool should return error dict, not raise, when DB is down."""
    with patch("tools.patient.get_db_connection") as mock_conn:
        mock_conn.side_effect = ConnectionError("DB connection refused")
        result = get_patient_record("P-00123")
    assert result["success"] is False
    assert "Database temporarily unavailable" in result["error"]
    assert "hint" in result


def test_get_patient_record_not_found():
    """Tool should return structured not-found error."""
    with patch("tools.patient.get_db_connection") as mock_conn:
        mock_cursor = MagicMock()
        mock_cursor.fetchone.return_value = None
        mock_conn.return_value.execute.return_value = mock_cursor
        result = get_patient_record("P-99999")
    assert result["success"] is False
    assert "not found" in result["error"].lower()
    assert result["patient_id"] == "P-99999"


def test_fetch_drug_retries_on_timeout():
    """Tool should retry up to max_retries times on timeout."""
    call_count = 0
    with patch("tools.drug.httpx.Client") as mock_client_cls:
        def side_effect(*args, **kwargs):
            # Fail the first two calls with a timeout, succeed on the third
            nonlocal call_count
            call_count += 1
            if call_count < 3:
                raise httpx.TimeoutException("Timeout")
            mock_response = MagicMock()
            mock_response.status_code = 200
            mock_response.json.return_value = {"drugGroup": {"conceptGroup": []}}
            return mock_response

        # Make the mocked client usable as a context manager
        mock_client = MagicMock()
        mock_client.__enter__ = MagicMock(return_value=mock_client)
        mock_client.__exit__ = MagicMock(return_value=False)
        mock_client.get.side_effect = side_effect
        mock_client_cls.return_value = mock_client
        # base_delay=0 keeps the test fast: backoff sleeps become no-ops
        result = fetch_drug_from_external_api("Metformin", max_retries=3, base_delay=0)
    assert result["success"] is True
    assert result["attempts"] == 3
    assert call_count == 3

Common Error Categories and How to Handle Each
| Error Type | Strategy | LLM Hint |
|---|---|---|
| Not found | Return structured not-found error immediately | "Verify the ID/name and try again" |
| Validation failure | Return field-level errors before any I/O | "Correct these fields: ..." |
| Network timeout | Retry with backoff, then return error | "Service slow, try again later" |
| Rate limit (429) | Exponential backoff retry | "Try again in a few seconds" |
| Server error (5xx) | Retry up to 3 times | "Service error, retrying" |
| Client error (4xx) | Return immediately, no retry | "Check input parameters" |
| DB connection | Return error with IT contact hint | "Contact IT support" |
| Permission denied | Return clear access error | "You don't have access to this data" |
| Data quality | Return with warning, include partial data | "Data may be incomplete" |
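The "Validation failure" row is the only strategy in the table that the lesson's code does not show. A minimal sketch of it follows, assuming the patient ID format "P-NNNNN" from the earlier hint text means "P-" followed by exactly five digits. The names validate_patient_lookup and get_patient_record_validated are hypothetical.

```python
import re
from typing import Dict

# Assumed format, read off the lesson's hint text "P-NNNNN".
PATIENT_ID_PATTERN = re.compile(r"P-\d{5}")


def validate_patient_lookup(patient_id: object) -> Dict[str, str]:
    """Return a field -> problem mapping; an empty dict means the input is valid."""
    errors: Dict[str, str] = {}
    if not isinstance(patient_id, str):
        errors["patient_id"] = "Must be a string."
    elif not PATIENT_ID_PATTERN.fullmatch(patient_id.strip()):
        errors["patient_id"] = "Expected format P-NNNNN, for example P-00123."
    return errors


def get_patient_record_validated(patient_id: object) -> dict:
    """Hypothetical wrapper: reject bad input before touching the database."""
    field_errors = validate_patient_lookup(patient_id)
    if field_errors:
        return {
            "success": False,
            "error": "Validation failed",
            "field_errors": field_errors,
            "hint": "Correct these fields: " + ", ".join(field_errors),
        }
    # A real tool would query the database here; this sketch stops at validation.
    return {"success": True, "data": {"patient_id": str(patient_id).strip()}}


print(get_patient_record_validated("12345")["hint"])       # Correct these fields: patient_id
print(get_patient_record_validated("P-00123")["success"])  # True
```

Validating first keeps malformed input from costing a database round trip, and the field-level messages give the LLM something specific to correct or to ask the user about.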
Summary
- Tools must never raise exceptions; they always return a dict
- Use a consistent error format with success, error, and hint fields
- Implement retry logic for transient failures (timeout, 5xx) with exponential backoff
- Use fallback sources when the primary source is unavailable
- The agent loop must append error results, not skip them
- Add retry_suggested: True to error dicts when the LLM should retry automatically
- Test all error paths with mocking; error handling you haven't tested doesn't work