Handling Tool Errors Gracefully
Tools fail. Learn how to catch exceptions, return structured error results the LLM can reason about, implement retry logic, and build resilient agent loops.
Why Error Handling Matters in Agent Loops
In a standard API call, an exception propagates up the stack and the caller decides what to do. In an agent loop, an unhandled exception in a tool function breaks the entire conversation. The LLM never sees what went wrong, cannot adapt, and the user gets a crash rather than an explanation.
The correct pattern: tools never raise exceptions to the caller. They catch all failures internally and return a structured error dict that the LLM can read and respond to intelligently.
The Core Pattern: Try/Except in Every Tool
    import json
    import logging

    logger = logging.getLogger(__name__)


    def get_patient_record(patient_id: str) -> dict:
        """
        Always returns a dict; never raises.
        On success: {"success": True, "data": {...}}
        On failure: {"success": False, "error": "...", "hint": "..."}
        """
        try:
            # get_db_connection() is assumed to be defined elsewhere and to
            # return a connection that exposes execute() (as psycopg 3 does).
            conn = get_db_connection()
            record = conn.execute(
                "SELECT * FROM patients WHERE patient_id = %s",
                (patient_id,)
            ).fetchone()

            if record is None:
                return {
                    "success": False,
                    "error": "Patient not found",
                    "patient_id": patient_id,
                    "hint": "Verify the patient ID format (P-NNNNN) and try again."
                }

            return {
                "success": True,
                "data": dict(record)
            }

        except ConnectionError as e:
            logger.error("DB connection failed in get_patient_record: %s", e)
            return {
                "success": False,
                "error": "Database temporarily unavailable",
                "patient_id": patient_id,
                "hint": "Try again in a few seconds. If the problem persists, contact IT."
            }

        except Exception as e:
            # Catch-all so that no failure escapes the tool
            logger.exception("Unexpected error in get_patient_record for %s", patient_id)
            return {
                "success": False,
                "error": "Unexpected error",
                "detail": str(e),
                "patient_id": patient_id
            }

The LLM reads the error dict, understands what went wrong, and responds accordingly, for example: "I wasn't able to find patient P-99999. Could you double-check the ID?"
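Hand-written try/except blocks are easy to forget in one tool out of many. As a safety net, a small decorator can enforce the never-raise contract for any tool function. The following is a minimal sketch, not a library API: the names safe_tool and divide are hypothetical, and the error fields simply mirror the dict format shown above.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Hypothetical decorator: convert any escaped exception into an error dict."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> dict:
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            # Also catches TypeError raised when the call passes wrong arguments
            logger.exception("Unhandled error in tool %s", fn.__name__)
            return {
                "success": False,
                "error": "Unexpected error",
                "detail": str(e),
                "tool": fn.__name__,
            }

    return wrapper


@safe_tool
def divide(a: float, b: float) -> dict:
    """Toy tool used only to demonstrate the wrapper."""
    return {"success": True, "data": a / b}
```

Calling divide(1, 0) returns a dict with "success": False instead of raising ZeroDivisionError. The decorator is a backstop, not a replacement for specific handlers: a generic "Unexpected error" gives the LLM far less to work with than a targeted error and hint.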
Error Response Format
Use a consistent error format across all tools so the LLM learns to recognize them:
    from typing import Optional


    def make_error(
        error: str,
        hint: Optional[str] = None,
        retry_suggested: bool = False,
        **extra
    ) -> dict:
        """Standard error response builder."""
        result = {
            "success": False,
            "error": error,
        }
        if hint:
            result["hint"] = hint
        if retry_suggested:
            result["retry_suggested"] = True
        # Any additional keyword arguments become extra fields (e.g. drug_name)
        result.update(extra)
        return result


    # Usage (each return statement sits inside a tool function)
    return make_error(
        "Drug not found in formulary",
        hint="Try the generic name instead of the brand name.",
        drug_name=drug_name
    )

    return make_error(
        "External API timeout",
        hint="The service is slow. Retry once.",
        retry_suggested=True
    )

When retry_suggested is True, an LLM whose system prompt explains the flag will usually retry the tool call on its own. This is model behavior rather than a guarantee, so cap the number of loop iterations as well.
Retry Logic Inside the Tool
Some failures are transient and should be retried immediately — network timeouts, temporary database unavailability, rate limit responses from external APIs.
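Before the full tool, it can help to see the delay schedule on its own. The sketch below computes base_delay * 2 ** (attempt - 1), the same formula the tool in this section uses for timeouts and server errors. The max_delay cap and the optional jitter are additions for illustration and are not part of the tool; backoff_delays is a hypothetical helper name.

```python
import random
from typing import List


def backoff_delays(
    max_retries: int,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    jitter: float = 0.0,
) -> List[float]:
    """Return the wait (in seconds) that follows each failed attempt."""
    delays = []
    for attempt in range(1, max_retries + 1):
        # Double the delay on every attempt, but never exceed max_delay
        delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
        if jitter:
            # Add up to jitter * delay of random extra wait to spread out clients
            delay += random.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays(4))  # [0.5, 1.0, 2.0, 4.0]
```

Jitter matters when many clients fail at the same moment: without it they all retry in lockstep and can overload the recovering service again.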
    import time
    import httpx
    import logging

    logger = logging.getLogger(__name__)


    def fetch_drug_from_external_api(
        drug_name: str,
        max_retries: int = 3,
        base_delay: float = 0.5
    ) -> dict:
        """
        Fetches drug information from an external API.
        Retries on timeout, 429, and 5xx responses with exponential backoff.
        Uses make_error() from the previous section.
        """
        url = "https://api.rxnorm.nlm.nih.gov/REST/drugs.json"

        for attempt in range(1, max_retries + 1):
            try:
                with httpx.Client(timeout=5.0) as client:
                    response = client.get(url, params={"name": drug_name})

                if response.status_code == 200:
                    data = response.json()
                    return {"success": True, "data": data, "attempts": attempt}

                if response.status_code == 429:
                    # Rate limited: wait twice as long as for other retries
                    wait = base_delay * (2 ** attempt)
                    logger.warning("Rate limited on attempt %d. Waiting %.1fs", attempt, wait)
                elif response.status_code >= 500:
                    # Server error: retry
                    wait = base_delay * (2 ** (attempt - 1))
                    logger.warning("Server error %d on attempt %d. Waiting %.1fs",
                                   response.status_code, attempt, wait)
                else:
                    # Any other status (typically 4xx): don't retry
                    return make_error(
                        f"API returned {response.status_code}",
                        hint="Check the drug name spelling.",
                        drug_name=drug_name
                    )

            except httpx.TimeoutException:
                wait = base_delay * (2 ** (attempt - 1))
                logger.warning("Timeout on attempt %d/%d. Waiting %.1fs", attempt, max_retries, wait)

            except httpx.RequestError as e:
                logger.error("Request error: %s", e)
                return make_error("Network error", detail=str(e), drug_name=drug_name)

            except ValueError as e:
                # 200 response whose body was not valid JSON
                logger.error("Invalid JSON from API: %s", e)
                return make_error("Invalid response from external API",
                                  detail=str(e), drug_name=drug_name)

            # Sleep only if another attempt will follow
            if attempt < max_retries:
                time.sleep(wait)

        # All retries exhausted
        logger.error("All %d attempts failed for drug: %s", max_retries, drug_name)
        return make_error(
            "External service unavailable after multiple retries",
            hint="Try again later or check the service status.",
            drug_name=drug_name,
            attempts_made=max_retries
        )

Max Retries + Fallback Behavior
Some tools have a fallback — if the primary source fails, try a secondary one.
    def get_drug_info_with_fallback(drug_name: str) -> dict:
        """
        Try primary database first, fall back to external API,
        fall back to cached data if both fail.
        """
        # Attempt 1: Internal database (fastest, most reliable)
        primary = query_internal_formulary(drug_name)
        if primary.get("success"):
            return {**primary, "source": "internal_formulary"}

        logger.warning("Internal formulary failed for %s: %s", drug_name, primary.get("error"))

        # Attempt 2: External API
        secondary = fetch_drug_from_external_api(drug_name)
        if secondary.get("success"):
            return {**secondary, "source": "external_api"}

        logger.warning("External API also failed for %s: %s", drug_name, secondary.get("error"))

        # Attempt 3: Cached data (may be stale)
        cache = get_from_cache(f"drug:{drug_name.lower()}")
        if cache:
            return {
                "success": True,
                "data": cache,
                "source": "cache",
                "warning": "Data may be up to 24 hours old"
            }

        # All sources exhausted
        return make_error(
            "Drug information unavailable from all sources",
            hint="Try the generic name or contact the pharmacy team directly.",
            drug_name=drug_name,
            sources_tried=["internal_formulary", "external_api", "cache"]
        )

Handling Errors in the Agent Loop
The agent loop must append error results just like success results — the LLM needs to see the error to respond appropriately.
    import json
    import logging

    import openai

    client = openai.OpenAI()
    logger = logging.getLogger(__name__)


    def run_resilient_agent(user_message: str, tools: list, tool_map: dict) -> str:
        messages = [
            {
                "role": "system",
                "content": (
                    "You are a clinical assistant. When a tool returns an error with "
                    "'retry_suggested: true', retry the tool call once. "
                    "When a tool returns an error, explain the situation to the user "
                    "clearly and suggest next steps based on the 'hint' field."
                )
            },
            {"role": "user", "content": user_message}
        ]

        # Hard cap on iterations so a failing tool cannot loop forever
        for iteration in range(8):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=tools,
                tool_choice="auto"
            )
            msg = response.choices[0].message

            if not msg.tool_calls:
                return msg.content or ""

            messages.append(msg)

            for tc in msg.tool_calls:
                fn_name = tc.function.name

                try:
                    fn_args = json.loads(tc.function.arguments)
                except json.JSONDecodeError as e:
                    # The LLM returned malformed JSON arguments (rare, but possible)
                    result = {
                        "success": False,
                        "error": "Malformed tool arguments",
                        "detail": str(e),
                        "raw_arguments": tc.function.arguments
                    }
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result)
                    })
                    continue

                if fn_name not in tool_map:
                    result = {
                        "success": False,
                        "error": f"Tool '{fn_name}' not available",
                        "available_tools": list(tool_map.keys())
                    }
                else:
                    try:
                        result = tool_map[fn_name](**fn_args)
                    except Exception as e:
                        # Tools should return error dicts rather than raise. This is a
                        # safety net for wrong argument names (TypeError) and tool bugs.
                        logger.exception("Tool %s raised", fn_name)
                        result = {
                            "success": False,
                            "error": "Tool call failed",
                            "detail": str(e)
                        }

                # Error results are appended exactly like success results
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": json.dumps(result, default=str)
                })

        return "Unable to complete request: too many iterations."

Testing Error Paths
Error handling only works if you test it. Use mocking to simulate failures:
    import httpx
    import pytest
    from unittest.mock import patch, MagicMock

    # Module paths are inferred from the patch targets used below
    from tools.drug import fetch_drug_from_external_api
    from tools.patient import get_patient_record


    def test_get_patient_record_db_failure():
        """Tool should return error dict, not raise, when DB is down."""
        with patch("tools.patient.get_db_connection") as mock_conn:
            mock_conn.side_effect = ConnectionError("DB connection refused")
            result = get_patient_record("P-00123")

        assert result["success"] is False
        assert "Database temporarily unavailable" in result["error"]
        assert "hint" in result


    def test_get_patient_record_not_found():
        """Tool should return structured not-found error."""
        with patch("tools.patient.get_db_connection") as mock_conn:
            mock_cursor = MagicMock()
            mock_cursor.fetchone.return_value = None
            mock_conn.return_value.execute.return_value = mock_cursor
            result = get_patient_record("P-99999")

        assert result["success"] is False
        assert "not found" in result["error"].lower()
        assert result["patient_id"] == "P-99999"


    def test_fetch_drug_retries_on_timeout():
        """Tool should retry up to max_retries times on timeout."""
        call_count = 0

        with patch("tools.drug.httpx.Client") as mock_client_cls:
            def side_effect(*args, **kwargs):
                # Time out on the first two calls, succeed on the third
                nonlocal call_count
                call_count += 1
                if call_count < 3:
                    raise httpx.TimeoutException("Timeout")
                mock_response = MagicMock()
                mock_response.status_code = 200
                mock_response.json.return_value = {"drugGroup": {"conceptGroup": []}}
                return mock_response

            mock_client = MagicMock()
            mock_client.__enter__ = MagicMock(return_value=mock_client)
            mock_client.__exit__ = MagicMock(return_value=False)
            mock_client.get.side_effect = side_effect
            mock_client_cls.return_value = mock_client

            result = fetch_drug_from_external_api("Metformin", max_retries=3, base_delay=0)

        assert result["success"] is True
        assert result["attempts"] == 3

Common Error Categories and How to Handle Each
| Error Type | Strategy | LLM Hint |
|---|---|---|
| Not found | Return structured not-found error immediately | "Verify the ID/name and try again" |
| Validation failure | Return field-level errors before any I/O | "Correct these fields: ..." |
| Network timeout | Retry with backoff, then return error | "Service slow, try again later" |
| Rate limit (429) | Exponential backoff retry | "Try again in a few seconds" |
| Server error (5xx) | Retry up to 3 times | "Service error, retrying" |
| Client error (4xx) | Return immediately, no retry | "Check input parameters" |
| DB connection | Return error with IT contact hint | "Contact IT support" |
| Permission denied | Return clear access error | "You don't have access to this data" |
| Data quality | Return with warning, include partial data | "Data may be incomplete" |
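The validation row is the only category in the table without an example earlier in the article. A minimal sketch of checking arguments before any I/O follows. It assumes the P-NNNNN patient ID format from the hint text means "P-" plus five digits; validate_patient_args and PATIENT_ID_PATTERN are hypothetical names.

```python
import re
from typing import Optional

# Assumption: P-NNNNN means the letter P, a hyphen, and exactly five digits
PATIENT_ID_PATTERN = re.compile(r"P-\d{5}")


def validate_patient_args(patient_id: object) -> Optional[dict]:
    """Return an error dict if the arguments are invalid, otherwise None."""
    field_errors = {}

    if not isinstance(patient_id, str):
        field_errors["patient_id"] = "must be a string"
    elif not PATIENT_ID_PATTERN.fullmatch(patient_id):
        field_errors["patient_id"] = "must match P-NNNNN (e.g. P-00123)"

    if field_errors:
        return {
            "success": False,
            "error": "Validation failed",
            "field_errors": field_errors,
            "hint": "Correct these fields: " + ", ".join(field_errors),
        }
    return None
```

A tool would call this first and return the error dict right away if one comes back, so a malformed ID never reaches the database and the LLM learns exactly which field to fix.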
Summary
- Tools must never raise exceptions — always return a dict
- Use a consistent error format with success, error, and hint fields
- Implement retry logic for transient failures (timeout, 5xx) with exponential backoff
- Use fallback sources when the primary source is unavailable
- The agent loop must append error results, not skip them
- Add retry_suggested: True to error dicts when the LLM should retry automatically
- Test all error paths with mocking, because error handling you haven't tested doesn't work