Managing Context Window in Agents
Keep long-running agents effective as message history grows — using rolling windows, hierarchical summarization, and selective memory strategies with token budget tracking.
The context window is your agent's working memory. Everything the agent knows about the current task must fit inside it — the system prompt, conversation history, tool results, and retrieved knowledge. When the context fills up, you have to make choices: truncate, summarize, or selectively retain.
Bad context management is one of the most common reasons production agents fail silently. The agent starts ignoring earlier parts of the conversation, contradicts itself, or loses track of the user's original goal.
This lesson gives you three concrete strategies and a working Python implementation.
Why Context Management Matters
Model context windows have grown dramatically: GPT-4o supports 128K tokens and Claude 3.7 Sonnet supports 200K. But a bigger context window does not make the problem go away:
- Cost: Every token in the context is billed on every API call. A 50-turn conversation with tool results can easily reach 30,000 tokens. At $5 per million input tokens, that is $0.15 per call, repeated for every call made by thousands of daily users.
- Latency: Larger prompts take longer to process. Time-to-first-token grows roughly linearly with context size.
- Attention degradation: Research shows LLMs pay less attention to information in the middle of very long contexts (the "lost in the middle" problem). Critical information buried in a long history may be effectively ignored.
- Hard limits: Even with large context windows, tool-heavy agents can exhaust the limit in surprisingly few iterations. A ReAct loop that calls five tools per iteration, each returning 500 tokens, adds 2,500 tokens of tool results per iteration, exhausting a 128K window in about 50 iterations.
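The cost and hard-limit figures above are plain arithmetic, so it can be handy to keep them as small helpers and rerun them with your own prices and tool sizes. The sketch below is illustrative only: `cost_per_call` and `iterations_until_full` are hypothetical names, and the $5 per million input tokens rate is the example rate from the text, not a quoted price.

```python
def cost_per_call(context_tokens: int, usd_per_million_input: float) -> float:
    """Input-token cost of one API call that sends `context_tokens` tokens."""
    return context_tokens * usd_per_million_input / 1_000_000


def iterations_until_full(
    window_tokens: int,
    tools_per_iteration: int,
    tokens_per_tool_result: int,
    fixed_overhead_tokens: int = 0,
) -> int:
    """Whole loop iterations that fit before tool results fill the window."""
    per_iteration = tools_per_iteration * tokens_per_tool_result
    return (window_tokens - fixed_overhead_tokens) // per_iteration


# The two examples from the text: a 30,000-token context at $5 per million
# input tokens, and a 128K window filled by five 500-token tool results per loop.
print(f"${cost_per_call(30_000, 5.0):.2f} per call")
print(iterations_until_full(128_000, 5, 500), "iterations")
```

The `fixed_overhead_tokens` parameter is an assumption added here to account for the system prompt and the initial user request, which also occupy the window.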
Estimating Token Counts
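Before implementing any strategy, you need a way to count tokens. The helpers below use tiktoken, OpenAI's tokenizer library, so the counts are close for OpenAI models and only rough estimates for other providers' models: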
```python
import tiktoken
from typing import List


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count the number of tokens in a text string."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def count_messages_tokens(messages: List[dict], model: str = "gpt-4o") -> int:
    """
    Count the total tokens in a messages list.
    Includes per-message overhead (approximately 4 tokens per message).
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for msg in messages:
        total += 4  # Per-message overhead: role, separators
        for key, value in msg.items():
            if isinstance(value, str):
                total += len(encoding.encode(value))
            elif isinstance(value, list):
                # Handle tool calls in assistant messages
                for item in value:
                    if isinstance(item, dict):
                        for v in item.values():
                            if isinstance(v, str):
                                total += len(encoding.encode(v))
    total += 2  # Final priming tokens
    return total
```
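To make the overhead bookkeeping in `count_messages_tokens` easier to follow, the sketch below reproduces the same arithmetic with a stand-in tokenizer so it runs without any dependencies. `whitespace_tokens` and `count_with_overhead` are hypothetical names, and counting one token per whitespace-separated word is only an assumption for illustration; it is not how tiktoken tokenizes text.

```python
from typing import Callable, Dict, List


def whitespace_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def count_with_overhead(
    messages: List[Dict[str, str]],
    count: Callable[[str], int] = whitespace_tokens,
    per_message: int = 4,
    priming: int = 2,
) -> int:
    """Tokens in every string field, plus fixed per-message and per-request overhead."""
    total = priming
    for msg in messages:
        total += per_message
        # Like count_messages_tokens, this counts the role string as well as the content
        total += sum(count(v) for v in msg.values() if isinstance(v, str))
    return total


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this report for me"},
]
# system: 4 overhead + 1 (role) + 5 (content) = 10
# user:   4 overhead + 1 (role) + 5 (content) = 10
# plus 2 priming tokens for the reply
print(count_with_overhead(messages))  # 22
```

For short messages the fixed overhead is a large share of the total, which is why it is worth including in the count rather than summing content tokens alone.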
```python
class TokenBudget:
    """
    Tracks token usage and provides warnings before limit is reached.
    """

    def __init__(self, model_limit: int, safety_margin: float = 0.15):
        """
        Args:
            model_limit: Maximum context window for the model
            safety_margin: Reserve this fraction for the response (default 15%)
        """
        self.model_limit = model_limit
        self.available = int(model_limit * (1 - safety_margin))

    def check(self, messages: List[dict], model: str = "gpt-4o") -> dict:
        """Check current token usage against the budget."""
        used = count_messages_tokens(messages, model)
        remaining = self.available - used
        utilization = used / self.available
        return {
            "used": used,
            "available": self.available,
            "remaining": remaining,
            "utilization": utilization,
            "is_over_budget": remaining < 0,
            "needs_compression": utilization > 0.75,
        }
```

Strategy 1: Rolling Window
The simplest strategy: keep only the most recent messages that fit within a token limit. When the history exceeds the limit, the oldest messages are dropped first. The system prompt is always kept.
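The dropping rule can be seen in isolation in the following dependency-free sketch, which may also be convenient for unit tests. It assumes a crude word-count estimator, `approx_tokens`, in place of a real tokenizer; both that function and `rolling_window` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(message: Message) -> int:
    """Crude stand-in (assumption): one token per word, plus 4 per message."""
    return 4 + len(message["content"].split())


def rolling_window(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[Message], int] = approx_tokens,
) -> List[Message]:
    """Drop the oldest non-system messages until the total fits max_tokens."""
    kept = list(messages)
    while sum(count(m) for m in kept) > max_tokens:
        oldest = next((i for i, m in enumerate(kept) if m["role"] != "system"), None)
        if oldest is None:
            break  # Only system messages remain; nothing more to drop
        kept.pop(oldest)
    return kept


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My order number is 4417."},
    {"role": "assistant", "content": "Thanks, I found order 4417."},
    {"role": "user", "content": "When will it ship?"},
]
trimmed = rolling_window(history, max_tokens=25)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

After trimming, the order number from the first user turn is gone. That is the lossiness this strategy trades for simplicity.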
```python
import openai

client = openai.OpenAI()


class RollingWindowContext:
    """
    Context manager that keeps the last N messages plus the system prompt.
    Simple, predictable, but loses old context permanently.
    """

    def __init__(self, max_tokens: int = 8000, model: str = "gpt-4o-mini"):
        self.max_tokens = max_tokens
        self.model = model
        self.messages: List[dict] = []
        self._system_messages: List[dict] = []

    def add_system(self, content: str) -> None:
        """Add a system message (preserved through all truncation)."""
        msg = {"role": "system", "content": content}
        self._system_messages.append(msg)
        self.messages.append(msg)

    def add(self, role: str, content: str) -> None:
        """Add a user or assistant message."""
        self.messages.append({"role": role, "content": content})
        self._truncate()

    def _truncate(self) -> None:
        """Remove oldest non-system messages until under token limit."""
        while True:
            current_tokens = count_messages_tokens(self.messages, self.model)
            if current_tokens <= self.max_tokens:
                break
            # Find the oldest non-system message
            non_system_indices = [
                i for i, m in enumerate(self.messages)
                if m["role"] != "system"
            ]
            if not non_system_indices:
                break  # Cannot truncate further
            removed = self.messages.pop(non_system_indices[0])
            print(f"[Rolling window] Dropped: {removed['role']}: {removed['content'][:50]}...")

    def get(self) -> List[dict]:
        return self.messages.copy()

    @property
    def current_tokens(self) -> int:
        return count_messages_tokens(self.messages, self.model)
```

Strategy 2: Hierarchical Summarization
Instead of dropping old messages, compress them into a summary. This preserves the gist of what was discussed without keeping every token.
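The core move is a three-way partition: system messages, older messages to compress, and recent messages to keep verbatim. The sketch below shows that partition with the summarizer passed in as a plain function, so it runs without an API key. `compress_history` and `fake_summarize` are hypothetical names, and `fake_summarize` is only a placeholder for a real LLM call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compress_history(
    messages: List[Message],
    keep_recent: int,
    summarize: Callable[[List[Message]], str],
) -> List[Message]:
    """Replace all but the last keep_recent non-system messages with one summary."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if keep_recent <= 0 or len(rest) <= keep_recent:
        return list(messages)  # Nothing old enough to compress
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "system", "content": f"[Conversation summary]: {summarize(old)}"}
    return system + [summary] + recent


def fake_summarize(old: List[Message]) -> str:
    """Placeholder for an LLM call (assumption): reports count and first message."""
    return f"{len(old)} earlier messages, starting with: {old[0]['content'][:40]}"


history = [
    {"role": "system", "content": "You are a research assistant."},
    {"role": "user", "content": "Compare vendor A and vendor B pricing."},
    {"role": "assistant", "content": "Vendor A is $40 per seat, vendor B is $55."},
    {"role": "user", "content": "Which one supports SSO?"},
    {"role": "assistant", "content": "Both do, but vendor B charges extra."},
]
compressed = compress_history(history, keep_recent=2, summarize=fake_summarize)
print([m["role"] for m in compressed])  # ['system', 'system', 'user', 'assistant']
```

Passing the summarizer as a parameter is a design choice made for this sketch: it keeps the partition logic testable and lets a cheaper model be swapped in for summarization.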
```python
class SummarizingContext:
    """
    Context manager that summarizes old messages when approaching the token limit.
    Preserves more information than rolling window by compressing rather than discarding.
    """

    def __init__(
        self,
        token_limit: int = 12000,
        compression_threshold: float = 0.80,
        keep_recent: int = 6,
        model: str = "gpt-4o-mini",
        summary_model: str = "gpt-4o-mini",
    ):
        """
        Args:
            token_limit: Maximum tokens before triggering compression
            compression_threshold: Compress when utilization exceeds this (0.0-1.0)
            keep_recent: Always keep this many recent messages uncompressed
            model: Model for token counting
            summary_model: Model used to generate summaries
        """
        self.token_limit = token_limit
        self.compression_threshold = compression_threshold
        self.keep_recent = keep_recent
        self.model = model
        self.summary_model = summary_model
        self.messages: List[dict] = []
        self.compression_count = 0

    def add(self, role: str, content: str) -> None:
        """Add a message and compress if needed."""
        self.messages.append({"role": role, "content": content})
        utilization = count_messages_tokens(self.messages, self.model) / self.token_limit
        if utilization > self.compression_threshold:
            self._compress()

    def _compress(self) -> None:
        """
        Summarize the oldest messages, replacing them with a summary.
        Always keeps the system prompt and the most recent messages intact.
        """
        # Partition: system, compressible, recent
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        non_system = [m for m in self.messages if m["role"] != "system"]
        if len(non_system) <= self.keep_recent:
            return  # Not enough messages to compress
        to_compress = non_system[:-self.keep_recent]
        to_keep = non_system[-self.keep_recent:]
        if not to_compress:
            return

        # Build text for summarization
        conversation_text = "\n".join(
            f"{m['role'].upper()}: {m['content'][:500]}"
            for m in to_compress
        )
        summary_prompt = (
            f"Summarize the following conversation segment concisely. "
            f"Focus on: key decisions made, facts established, user goals, "
            f"and any important context. Be specific: preserve numbers, names, "
            f"and specific details that might be referenced later.\n\n"
            f"{conversation_text}"
        )
        response = client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=400,
            temperature=0,
        )
        summary_text = response.choices[0].message.content
        self.compression_count += 1
        summary_message = {
            "role": "system",
            "content": f"[Conversation summary #{self.compression_count}]: {summary_text}",
        }

        # Rebuild: system messages, summary, recent messages
        self.messages = system_msgs + [summary_message] + to_keep
        print(
            f"[Summarizer] Compressed {len(to_compress)} messages into summary. "
            f"New token count: {count_messages_tokens(self.messages, self.model)}"
        )

    def get(self) -> List[dict]:
        return self.messages.copy()
```

Strategy 3: Selective Memory
For task-focused agents, not every past message is equally relevant. Selective memory keeps messages that are relevant to the current task and drops irrelevant ones:
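The filtering step can be illustrated without an embedding API by scoring messages with a toy bag-of-words vector. This is a sketch under stated assumptions: `bow_vector`, `cosine`, and `prune_by_relevance` are hypothetical names, and word counts are a deliberately crude stand-in for real embeddings, which capture meaning rather than exact word overlap.

```python
import math
from collections import Counter
from typing import Dict, List

Message = Dict[str, str]


def bow_vector(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_by_relevance(
    messages: List[Message], goal: str, threshold: float, always_keep_last: int
) -> List[Message]:
    """Keep system messages, the most recent messages, and older ones similar to the goal."""
    goal_vec = bow_vector(goal)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    split = max(len(rest) - always_keep_last, 0)
    older, recent = rest[:split], rest[split:]
    kept = [m for m in older if cosine(goal_vec, bow_vector(m["content"])) >= threshold]
    return system + kept + recent


history = [
    {"role": "system", "content": "you are a travel assistant"},
    {"role": "user", "content": "what is the weather like today"},
    {"role": "assistant", "content": "it is sunny and warm today"},
    {"role": "user", "content": "i want to book a flight to tokyo in may"},
    {"role": "assistant", "content": "which airline do you prefer"},
    {"role": "user", "content": "whichever is cheapest"},
]
result = prune_by_relevance(
    history, goal="book a flight to tokyo", threshold=0.3, always_keep_last=2
)
for m in result:
    print(m["role"], "|", m["content"])
```

The two weather messages share no words with the goal and are pruned, while the booking request and the last two messages survive. With real embeddings the threshold needs tuning per model, since similarity scores are distributed differently.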
```python
class SelectiveContext:
    """
    Context manager that keeps messages relevant to a current task goal.
    Uses embedding similarity to determine relevance.
    """

    def __init__(
        self,
        task_goal: str,
        token_limit: int = 10000,
        relevance_threshold: float = 0.5,
        always_keep_last: int = 4,
    ):
        self.task_goal = task_goal
        self.token_limit = token_limit
        self.relevance_threshold = relevance_threshold
        self.always_keep_last = always_keep_last
        self.messages: List[dict] = []
        self._goal_embedding = self._embed(task_goal)

    def _embed(self, text: str) -> List[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text[:8000],  # Truncate very long texts
        )
        return response.data[0].embedding

    def _cosine_sim(self, a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x**2 for x in a) ** 0.5
        nb = sum(x**2 for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def _relevance_score(self, msg: dict) -> float:
        """Compute relevance of a message to the current task goal."""
        content = msg.get("content", "")
        if not content or msg["role"] == "system":
            return 1.0  # System messages are always relevant
        msg_embedding = self._embed(content[:1000])
        return self._cosine_sim(self._goal_embedding, msg_embedding)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        token_count = count_messages_tokens(self.messages)
        if token_count > self.token_limit:
            self._prune()

    def _prune(self) -> None:
        """Remove low-relevance messages, keeping recent and system messages."""
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        non_system = [m for m in self.messages if m["role"] != "system"]
        if len(non_system) <= self.always_keep_last:
            return
        to_evaluate = non_system[:-self.always_keep_last]
        always_keep = non_system[-self.always_keep_last:]

        # Score and filter
        scored = [
            (self._relevance_score(m), m)
            for m in to_evaluate
        ]
        kept = [m for score, m in scored if score >= self.relevance_threshold]
        dropped = len(to_evaluate) - len(kept)
        self.messages = system_msgs + kept + always_keep
        if dropped > 0:
            print(f"[Selective] Pruned {dropped} low-relevance messages.")

    def get(self) -> List[dict]:
        return self.messages.copy()
```

Unified Context Manager
```python
from enum import Enum


class ContextStrategy(Enum):
    ROLLING = "rolling"
    SUMMARIZING = "summarizing"
    SELECTIVE = "selective"


def create_context_manager(
    strategy: ContextStrategy,
    token_limit: int = 10000,
    task_goal: str = "",
    system_prompt: str = "",
):
    """
    Factory function to create the appropriate context manager.
    """
    if strategy == ContextStrategy.ROLLING:
        ctx = RollingWindowContext(max_tokens=token_limit)
        if system_prompt:
            ctx.add_system(system_prompt)
        return ctx
    elif strategy == ContextStrategy.SUMMARIZING:
        ctx = SummarizingContext(token_limit=token_limit)
        if system_prompt:
            # add() takes (role, content), not a message dict
            ctx.add("system", system_prompt)
        return ctx
    elif strategy == ContextStrategy.SELECTIVE:
        ctx = SelectiveContext(
            task_goal=task_goal or system_prompt,
            token_limit=token_limit,
        )
        if system_prompt:
            ctx.messages.append({"role": "system", "content": system_prompt})
        return ctx
    else:
        raise ValueError(f"Unknown strategy: {strategy}")


def run_agent_with_context_management(
    user_query: str,
    strategy: ContextStrategy = ContextStrategy.SUMMARIZING,
    max_turns: int = 20,
) -> str:
    """
    Example agent loop with context management.
    """
    ctx = create_context_manager(
        strategy=strategy,
        token_limit=8000,
        system_prompt="You are a helpful research assistant.",
    )
    ctx.add("user", user_query)
    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=ctx.get(),
        )
        reply = response.choices[0].message.content
        ctx.add("assistant", reply)
        # In a real agent, check for task completion here
        if "TASK COMPLETE" in reply or turn == max_turns - 1:
            return reply
    return "Max turns reached."
```

Which Strategy to Use
| Scenario | Recommended Strategy |
|---|---|
| Short conversations (under 20 turns) | Rolling window (simplest) |
| Long research sessions | Summarizing (preserves gist) |
| Task-focused agents with a clear goal | Selective (most efficient) |
| Customer support with history | Summarizing + episodic memory |
| Code debugging sessions | Rolling window (recent context dominates) |
Summary
- The context window is your agent's working memory — manage it deliberately
- Count tokens explicitly before every API call; do not guess
- Rolling window is simple but lossy — good for short sessions
- Hierarchical summarization preserves gist — better for long research sessions
- Selective memory keeps only relevant messages — best for focused task agents
- Reserve 15% of the context for the model's response
- Track token utilization as a metric in production — rising utilization signals context management issues
Next: the Interview module on agent memory and context management questions.