Managing Context Window in Agents

The context window is your agent's working memory. Everything the agent knows about the current task must fit inside it — the system prompt, conversation history, tool results, and retrieved knowledge. When the context fills up, you have to make choices: truncate, summarize, or selectively retain.

Bad context management is one of the most common reasons production agents fail silently. The agent starts ignoring earlier parts of the conversation, contradicts itself, or loses track of the user's original goal.

This lesson gives you three concrete strategies and a working Python implementation.

Why Context Management Matters

Model context windows have grown dramatically — GPT-4o supports 128K tokens, Claude 3.7 supports 200K. But do not be fooled into thinking "bigger context = no problem":

Cost — Every token in the context costs money on every API call. A 50-turn conversation with tool results can easily reach 30,000 tokens. At $5 per million input tokens, that is $0.15 per call — multiplied by thousands of daily users.

Latency — Larger prompts take longer to process. Time-to-first-token grows roughly linearly with context size.

Attention degradation — Research shows LLMs pay less attention to information in the middle of very long contexts ("lost in the middle" problem). Critical information buried in a long history may be effectively ignored.

Hard limits — Even with large context windows, tool-heavy agents can exhaust the limit in surprisingly few iterations. A ReAct loop that calls five tools per iteration, each returning 500 tokens, consumes 2,500 tokens per iteration — exhausting a 128K window in about 50 iterations.
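
The cost and hard-limit figures above follow from simple arithmetic. The sketch below reproduces them using this section's illustrative numbers. The function names are hypothetical, and the $5 rate is the example price from the text, not a quoted price list.

```python
def cost_per_call(context_tokens: int, usd_per_million_input: float) -> float:
    """Input-token cost of a single API call, in USD."""
    return context_tokens * usd_per_million_input / 1_000_000


def iterations_until_full(window_tokens: int, tokens_per_iteration: int) -> int:
    """Whole iterations that fit before accumulated tool output fills the window.

    Ignores the system prompt and model replies, so the real number is lower.
    """
    return window_tokens // tokens_per_iteration


per_call = cost_per_call(30_000, 5.0)  # 30,000 tokens at $5 per million
print(f"Cost per call: ${per_call:.2f}")
print(f"Cost for 1,000 calls: ${per_call * 1_000:.2f}")
print(f"Iterations to fill 128K: {iterations_until_full(128_000, 5 * 500)}")
```

Because the whole history is resent on every call, the per-call cost applies to each turn, not once per conversation.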

Estimating Token Counts

Before implementing any strategy, you need to count tokens accurately:

Python

import tiktoken
from typing import List


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count the number of tokens in a text string."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def count_messages_tokens(messages: List[dict], model: str = "gpt-4o") -> int:
    """
    Count the total tokens in a messages list.
    Includes per-message overhead (approximately 4 tokens per message).
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for msg in messages:
        total += 4  # Per-message overhead: role, separators
        for key, value in msg.items():
            if isinstance(value, str):
                total += len(encoding.encode(value))
            elif isinstance(value, list):
                # Handle tool calls in assistant messages
                for item in value:
                    if isinstance(item, dict):
                        for v in item.values():
                            if isinstance(v, str):
                                total += len(encoding.encode(v))
    total += 2  # Final priming tokens
    return total


class TokenBudget:
    """
    Tracks token usage and provides warnings before limit is reached.
    """
    def __init__(self, model_limit: int, safety_margin: float = 0.15):
        """
        Args:
            model_limit: Maximum context window for the model
            safety_margin: Reserve this fraction for the response (default 15%)
        """
        self.model_limit = model_limit
        self.available = int(model_limit * (1 - safety_margin))

    def check(self, messages: List[dict], model: str = "gpt-4o") -> dict:
        """Check current token usage against the budget."""
        used = count_messages_tokens(messages, model)
        remaining = self.available - used
        utilization = used / self.available

        return {
            "used": used,
            "available": self.available,
            "remaining": remaining,
            "utilization": utilization,
            "is_over_budget": remaining < 0,
            "needs_compression": utilization > 0.75,
        }
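
To see how the budget thresholds behave without installing a tokenizer, the sketch below repeats the same arithmetic with a stand-in counter. `rough_count` (about four characters per token) and `check_budget` are hypothetical names used only for illustration. The heuristic is an assumption suited to a dry run, and production code should count with the real tokenizer as shown above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_count(messages: List[Message]) -> int:
    """Stand-in counter: ~4 characters per token plus 4 tokens of per-message overhead.

    This is an assumption for demonstration only, not a real tokenizer.
    """
    return sum(4 + len(m["content"]) // 4 for m in messages) + 2


def check_budget(
    messages: List[Message],
    model_limit: int,
    safety_margin: float = 0.15,
    counter: Callable[[List[Message]], int] = rough_count,
) -> dict:
    """Report the same fields as TokenBudget.check, using a pluggable counter."""
    available = int(model_limit * (1 - safety_margin))
    used = counter(messages)
    utilization = used / available
    return {
        "used": used,
        "available": available,
        "remaining": available - used,
        "utilization": utilization,
        "is_over_budget": used > available,
        "needs_compression": utilization > 0.75,
    }


history = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "x" * 3200},  # roughly 800 tokens under the heuristic
]
report = check_budget(history, model_limit=1000)
print(report)
```

With a deliberately small limit of 1,000 tokens, the history is still within budget but already past the 75% mark, so `needs_compression` is set before `is_over_budget` would be.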

Strategy 1: Rolling Window

The simplest strategy: keep only the most recent messages that fit within a token budget. When the total exceeds the limit, the oldest messages are dropped first. The system prompt is always kept.

Python

import openai

client = openai.OpenAI()


class RollingWindowContext:
    """
    Context manager that keeps the most recent messages that fit within
    max_tokens, plus the system prompt.
    Simple, predictable, but loses old context permanently.
    """

    def __init__(self, max_tokens: int = 8000, model: str = "gpt-4o-mini"):
        self.max_tokens = max_tokens
        self.model = model
        self.messages: List[dict] = []
        self._system_messages: List[dict] = []

    def add_system(self, content: str) -> None:
        """Add a system message (preserved through all truncation)."""
        msg = {"role": "system", "content": content}
        self._system_messages.append(msg)
        self.messages.append(msg)

    def add(self, role: str, content: str) -> None:
        """Add a user or assistant message."""
        self.messages.append({"role": role, "content": content})
        self._truncate()

    def _truncate(self) -> None:
        """Remove oldest non-system messages until under token limit."""
        while True:
            current_tokens = count_messages_tokens(self.messages, self.model)
            if current_tokens <= self.max_tokens:
                break

            # Find the oldest non-system message
            non_system_indices = [
                i for i, m in enumerate(self.messages)
                if m["role"] != "system"
            ]
            if not non_system_indices:
                break  # Cannot truncate further

            removed = self.messages.pop(non_system_indices[0])
            print(f"[Rolling window] Dropped: {removed['role']}: {removed['content'][:50]}...")

    def get(self) -> List[dict]:
        return self.messages.copy()

    @property
    def current_tokens(self) -> int:
        return count_messages_tokens(self.messages, self.model)
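
The truncation behaviour can be tried offline with a stand-in counter. In the sketch below, `word_count` and `truncate_oldest` are hypothetical helpers, and counting whitespace-separated words instead of tokens is an assumption made so the example runs without a tokenizer or an API key.

```python
from typing import Dict, List

Message = Dict[str, str]


def word_count(messages: List[Message]) -> int:
    """Stand-in for a tokenizer: counts whitespace-separated words (assumption)."""
    return sum(len(m["content"].split()) for m in messages)


def truncate_oldest(messages: List[Message], max_units: int) -> List[Message]:
    """Drop the oldest non-system messages until the count fits the budget."""
    kept = list(messages)
    while word_count(kept) > max_units:
        idx = next((i for i, m in enumerate(kept) if m["role"] != "system"), None)
        if idx is None:
            break  # only system messages remain; nothing more to drop
        kept.pop(idx)
    return kept


history = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "My order number is 1234."},
    {"role": "assistant", "content": "Thanks, I have noted it."},
    {"role": "user", "content": "When will it ship?"},
]
trimmed = truncate_oldest(history, max_units=11)
for m in trimmed:
    print(m["role"], "->", m["content"])
```

The order number from the first user message does not survive truncation, which is the permanent loss the rolling window trades for simplicity.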

Strategy 2: Hierarchical Summarization

Instead of dropping old messages, compress them into a summary. This preserves the gist of what was discussed without keeping every token.

Python

class SummarizingContext:
    """
    Context manager that summarizes old messages when approaching the token limit.
    Preserves more information than rolling window by compressing rather than discarding.
    """

    def __init__(
        self,
        token_limit: int = 12000,
        compression_threshold: float = 0.80,
        keep_recent: int = 6,
        model: str = "gpt-4o-mini",
        summary_model: str = "gpt-4o-mini",
    ):
        """
        Args:
            token_limit: Maximum tokens before triggering compression
            compression_threshold: Compress when utilization exceeds this (0.0-1.0)
            keep_recent: Always keep this many recent messages uncompressed
            model: Model for token counting
            summary_model: Model used to generate summaries
        """
        self.token_limit = token_limit
        self.compression_threshold = compression_threshold
        self.keep_recent = keep_recent
        self.model = model
        self.summary_model = summary_model
        self.messages: List[dict] = []
        self.compression_count = 0

    def add(self, role: str, content: str) -> None:
        """Add a message and compress if needed."""
        self.messages.append({"role": role, "content": content})

        utilization = count_messages_tokens(self.messages, self.model) / self.token_limit
        if utilization > self.compression_threshold:
            self._compress()

    def _compress(self) -> None:
        """
        Summarize the oldest messages, replacing them with a summary.
        Always keeps the system prompt and the most recent messages intact.
        """
        # Partition: system, compressible, recent
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        non_system = [m for m in self.messages if m["role"] != "system"]

        if len(non_system) <= self.keep_recent:
            return  # Not enough messages to compress

        to_compress = non_system[:-self.keep_recent]
        to_keep = non_system[-self.keep_recent:]

        if not to_compress:
            return

        # Build text for summarization
        conversation_text = "\n".join(
            f"{m['role'].upper()}: {m['content'][:500]}"
            for m in to_compress
        )

        summary_prompt = (
            f"Summarize the following conversation segment concisely. "
            f"Focus on: key decisions made, facts established, user goals, "
            f"and any important context. Be specific — preserve numbers, names, "
            f"and specific details that might be referenced later.\n\n"
            f"{conversation_text}"
        )

        response = client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=400,
            temperature=0,
        )
        summary_text = response.choices[0].message.content
        self.compression_count += 1

        summary_message = {
            "role": "system",
            "content": f"[Conversation summary #{self.compression_count}]: {summary_text}",
        }

        # Rebuild: system messages, summary, recent messages
        self.messages = system_msgs + [summary_message] + to_keep

        print(
            f"[Summarizer] Compressed {len(to_compress)} messages into summary. "
            f"New token count: {count_messages_tokens(self.messages, self.model)}"
        )

    def get(self) -> List[dict]:
        return self.messages.copy()
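
The message layout after one compression pass is easier to see with the model call stubbed out. The sketch below is an offline illustration under that assumption: `compress_history` and `fake_summarizer` are hypothetical names, and the stub only reports how many messages it received instead of producing a real summary.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compress_history(
    messages: List[Message],
    keep_recent: int,
    summarize: Callable[[str], str],
) -> List[Message]:
    """Replace all but the last `keep_recent` non-system messages with one summary."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    if keep_recent <= 0 or len(non_system) <= keep_recent:
        return list(messages)  # nothing old enough to compress

    old, recent = non_system[:-keep_recent], non_system[-keep_recent:]
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in old)
    summary = {
        "role": "system",
        "content": f"[Conversation summary]: {summarize(transcript)}",
    }
    return system_msgs + [summary] + recent


def fake_summarizer(transcript: str) -> str:
    """Placeholder for an LLM call: reports how many lines it was given."""
    return f"{len(transcript.splitlines())} earlier messages condensed."


history = [{"role": "system", "content": "You are a helpful research assistant."}]
history += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(8)
]

compressed = compress_history(history, keep_recent=3, summarize=fake_summarizer)
for m in compressed:
    print(m["role"], "->", m["content"])
```

Nine messages become five: the original system prompt, one summary message, and the three most recent turns verbatim.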

Strategy 3: Selective Memory

For task-focused agents, not every past message is equally relevant. Selective memory keeps messages that are relevant to the current task and drops irrelevant ones:

Python

class SelectiveContext:
    """
    Context manager that keeps messages relevant to a current task goal.
    Uses embedding similarity to determine relevance.
    """

    def __init__(
        self,
        task_goal: str,
        token_limit: int = 10000,
        relevance_threshold: float = 0.5,
        always_keep_last: int = 4,
    ):
        self.task_goal = task_goal
        self.token_limit = token_limit
        self.relevance_threshold = relevance_threshold
        self.always_keep_last = always_keep_last
        self.messages: List[dict] = []
        self._goal_embedding = self._embed(task_goal)

    def _embed(self, text: str) -> List[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text[:8000],  # Truncate very long texts
        )
        return response.data[0].embedding

    def _cosine_sim(self, a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x**2 for x in a) ** 0.5
        nb = sum(x**2 for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def _relevance_score(self, msg: dict) -> float:
        """Compute relevance of a message to the current task goal."""
        content = msg.get("content", "")
        if not content or msg["role"] == "system":
            return 1.0  # System messages are always relevant
        msg_embedding = self._embed(content[:1000])
        return self._cosine_sim(self._goal_embedding, msg_embedding)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

        token_count = count_messages_tokens(self.messages)
        if token_count > self.token_limit:
            self._prune()

    def _prune(self) -> None:
        """Remove low-relevance messages, keeping recent and system messages."""
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        non_system = [m for m in self.messages if m["role"] != "system"]

        if len(non_system) <= self.always_keep_last:
            return

        to_evaluate = non_system[:-self.always_keep_last]
        always_keep = non_system[-self.always_keep_last:]

        # Score and filter
        scored = [
            (self._relevance_score(m), m)
            for m in to_evaluate
        ]
        kept = [m for score, m in scored if score >= self.relevance_threshold]
        dropped = len(to_evaluate) - len(kept)

        self.messages = system_msgs + kept + always_keep
        if dropped > 0:
            print(f"[Selective] Pruned {dropped} low-relevance messages.")

    def get(self) -> List[dict]:
        return self.messages.copy()
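
The relevance filter itself can be exercised without the embeddings API. In the sketch below, the three-dimensional vectors are hand-written stand-ins for real embeddings (an assumption for illustration), and `Candidate` and `filter_relevant` are hypothetical names.

```python
import math
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    content: str
    vec: List[float]  # hand-written stand-in for an embedding (assumption)


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def filter_relevant(
    goal_vec: Sequence[float],
    candidates: List[Candidate],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep candidates whose similarity to the goal meets the threshold."""
    return [c for c in candidates if cosine_sim(goal_vec, c.vec) >= threshold]


goal_vec = [1.0, 0.0, 0.0]  # pretend embedding of the goal "process a refund"
candidates = [
    Candidate("Customer asked about the refund policy.", [0.9, 0.1, 0.0]),
    Candidate("Small talk about the weather.", [0.0, 1.0, 0.0]),
    Candidate("Billing address was updated.", [0.6, 0.8, 0.0]),
]

kept = filter_relevant(goal_vec, candidates)
for c in kept:
    print(f"{cosine_sim(goal_vec, c.vec):.2f}  {c.content}")
```

The weather message scores zero against the goal and is dropped, while the two task-related messages clear the 0.5 threshold. With real embeddings the scores cluster more tightly, so the threshold usually needs tuning on your own data.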

Unified Context Manager

Python

from enum import Enum


class ContextStrategy(Enum):
    ROLLING = "rolling"
    SUMMARIZING = "summarizing"
    SELECTIVE = "selective"


def create_context_manager(
    strategy: ContextStrategy,
    token_limit: int = 10000,
    task_goal: str = "",
    system_prompt: str = "",
):
    """
    Factory function to create the appropriate context manager.
    """
    if strategy == ContextStrategy.ROLLING:
        ctx = RollingWindowContext(max_tokens=token_limit)
        if system_prompt:
            ctx.add_system(system_prompt)
        return ctx
    elif strategy == ContextStrategy.SUMMARIZING:
        ctx = SummarizingContext(token_limit=token_limit)
        if system_prompt:
            ctx.add("system", system_prompt)
        return ctx
    elif strategy == ContextStrategy.SELECTIVE:
        ctx = SelectiveContext(
            task_goal=task_goal or system_prompt,
            token_limit=token_limit,
        )
        if system_prompt:
            ctx.messages.append({"role": "system", "content": system_prompt})
        return ctx
    else:
        raise ValueError(f"Unknown strategy: {strategy}")


def run_agent_with_context_management(
    user_query: str,
    strategy: ContextStrategy = ContextStrategy.SUMMARIZING,
    max_turns: int = 20,
) -> str:
    """
    Example agent loop with context management.
    """
    ctx = create_context_manager(
        strategy=strategy,
        token_limit=8000,
        task_goal=user_query,  # Only used by the SELECTIVE strategy
        system_prompt="You are a helpful research assistant.",
    )

    ctx.add("user", user_query)

    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=ctx.get(),
        )
        reply = response.choices[0].message.content
        ctx.add("assistant", reply)

        # In a real agent, check for task completion here
        if "TASK COMPLETE" in reply or turn == max_turns - 1:
            return reply

    return "Max turns reached."

Which Strategy to Use

| Scenario | Recommended Strategy |
|---|---|
| Short conversations (under 20 turns) | Rolling window (simplest) |
| Long research sessions | Summarizing (preserves gist) |
| Task-focused agents with a clear goal | Selective (most efficient) |
| Customer support with history | Summarizing + episodic memory |
| Code debugging sessions | Rolling window (recent context dominates) |

Summary

The context window is your agent's working memory — manage it deliberately
Count tokens explicitly before every API call; do not guess
Rolling window is simple but lossy — good for short sessions
Hierarchical summarization preserves gist — better for long research sessions
Selective memory keeps only relevant messages — best for focused task agents
Reserve 15% of the context for the model's response
Track token utilization as a metric in production — rising utilization signals context management issues

Next: the Interview module on agent memory and context management questions.
