Interview: Agent Memory and Context Questions
Ten Q&A pairs covering agent memory types, context window strategies, and state persistence — the questions interviewers actually ask for agentic AI engineering roles.
Agentic AI engineering interviews increasingly focus on memory architecture and context management — the unsexy but critical infrastructure that separates demo agents from production agents. These questions test whether you understand the constraints of LLM context windows and the tradeoffs between different memory strategies.
Q1: What are the four types of agent memory and when would you use each?
Answer:
The four memory types mirror cognitive science categories:
In-context memory is the current message history passed to the LLM on every call. Use it for everything in the current conversation — it has zero retrieval latency and perfect fidelity. The constraint is the context window: as history grows, you must manage it.
Episodic memory stores records of past conversations or events, indexed by semantic embeddings and retrieved by similarity search. Use it to give agents continuity across sessions — "you told me last week that your budget is $10,000." Implemented with a vector database like Pinecone or pgvector.
Semantic memory stores general factual knowledge — product docs, company policies, medical guidelines — that the agent draws on across all users. It is the agent's knowledge base. Also embedding-indexed for similarity retrieval.
Procedural memory encodes "how to behave" — in model weights (via fine-tuning) or in reusable prompt templates. Use it for consistent formatting, domain-specific tone, or workflows that must be applied reliably.
Most production agents combine in-context with one or two external memory types. A customer support agent, for example, uses in-context for the current conversation, episodic for customer history, and semantic for the policy knowledge base.
Q2: An agent is reaching the context window limit after 20 turns. What are your options?
Answer:
Three main strategies, each with different tradeoffs:
Rolling window — Keep only the last N messages (or last N tokens). Simple and predictable, but permanently discards early context. Good for agents where recent context dominates (coding, debugging).
Hierarchical summarization — When approaching the limit, compress the oldest messages into a summary and replace them with the summary. Better than dropping because it preserves the gist. The summary goes in as a system message at the top of the remaining history. Good for research and support conversations.
Selective retention — Use embedding similarity to score each message's relevance to the current task goal. Drop low-relevance messages first. Efficient but requires an embedding call per message — adds latency.
In practice, I combine strategies: rolling window for tool result messages (they tend to be large but transient), summarization for the user/assistant dialogue (high semantic value), and always preserve system messages.
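The combined approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `estimate_tokens` is a crude stand-in for a real tokenizer, and `summarize` is a hypothetical callable that would normally wrap an LLM call.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a real tokenizer."""
    return max(1, len(text) // 4)


def compact_history(
    messages: list[Message],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Keep system messages, summarize old dialogue, keep the last `keep_recent` messages."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= max_tokens:
        return messages  # under budget: nothing to do

    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if keep_recent > 0:
        old, recent = dialogue[:-keep_recent], dialogue[-keep_recent:]
    else:
        old, recent = dialogue, []
    if not old:
        return messages  # nothing old enough to compress

    # Tool results are treated as transient: old ones are dropped, not summarized.
    old_dialogue = [m for m in old if m["role"] != "tool"]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old_dialogue),
    }
    return system + [summary] + recent
```

Running the compaction only when the estimate exceeds the budget keeps the summarizer off the hot path for short sessions.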
Q3: What is the "lost in the middle" problem and how does it affect agent design?
Answer:
Research by Liu et al. (2023) showed that LLMs perform significantly worse at using information located in the middle of long contexts than information at the beginning or end. Accuracy follows a U-shaped curve over position: content near the start and end of the context is used reliably, while content in the middle is more likely to be missed.
For agent design this means:
- Put critical instructions at the edges: Critical instructions belong either in the system prompt (top) or in the most recent messages (bottom), not buried in old conversation history.
- Summarize middle content: When compressing context, pull key facts from the middle into a system-level summary that goes at the top.
- Retrieval over recall: Instead of putting everything in context and hoping the model finds it, use retrieval-augmented generation with vector search. Retrieve only the most relevant chunks and inject them near the current user message.
- Test with long contexts: Benchmark your agent at various history lengths. Quality can degrade well before the advertised limit, for example after about 50,000 tokens on a model with a 128K context window, so measure rather than assume.
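One way to apply these placement rules is to assemble the prompt in a fixed order. The sketch below is illustrative only; `build_messages` and its parameters are hypothetical names, and the retrieved chunks are assumed to come from whatever vector search you already run.

```python
Message = dict[str, str]


def build_messages(
    system_prompt: str,
    summary: str,
    recent_history: list[Message],
    retrieved_chunks: list[str],
    user_message: str,
) -> list[Message]:
    """Place instructions and summary at the top, retrieved chunks next to the question."""
    messages: list[Message] = [{"role": "system", "content": system_prompt}]
    if summary:
        # Key facts pulled out of the middle of the history go near the top.
        messages.append({"role": "system", "content": f"Key facts so far: {summary}"})
    messages.extend(recent_history)

    if retrieved_chunks:
        numbered = "\n".join(
            f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
        )
        content = f"Relevant context:\n{numbered}\n\nQuestion: {user_message}"
    else:
        content = user_message
    # Retrieved context sits in the final message, right beside the question.
    messages.append({"role": "user", "content": content})
    return messages
```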
Q4: How do you implement persistent agent state across multiple user sessions?
Answer:
Session state needs to survive process restarts, so it cannot live only in memory. The architecture depends on what you are persisting:
Conversation history: Store messages in a database keyed by (user_id, session_id). On session start, load the last N messages. I use PostgreSQL with a messages table. For long sessions, load only a summary (generated at session end) into context rather than the full message history; the full messages can stay in the database for later lookup.
User preferences and facts: Extract key facts during conversations ("the user mentioned they prefer Python", "budget is $50,000") and store them as structured records or embeddings in the user's profile table.
Task state for long-running agents: Store the current plan, completed steps, and intermediate results in a task_runs table. Include a status column (pending/in_progress/completed/failed). This allows agents to resume after crashes.
Memory retrieval on session start: On every new session, query episodic memory for the top-3 most relevant past conversations (embedding search against the new user message), and inject them as context.
The pattern is: serialize what matters, retrieve what is relevant, do not inject everything.
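A minimal sketch of the messages and task_runs tables follows. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs anywhere; the column names beyond those mentioned above and the helper function names are assumptions made for illustration.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    session_id TEXT NOT NULL,
    role TEXT NOT NULL,
    content TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS task_runs (
    task_id TEXT PRIMARY KEY,
    status TEXT NOT NULL
        CHECK (status IN ('pending', 'in_progress', 'completed', 'failed')),
    state_json TEXT NOT NULL
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_message(conn, user_id: str, session_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (user_id, session_id, role, content) VALUES (?, ?, ?, ?)",
        (user_id, session_id, role, content),
    )
    conn.commit()


def load_recent_messages(conn, user_id: str, session_id: str, limit: int) -> list[dict]:
    """Return the last `limit` messages for one session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE user_id = ? AND session_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, session_id, limit),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in reversed(rows)]


def save_task_state(conn, task_id: str, status: str, state: dict) -> None:
    """Insert or overwrite the checkpoint for a long-running task."""
    conn.execute(
        "INSERT OR REPLACE INTO task_runs (task_id, status, state_json) VALUES (?, ?, ?)",
        (task_id, status, json.dumps(state)),
    )
    conn.commit()


def load_task_state(conn, task_id: str):
    """Return (status, state) for a task, or None if it was never checkpointed."""
    row = conn.execute(
        "SELECT status, state_json FROM task_runs WHERE task_id = ?", (task_id,)
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

Checkpointing after every completed step, rather than only at the end, is what makes resume-after-crash possible.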
Q5: What is the difference between episodic and semantic memory in an agent context?
Answer:
The distinction is about what the memory represents:
Episodic memory stores specific events with time and context — "On March 3rd, user Alice asked about the refund policy and received a $50 credit." It is autobiographical. You retrieve episodes when you need to know "what happened before in this context."
Semantic memory stores general, timeless knowledge — "The refund policy allows returns within 30 days for a full refund." It is factual. You retrieve semantic facts when you need to answer "what is true."
Both use embedding vectors and similarity search for retrieval, but they serve different purposes. If a customer asks "can I return this?", you retrieve semantic memory (the return policy). If the same customer calls again and asks "what happened with my last return request?", you retrieve episodic memory.
In implementation terms, they often use the same infrastructure (vector database) but different collections with different metadata schemas. Episodic records have user_id, timestamp, and session_id. Semantic records have source, category, and last_updated.
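The two metadata schemas might look like the following. This is only a sketch: the metadata field names come from the paragraph above, while the `text` and `embedding` fields and the `episodes_for_user` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EpisodicRecord:
    """A specific event: who, when, and in which session."""
    text: str
    embedding: list[float]
    user_id: str
    session_id: str
    timestamp: datetime


@dataclass
class SemanticRecord:
    """A general fact shared across all users."""
    text: str
    embedding: list[float]
    source: str
    category: str
    last_updated: datetime


def episodes_for_user(records: list[EpisodicRecord], user_id: str) -> list[EpisodicRecord]:
    """Filter by user before any similarity search, newest episode first."""
    own = [r for r in records if r.user_id == user_id]
    return sorted(own, key=lambda r: r.timestamp, reverse=True)
```

Filtering episodic records on user_id before ranking by similarity keeps one customer's history from surfacing in another customer's session.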
Q6: How do you prevent an agent from losing track of the original task goal in a long conversation?
Answer:
Goal drift is a real problem in ReAct agents — the agent gets absorbed in a sub-task and forgets the original objective. Several mitigations:
Explicit goal anchoring — Include the original user goal in the system prompt. Every LLM call sees it. Phrase it prominently: "Your task: . Do not return until this task is complete."
Goal reminder injection — Every N iterations, inject a reminder message: "Reminder: the original task is . Are you making progress toward it?" This re-anchors the model.
Task state object — Maintain a structured Python object tracking the goal, current plan, completed steps, and remaining steps. Include a JSON representation of this state in every prompt so the model can check its own progress.
Termination conditions tied to goal — Define success criteria for the original goal, not just "the model stopped generating tool calls." The agent should check: "Does my current answer address the original goal?" before stopping.
```python
class TaskTracker:
    """Tracks the original goal and which success criteria are already met."""

    def __init__(self, goal: str, success_criteria: list[str]):
        self.goal = goal
        self.success_criteria = success_criteria
        self.completed_criteria = []
        self.steps_taken = []

    def to_context_string(self) -> str:
        # Criteria not yet marked complete are what the agent still has to do.
        remaining = [c for c in self.success_criteria if c not in self.completed_criteria]
        return (
            f"Original goal: {self.goal}\n"
            f"Completed: {self.completed_criteria}\n"
            f"Still needed: {remaining}"
        )
```

Q7: How do token costs scale in a multi-turn agent session, and how do you control them?
Answer:
In a naive implementation, costs grow quadratically. Without any context management, every API call resends all previous messages. Turn 1 sends T1 tokens. Turn 2 sends T1 + T2 tokens. Turn N sends the sum of all turns so far. For a 50-turn session with 500 tokens per turn, the total input is roughly (50 * 51 / 2) * 500 = 637,500 tokens, about 25 times the 25,000 tokens of actual content.
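The arithmetic for the case where every call resends the full history can be checked with a few lines of Python. The function names are illustrative, and the windowed variant assumes a simple "last N turns" policy for comparison.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when call k resends all k turns seen so far."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def windowed_input_tokens(turns: int, tokens_per_turn: int, window: int) -> int:
    """Total input tokens when each call sends at most the last `window` turns."""
    return sum(min(k, window) * tokens_per_turn for k in range(1, turns + 1))


total = cumulative_input_tokens(50, 500)
content = 50 * 500
print(total, content, total / content)    # 637500 25000 25.5
print(windowed_input_tokens(50, 500, 10))  # 227500
```

Capping the window turns the quadratic growth into linear growth once the session is longer than the window.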
Cost controls:
Summarization: Replace old messages with dense summaries. This cuts context size while keeping the gist.
Model tiering: Use cheaper models for intermediate steps (tool calling, reflection) and expensive models only for the final synthesis.
Caching: OpenAI and Anthropic both offer prompt caching. Static prefixes (system prompt, policy docs) are cached after the first call, cutting input costs for the cached portion by roughly 50-90%, depending on the provider and model.
Tool result truncation: Tool results from web search or database queries are often verbose. Truncate them to the most relevant 500 characters before adding to context.
Session budgets: Set a hard token budget per session. When the budget is reached, force-summarize and warn the user that the session is approaching its limit.
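Two of these controls, tool result truncation and session budgets, are simple enough to sketch directly. The names below are hypothetical, the 80% warning threshold is an assumption, and the truncation keeps the first 500 characters as a simplification of "most relevant".

```python
def truncate_tool_result(text: str, max_chars: int = 500) -> str:
    """Keep the first `max_chars` characters and note how much was cut."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"


class SessionBudget:
    """Tracks token spend for one session against a hard limit."""

    def __init__(self, max_tokens: int, warn_ratio: float = 0.8):
        self.max_tokens = max_tokens
        self.warn_ratio = warn_ratio  # assumed threshold for warning the user
        self.used = 0

    def record(self, tokens: int) -> str:
        """Add usage and return 'ok', 'warn', or 'exhausted'."""
        self.used += tokens
        if self.used >= self.max_tokens:
            return "exhausted"  # caller should force-summarize here
        if self.used >= self.max_tokens * self.warn_ratio:
            return "warn"  # caller should tell the user the limit is close
        return "ok"
```

Returning a status instead of raising lets the agent loop decide whether to summarize, warn, or stop.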
Q8: What is prompt caching and how does it help with agent memory?
Answer:
Prompt caching stores the KV (key-value) attention cache for a prompt prefix on the provider's servers. When you make multiple calls with the same prefix, the cached portion is not reprocessed — you pay a fraction of the normal input token price and get lower latency.
For agents, this is particularly valuable for:
Static system prompts: A detailed system prompt with tool schemas, company policy, and persona can easily be 5,000 tokens. With caching, the first call pays full price (on Anthropic, plus a small premium to write the cache); later calls that hit the cache before it expires pay the cache read rate (about 10% of the normal input price on Anthropic; discounts vary by provider and model).
Knowledge base injection: If you always prepend the same product documentation, cache it. Savings are significant at scale.
Implementation note: To benefit from caching, the cached prefix must be identical character-for-character across calls. Even a timestamp in the system prompt breaks caching. Move dynamic content (user name, session date) to the user message, not the system message.
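```python
# Anthropic cache_control example
# Placeholder values so the snippet runs on its own; replace with real content.
static_knowledge_base_text = "<large, reused product documentation>"
user_question = "What is the refund window?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": static_knowledge_base_text,  # Large, reused
                # Marks the end of the prefix to cache
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": f"Question: {user_question}",  # Dynamic
            },
        ],
    }
]
```

Q9: How do you handle the case where a user references something said "a long time ago" that has been summarized or evicted?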
Answer:
This is the core tension of context management — compression loses detail. Strategies:
Verbatim storage alongside summaries — Store the full conversation in a database even after summarizing for the context window. When the user references something old, retrieve the original verbatim text using a fuzzy search and inject it.
Named entity extraction — Extract and separately store key entities mentioned in evicted messages: names, dates, numbers, product IDs. These are the things most likely to be referenced later. Store as structured metadata on the conversation record.
Recovery prompt — When the agent receives a reference it cannot resolve ("you mentioned the deadline earlier"), inject a recovery prompt: "The user is referencing something from earlier in the conversation. Here is a summary of what was discussed: . Is this what they are referring to?"
Explicit knowledge capture — During the conversation, when the user states an important fact ("my budget is $50,000"), immediately store it as a semantic memory record — not just in the conversation history. This way it survives summarization.
The honest answer is that no strategy is perfect. Summarization loses information. Tell users when they are in a very long session that early details may not be fully retained.
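As a rough illustration of entity extraction before eviction, the sketch below pulls dollar amounts and ISO-style dates out of messages with regular expressions. The patterns and the `extract_entities` name are assumptions; a production system would more likely use an NER model or an LLM extraction call.

```python
import re

# Deliberately narrow patterns: dollar amounts like $50,000 and dates like 2025-03-03.
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(messages: list[dict]) -> list[dict]:
    """Collect referable facts from messages that are about to be evicted."""
    facts = []
    for index, message in enumerate(messages):
        for kind, pattern in (("money", MONEY), ("date", ISO_DATE)):
            for value in pattern.findall(message["content"]):
                # message_index lets you look up the verbatim text later.
                facts.append({"kind": kind, "value": value, "message_index": index})
    return facts
```

Storing the message index alongside each entity gives a cheap pointer back to the verbatim text when the user later asks about "the deadline" or "the budget".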
Q10: What metrics do you track to monitor agent memory health in production?
Answer:
Context window utilization — Average tokens per call as a percentage of the context limit. Track percentile distribution (p50, p95, p99). A rising p99 indicates sessions that are pushing limits.
Summary trigger rate — How often is the summarizer invoked? If it is triggering on almost every turn, your context is too small for the use case.
Goal completion rate by session length — Does task success rate drop for sessions over 20 turns, 50 turns, 100 turns? A drop indicates context management failure.
Cross-session reference accuracy — For episodic memory: when a user references a past conversation, does the agent correctly retrieve and apply it? Sample and manually review.
Memory retrieval latency — Embedding search latency for episodic and semantic retrieval. Should be under 100ms. Rising latency indicates vector index health issues.
Stale memory rate — For semantic memory: how often is retrieved knowledge outdated? Track the age of retrieved facts. If the average retrieved fact is more than 30 days old for a fast-moving domain, trigger a re-indexing.
Track all these as time-series metrics (Prometheus + Grafana or equivalent) with alerts on degradation thresholds.
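Before wiring up dashboards, the first two metrics can be computed offline from logged calls. This is a sketch using only the standard library; the function names and the shape of the logged data are assumptions.

```python
import statistics


def utilization_percentiles(tokens_per_call: list[int], context_limit: int) -> dict[str, float]:
    """p50/p95/p99 of context usage as a fraction of the context limit."""
    ratios = [tokens / context_limit for tokens in tokens_per_call]
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(ratios, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def summary_trigger_rate(summarizer_calls: int, total_turns: int) -> float:
    """Fraction of turns on which the summarizer ran."""
    return summarizer_calls / total_turns if total_turns else 0.0
```

Note that `statistics.quantiles` needs at least two data points, so guard against near-empty samples when running it per session.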
Summary
Agentic memory interview questions test whether you understand the production constraints that demo agents never reveal. Know the four memory types, be able to explain the tradeoffs of rolling window vs summarization, understand token cost scaling, and have concrete answers about how you would handle edge cases like old references and goal drift. These are the answers that separate engineers who have shipped agents from engineers who have only read about them.