Interview: Choose the Right Memory for a Use Case

Q1: A clinical pharmacist bot needs to handle 30-minute consultations. What memory type and why?

Answer:

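A 30-minute consultation can run to 20-40 conversation turns. ConversationBufferMemory keeps every turn verbatim, so the prompt grows with each exchange; with long clinical answers this drives up cost and latency and can eventually exceed the model's context window.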

Recommendation: ConversationSummaryBufferMemory

Python

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    max_token_limit=2000,   # 2000 tokens of verbatim history
    memory_key="chat_history",
    return_messages=True,
)

Rationale:

Keeps roughly the last 2000 tokens verbatim (about the most recent 5-10 turns)
Compresses older turns into a running summary
The summary should preserve which drugs were discussed, which indications were mentioned, and what dosing was agreed upon
Recent turns are verbatim for precise follow-up questions
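
The behaviour described above can be sketched in plain Python. This is an illustration of the idea, not LangChain's implementation: SummaryBufferSketch, estimate_tokens, and fake_summarize are hypothetical names, the 4-characters-per-token rule is a rough assumption, and the summarizer is a stub standing in for the LLM call.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


class SummaryBufferSketch:
    """Keeps recent turns verbatim and folds older ones into a running summary."""

    def __init__(self, summarize: Callable[[str, List[Turn]], str], max_token_limit: int):
        self.summarize = summarize
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.buffer: List[Turn] = []

    def _buffer_tokens(self) -> int:
        return sum(estimate_tokens(text) for _, text in self.buffer)

    def add_turn(self, speaker: str, text: str) -> None:
        self.buffer.append((speaker, text))
        evicted: List[Turn] = []
        # Evict the oldest turns until the verbatim buffer fits the budget.
        while self._buffer_tokens() > self.max_token_limit and len(self.buffer) > 1:
            evicted.append(self.buffer.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)


def fake_summarize(previous: str, evicted: List[Turn]) -> str:
    """Stand-in for the LLM call: only records how many turns were compressed."""
    already = int(previous.split()[0]) if previous else 0
    return f"{already + len(evicted)} earlier turns summarized"


memory = SummaryBufferSketch(fake_summarize, max_token_limit=50)
for i in range(10):
    memory.add_turn("human", f"Question {i}: " + "x" * 40)

print(memory.summary)      # 7 earlier turns summarized
print(len(memory.buffer))  # 3
```

Each turn here is estimated at 13 tokens, so a 50-token budget holds three turns verbatim and the other seven are folded into the summary.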

What to watch for:

Precise lab values (INR 3.4) may be approximated in summary → store clinical facts separately
Summarization adds an extra LLM call whenever older turns are compressed (~$0.0002/turn for gpt-4o-mini), which is acceptable for clinical use
Use gpt-4o-mini for summarization (not gpt-4o) to minimize cost
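
To see where a figure of this order can come from, here is a small worked estimate. The token counts and per-million-token prices below are assumptions chosen for illustration, not quoted pricing; substitute current numbers before relying on the result.

```python
def summarization_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one summarization call given token counts and per-million prices."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Assumed figures for illustration only.
per_call = summarization_cost_usd(
    input_tokens=1_000,             # old summary plus the turns being compressed
    output_tokens=100,              # the new summary
    input_price_per_million=0.15,   # assumed USD per 1M input tokens
    output_price_per_million=0.60,  # assumed USD per 1M output tokens
)

# Upper bound: assume every one of 40 turns triggers a summarization call.
per_consultation = per_call * 40

print(f"{per_call:.5f} USD per summarization call")             # 0.00021
print(f"{per_consultation:.4f} USD per 40-turn consultation")   # 0.0084
```

Under these assumptions, even the worst case stays below one cent per consultation.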

Alternative if precise recall is critical:

Store structured clinical data outside memory, use memory only for conversational flow:

Python

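from langchain.memory import ConversationSummaryBufferMemory

class ClinicalSession:
    def __init__(self, llm):
        # llm: the chat model used for summarization, passed in by the caller
        self.memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=2000)
        self.clinical_facts = {}   # {"inr": "3.4", "drug": "warfarin", "dose": "5mg"}

    def extract_facts(self, text: str) -> None:
        """Extract precise clinical values with a fact extractor chain."""
        # Placeholder: a separate LLM call would extract key-value pairs
        # and merge them into self.clinical_facts
        pass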
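
As a rough illustration of what extract_facts might do, the sketch below swaps the LLM extractor for regular expressions so it can run offline. FACT_PATTERNS and both patterns are hypothetical and cover only the INR and dose examples used here; a real extractor would need clinical review.

```python
import re
from typing import Dict

# Hypothetical patterns for illustration only.
FACT_PATTERNS = {
    "inr": re.compile(r"\bINR\s*(?:of|is|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE),
    "dose": re.compile(r"\b(\d+(?:\.\d+)?\s?(?:mg|mcg|g|mL))\b", re.IGNORECASE),
}


def extract_facts(text: str, facts: Dict[str, str]) -> Dict[str, str]:
    """Update `facts` in place with the last value found for each pattern."""
    for name, pattern in FACT_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            facts[name] = matches[-1]
    return facts


clinical_facts: Dict[str, str] = {}
extract_facts("Patient on warfarin 5mg daily, latest INR 3.4.", clinical_facts)
print(clinical_facts)  # {'inr': '3.4', 'dose': '5mg'}
```

Because the values are stored as exact strings outside the conversation memory, later summarization cannot round or paraphrase them.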

Q2: A medical research assistant helps users explore literature over multiple days. How do you implement memory across sessions?

Answer:

Multi-session memory requires persistence. In-memory options (buffer, summary) reset on restart.

Recommendation: Persistent VectorStoreRetrieverMemory + session metadata

Python

from langchain.memory import VectorStoreRetrieverMemory
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def get_user_memory(user_id: str) -> VectorStoreRetrieverMemory:
    """Load or create persistent memory for a specific user."""
    vectorstore = Chroma(
        collection_name=f"research_memory_{user_id}",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
        persist_directory=f"./memory_db/{user_id}",
    )
    return VectorStoreRetrieverMemory(
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        memory_key="past_research",
    )

# Day 1
user_memory = get_user_memory("researcher_42")
user_memory.save_context(
    {"input": "What do we know about warfarin pharmacogenomics?"},
    {"output": "CYP2C9 and VKORC1 polymorphisms explain 40-50% of warfarin dose variance..."},
)

# Day 2: a new process rebuilds the memory from the persisted collection
user_memory = get_user_memory("researcher_42")
# A related query retrieves the Day 1 exchange by semantic similarity
past = user_memory.load_memory_variables({"input": "Tell me more about CYP2C9 variants"})
print(past["past_research"])  # Shows Day 1 context about warfarin pharmacogenomics

Why vector memory over summary for research:

Researchers return to specific topics unpredictably ("what did we say about CYP2C9?")
Vector retrieval finds semantically relevant past exchanges, not just recent ones
Researchers may have sessions about 20+ different topics over months
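
The difference between "relevant" and "recent" can be shown with a toy retriever. This sketch uses word-count vectors and cosine similarity as a stand-in for a learned embedding model, so it only matches on shared words; embed, cosine, and retrieve are hypothetical helper names.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, memories: List[str], k: int = 1) -> List[str]:
    """Return the k stored exchanges most similar to the query."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "Day 1: CYP2C9 and VKORC1 polymorphisms affect warfarin dose",
    "Day 5: statin trials and LDL targets",
    "Day 9: metformin and renal function thresholds",
]

print(retrieve("what did we say about CYP2C9?", memories))
# ['Day 1: CYP2C9 and VKORC1 polymorphisms affect warfarin dose']
```

The oldest entry wins because it matches the query, which a recency-based window or a rolling summary would not guarantee.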

Metadata for organizing memories:

Python

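# Add metadata to memory entries for filtering
# (assumes `vectorstore` is the user's Chroma collection and that
# `question`, `answer`, and `user_id` are already in scope)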
vectorstore.add_texts(
    texts=[f"Q: {question}\nA: {answer}"],
    metadatas=[{
        "user_id": user_id,
        "topic": "pharmacogenomics",
        "date": "2026-05-16",
        "session_id": "session_abc",
    }],
)

# Later: retrieve only pharmacogenomics memories
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"topic": "pharmacogenomics"},
    }
)

Q3: A customer support bot handles 10,000 sessions per day, each lasting 5-10 turns. What's the architecture?

Answer:

At 10,000 sessions/day, in-process memory does not work: servers restart, and load balancers spread a session's requests across instances. Keep the request handlers stateless and hold conversation state in a shared external store.

Recommendation: Redis-backed buffer memory with session isolation

Python

import redis
import json
from langchain_core.messages import messages_to_dict, messages_from_dict, HumanMessage, AIMessage

class RedisSessionMemory:
    """Redis-backed per-session conversation history."""

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        ttl_seconds: int = 3600,   # Sessions expire after 1 hour
        max_turns: int = 10,       # Window of last 10 turns
    ):
        self.client = redis.from_url(redis_url, decode_responses=True)
        self.ttl = ttl_seconds
        self.max_turns = max_turns

    def _key(self, session_id: str) -> str:
        return f"chat:{session_id}"

    def load(self, session_id: str) -> list:
        data = self.client.get(self._key(session_id))
        if not data:
            return []
        return messages_from_dict(json.loads(data))

    def save(self, session_id: str, history: list) -> None:
        # Keep only last max_turns pairs
        max_msgs = self.max_turns * 2
        if len(history) > max_msgs:
            history = history[-max_msgs:]
        
        self.client.setex(
            self._key(session_id),
            self.ttl,
            json.dumps(messages_to_dict(history)),
        )

    def clear(self, session_id: str) -> None:
        self.client.delete(self._key(session_id))


# Stateless request handler
redis_memory = RedisSessionMemory()

def handle_request(session_id: str, question: str, chain) -> str:
    history = redis_memory.load(session_id)
    # Assumes the chain ends in a string output parser, so `response` is a str
    response = chain.invoke({"question": question, "chat_history": history})
    history.extend([HumanMessage(content=question), AIMessage(content=response)])
    redis_memory.save(session_id, history)
    return response

Architectural decisions:

Redis: shared across all server instances — no affinity needed
TTL: 1 hour — prevents memory accumulation for abandoned sessions
Window of 10 turns: enough for most support interactions without overflow
Stateless handlers: any server can handle any request
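
The TTL and window decisions can be checked without a running Redis server. The sketch below is a dict-backed stand-in that exposes the same three calls RedisSessionMemory uses (get, setex, delete); InMemoryTTLStore and trim_to_window are hypothetical names, and the injectable clock exists only to make expiry testable.

```python
import json
import time
from typing import Callable, Dict, List, Optional, Tuple


class InMemoryTTLStore:
    """Dict-backed stand-in for the Redis calls used above: get, setex, delete."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def setex(self, name: str, ttl_seconds: int, value: str) -> None:
        self._data[name] = (self._clock() + ttl_seconds, value)

    def get(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[name]  # lazily drop expired sessions
            return None
        return value

    def delete(self, name: str) -> None:
        self._data.pop(name, None)


def trim_to_window(history: List[dict], max_turns: int) -> List[dict]:
    """Keep the last `max_turns` question/answer pairs (two messages per turn)."""
    if max_turns <= 0:
        return []
    return history[-max_turns * 2:]


now = [0.0]  # fake clock, advanced manually
store = InMemoryTTLStore(clock=lambda: now[0])
history = [{"role": "human" if i % 2 == 0 else "ai", "content": f"msg {i}"} for i in range(30)]

store.setex("chat:abc", 3600, json.dumps(trim_to_window(history, max_turns=10)))
print(len(json.loads(store.get("chat:abc"))))  # 20

now[0] = 3601.0  # just past the one-hour TTL
print(store.get("chat:abc"))  # None
```

A stand-in like this can be injected in unit tests, while production keeps the real Redis client.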

Q4: You're building a multi-user platform where users can share conversation threads. How does memory work?

Answer:

Shared threads require:

Per-thread memory (not per-user)
Consistent history visible to all participants
Access control (who can add to thread)

Python

import json
import time

from langchain_core.messages import AIMessage, HumanMessage

class SharedThreadMemory:
    """Memory for shared conversation threads."""

    def __init__(self, redis_client, thread_id: str, participant_ids: list[str]):
        self.redis = redis_client
        self.thread_id = thread_id
        self.participants = set(participant_ids)

    def _key(self) -> str:
        return f"thread:{self.thread_id}"

    def load(self) -> list:
        data = self.redis.lrange(self._key(), 0, -1)
        return [json.loads(msg) for msg in data]

    def append(self, user_id: str, role: str, content: str) -> None:
        if user_id not in self.participants:
            raise PermissionError(f"User {user_id} is not a participant")
        
        msg = {
            "role": role,
            "content": content,
            "user_id": user_id,
            "timestamp": time.time(),
        }
        self.redis.rpush(self._key(), json.dumps(msg))

    def get_as_messages(self) -> list:
        raw = self.load()
        result = []
        for msg in raw:
            if msg["role"] == "human":
                result.append(HumanMessage(
                    content=f"[{msg.get('user_id', 'user')}]: {msg['content']}"
                ))
            else:
                result.append(AIMessage(content=msg["content"]))
        return result

Q5: Your chatbot was working but now users report it "forgets" things from earlier. How do you debug this?

Answer:

This is a classic memory failure. Systematic debugging:

Python

# Step 1: Check what's actually in memory
def diagnose_memory(memory, session_id: str = None) -> dict:
    """Inspect current memory state."""
    loaded = memory.load_memory_variables({})

    if hasattr(memory, "chat_memory"):
        messages = memory.chat_memory.messages
        return {
            "memory_type": type(memory).__name__,
            "n_messages": len(messages),
            "total_chars": sum(len(m.content) for m in messages),
            "oldest_message": messages[0].content[:100] if messages else None,
            "newest_message": messages[-1].content[:100] if messages else None,
            "has_summary": hasattr(memory, "moving_summary_buffer") and bool(memory.moving_summary_buffer),
        }
    return {"loaded_keys": list(loaded.keys())}

# Step 2: Check if memory key matches prompt variable
# BUG: memory_key="history" but prompt uses {chat_history} → no memory injected
assert memory.memory_key in prompt.input_variables, \
    f"Memory key '{memory.memory_key}' not in prompt variables {prompt.input_variables}"

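# Step 3: Verify save_context is being called
# Wrap save_context to log every memory save
original_save = memory.save_context
def logging_save(inputs, outputs):
    print(f"[Memory] Saving: Q={inputs.get('input','')[:50]} A={outputs.get('output','')[:50]}")
    original_save(inputs, outputs)
# LangChain memory classes are Pydantic models, which typically reject plain
# assignment of attributes that are not declared fields, so bypass __setattr__
object.__setattr__(memory, "save_context", logging_save)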

# Step 4: Check for session ID confusion in multi-user setup
# Each user must have their OWN memory instance, not shared

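# Step 5: Check token limit
if hasattr(memory, "max_token_limit"):
    loaded = memory.load_memory_variables({})
    history = loaded.get(memory.memory_key, [])
    # history is a list of messages when return_messages=True, otherwise one string
    total = len(history) if isinstance(history, str) else sum(len(m.content) for m in history)
    tokens = total // 4   # rough estimate: about 4 characters per token
    print(f"Memory tokens: {tokens} / {memory.max_token_limit}")
    if tokens < 100:
        print("WARNING: Very little memory; check if summarization is too aggressive")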

# Common causes:
# 1. memory_key mismatch (most common)
# 2. save_context not called after each turn
# 3. Multiple memory instances per user (session management bug)
# 4. max_token_limit too low → aggressive summarization loses detail
# 5. Server restart without persistence → memory lost
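
Since the key mismatch is listed as the most common cause, it is worth checking before a chain is ever built. The sketch below does this in plain Python, assuming an f-string style prompt template; template_variables and check_memory_key are hypothetical helpers, not part of LangChain.

```python
from string import Formatter
from typing import Set


def template_variables(template: str) -> Set[str]:
    """Return the {placeholder} names used in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_memory_key(template: str, memory_key: str) -> bool:
    """True if the memory key appears as a placeholder in the prompt template."""
    return memory_key in template_variables(template)


template = "Conversation so far:\n{chat_history}\n\nUser: {question}"

print(check_memory_key(template, "chat_history"))  # True
print(check_memory_key(template, "history"))       # False
```

Running a check like this at startup turns a silent "the bot forgets" bug into an immediate, visible failure.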
