Interview: Multi-Agent Pattern Questions
10 Q&A pairs covering multi-agent patterns for AI engineering interviews. Topics: supervisor vs peer, pipeline, when to use which, failure modes, and evaluation.
Q1: What are the three main multi-agent patterns?
A: The three core patterns are:
Supervisor-Worker: A central supervisor agent decomposes a goal into subtasks and delegates each to specialist worker agents. The supervisor coordinates, collects results, and synthesizes the final output. Best for tasks with independent subtasks.
Peer-to-Peer (Debate): Agents communicate directly without a coordinator. One agent proposes, another critiques. They iterate until convergence. Best for adversarial verification and high-stakes decisions.
Pipeline: A linear sequence where each agent processes the output of the previous one. Best for tasks with clear, ordered stages (research → analyze → write).
Q2: When would you choose supervisor over pipeline?
A: Choose supervisor when subtasks can run in parallel and may require dynamic routing. For example: a research supervisor might simultaneously dispatch agents to search PubMed, the FDA database, and drug interaction databases. The results come back concurrently.
Choose pipeline when tasks must happen sequentially and each stage consumes the full output of the previous stage. For example: parsing a document → extracting entities → generating a summary. Stage 2 needs all of stage 1's output before starting.
The key question is: Can the subtasks run at the same time? If yes, supervisor. If each step depends on the previous, pipeline.
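The contrast can be sketched in a few lines. This is a minimal illustration, not a framework implementation: `search_source`, `run_supervisor`, and `run_pipeline` are hypothetical names, and the "agents" are stand-in functions rather than LLM calls.

```python
import asyncio


async def search_source(source: str, query: str) -> str:
    # Hypothetical worker: stands in for a real search-agent call.
    await asyncio.sleep(0.01)
    return f"{source} results for {query!r}"


async def run_supervisor(query: str, sources: list[str]) -> list[str]:
    # Independent subtasks: dispatch every worker at once and wait for all.
    # asyncio.gather returns results in the order the workers were passed in.
    return await asyncio.gather(*(search_source(s, query) for s in sources))


def run_pipeline(document: str) -> str:
    # Dependent stages: each one consumes the full output of the previous one.
    tokens = document.strip().split()                       # stage 1: parse
    entities = [t for t in tokens if t[0].isupper()]        # stage 2: extract
    return f"{len(entities)} entities: {', '.join(entities)}"  # stage 3: summarize
```

The supervisor finishes in roughly the time of its slowest worker, while the pipeline's latency is the sum of its stages.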
Q3: How do you prevent a supervisor from creating infinite subtask loops?
A: Three mechanisms:
- Call budget: the supervisor is given a budget of N total tool calls. After N, it must produce a final answer.
- Task registry: the supervisor maintains a set of completed task IDs. Before creating a task, it checks the registry. If the same task has been run before, it skips it.
- Cycle detection: when building the task graph, validate that no task is a prerequisite of itself, directly or through a chain of other tasks.
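In practice, most agent frameworks expose a hard stop on the loop, though the parameter name varies. For example, LangChain's AgentExecutor takes max_iterations, LangGraph uses a recursion_limit in the run config, and AutoGen chats accept max_turns.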
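A framework-independent sketch of the three guards might look like the following. `TaskTracker`, `BudgetExceeded`, and `has_cycle` are hypothetical names introduced for illustration, and the task graph is assumed to be a dict mapping each task ID to the IDs it depends on.

```python
class BudgetExceeded(Exception):
    """Raised when the supervisor has used up its call budget."""


class TaskTracker:
    """Tracks a total call budget and a registry of completed task IDs."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0
        self.completed: set[str] = set()

    def should_run(self, task_id: str) -> bool:
        # Registry check: skip tasks that have already been completed.
        if task_id in self.completed:
            return False
        # Budget check: past N calls the supervisor must stop and answer.
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} calls used up")
        self.calls += 1
        return True

    def mark_done(self, task_id: str) -> None:
        self.completed.add(task_id)


def has_cycle(graph: dict[str, list[str]]) -> bool:
    """Depth-first search; a back edge to an in-progress node means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY or (state == WHITE and visit(dep)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)
```

Running `has_cycle` once, before any task is dispatched, rejects a bad decomposition up front; the budget and registry then guard the loop at run time.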
Q4: What is the "lost in translation" failure mode in pipelines?
A: Each time output passes between pipeline stages, information can be lost or distorted. If stage 1 produces rich data but passes only a summary to stage 2, stage 2 works with less than what stage 1 knew.
Mitigations:
- Use typed Pydantic schemas at each boundary — schemas force explicit fields and prevent silent data loss
- Pass through all relevant fields from earlier stages, not just the most recent output
- Add a "context" or "metadata" field that stages can populate and subsequent stages can read
This is why the content-writing stage in a drug pipeline (the one that produces ContentOutput) should receive both ResearchOutput and AnalysisOutput, not just the analysis. The writer needs the raw research facts too.
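A minimal sketch of that boundary follows. To keep it dependency-free it uses standard-library dataclasses as a stand-in for Pydantic models; the class names come from the text above, but every field name and the `write_content` function are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchOutput:
    drug: str
    facts: list[str]      # raw findings from the research stage
    sources: list[str]    # where each finding came from


@dataclass(frozen=True)
class AnalysisOutput:
    risk_level: str
    rationale: str


@dataclass(frozen=True)
class ContentOutput:
    article: str
    cited_sources: list[str]


def write_content(research: ResearchOutput, analysis: AnalysisOutput) -> ContentOutput:
    # The writer sees the raw facts as well as the analysis, so nothing
    # is lost to an intermediate summary.
    lines = [f"{research.drug}: risk level {analysis.risk_level}. {analysis.rationale}"]
    lines += [f"- {fact}" for fact in research.facts]
    return ContentOutput(article="\n".join(lines), cited_sources=list(research.sources))
```

With Pydantic, the same classes would additionally validate field types when each stage constructs its output.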
Q5: How do you evaluate a multi-agent system?
A: Evaluation has three levels:
Stage-level evaluation: evaluate each agent independently. Does the researcher return accurate facts? Does the analyst correctly classify risk levels? Run a golden dataset through each stage and check output against ground truth.
End-to-end evaluation: evaluate the full system on a task completion metric. For a content pipeline: does the final article contain accurate information, cite sources, and meet safety requirements? Have human raters review a regular sample, for example 50 outputs per week.
Trajectory evaluation: for longer-running agents, compare the steps taken against a reference trajectory. Did the agent use the right tools in roughly the right order? Excessive tool calls suggest inefficiency; missing tool calls suggest capability gaps.
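Stage-level and trajectory checks are easy to automate. The sketch below assumes a stage is a plain callable and a trajectory is a list of tool names; `stage_accuracy` and `trajectory_report` are hypothetical helpers, and exact-match scoring is a simplification of what a real grader would do.

```python
def stage_accuracy(stage, golden: list[tuple[str, str]]) -> float:
    """Fraction of golden (input, expected) pairs the stage gets exactly right."""
    correct = sum(1 for inp, expected in golden if stage(inp) == expected)
    return correct / len(golden)


def trajectory_report(actual: list[str], reference: list[str]) -> dict:
    """Compare the tools an agent called against a reference trajectory."""
    missing = [tool for tool in reference if tool not in actual]
    extra_calls = max(0, len(actual) - len(reference))
    # In-order check: the reference must appear as a subsequence of actual.
    remaining = iter(actual)
    in_order = all(tool in remaining for tool in reference)
    return {"missing": missing, "extra_calls": extra_calls, "in_order": in_order}
```

A non-empty `missing` list points at a capability gap, while a large `extra_calls` count points at inefficiency.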
Q6: What happens when one agent in a pipeline fails?
A: Three options:
- Hard fail: the pipeline raises an error and stops. Use when you cannot produce a meaningful partial result. The caller gets an error and can retry.
- Graceful degradation: skip the failed stage and use a fallback. For example, if the analysis stage fails, write content from the research output alone with a reduced confidence flag.
- Retry with backoff: the orchestrator retries the failed stage up to N times with exponential backoff. Use for transient failures (API timeout, rate limit).
Always log which stage failed with enough context to reproduce the failure. Without this, debugging multi-agent failures becomes very hard.
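The three options compose naturally into one stage runner. This is a sketch under assumptions: `TransientError` is a hypothetical marker for retryable failures, `run_stage` is an illustrative name, and the `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import logging
import time

logger = logging.getLogger("pipeline")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def run_stage(name, fn, payload, retries=3, base_delay=0.5,
              fallback=None, sleep=time.sleep):
    for attempt in range(retries + 1):
        try:
            return fn(payload)
        except TransientError as exc:
            # Retry with exponential backoff: 0.5s, 1s, 2s, ...
            logger.warning("stage=%s attempt=%d transient: %s", name, attempt + 1, exc)
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
        except Exception:
            # Non-transient failure: log the stage and input, then stop retrying.
            logger.exception("stage=%s failed on payload=%r", name, payload)
            break
    if fallback is not None:
        # Graceful degradation: produce a reduced result instead of failing.
        return fallback(payload)
    # Hard fail: surface the error to the caller.
    raise RuntimeError(f"stage {name!r} failed")
```

For the analysis stage described above, the fallback could return the research output tagged with a reduced-confidence flag.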
Q7: How do peer-to-peer agents reach convergence?
A: Three approaches:
Explicit agreement signal: the critic agent includes a specific phrase like "I am satisfied with this assessment" when it has no further objections. The orchestrator detects this and stops.
Round limit: a hard maximum number of debate rounds (usually 2-3). After this, the proposer gives a final synthesis regardless of whether the critic is satisfied.
Judge agent: a third LLM is used as a neutral judge. After each round, the judge reads the debate and decides whether the proposer has adequately addressed all concerns. If yes, the debate ends.
In practice, combine all three: the debate ends on the first of (explicit agreement, 3 rounds, judge satisfied).
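A combined stopping rule might be sketched as follows. The three agents are passed in as plain callables (`propose`, `critique`, `judge`), which is an assumption made for illustration; in a real system each would wrap an LLM call.

```python
AGREEMENT_PHRASE = "I am satisfied with this assessment"


def run_debate(propose, critique, judge, max_rounds=3):
    """Return (final_proposal, stop_reason, rounds_used)."""
    transcript = []
    proposal = propose(None)  # initial proposal, no feedback yet
    for round_no in range(1, max_rounds + 1):
        feedback = critique(proposal)
        transcript.append((proposal, feedback))
        # 1. Explicit agreement signal from the critic.
        if AGREEMENT_PHRASE.lower() in feedback.lower():
            return proposal, "agreement", round_no
        # 2. A neutral judge decides the concerns have been addressed.
        if judge(transcript):
            return proposal, "judge", round_no
        # Otherwise revise; after the last round this is the final synthesis.
        proposal = propose(feedback)
    # 3. Round limit reached without agreement.
    return proposal, "round_limit", max_rounds
```

Returning the stop reason alongside the answer lets the caller treat a `round_limit` result differently from a genuine agreement.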
Q8: How do you control the cost of multi-agent systems?
A: Multi-agent systems can get expensive because each turn of the conversation costs tokens.
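Cheaper models for orchestration: use GPT-4o only for the most complex tasks. For routing decisions, convergence checks, and judge calls, use a smaller model such as GPT-4o mini, which is roughly an order of magnitude cheaper per token (check current pricing for the exact ratio).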
Context trimming: agents don't need the full conversation history every turn. Summarize old turns and only pass recent context. This is especially important for long debates.
Short-circuit on easy cases: before routing to a multi-agent pipeline, classify whether the query is simple enough for a single agent response. Only escalate complex queries to the full pipeline.
Stage-level caching: cache the output of each pipeline stage. If a later stage fails and you retry, you don't need to re-run earlier stages.
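Two of these levers, stage-level caching and short-circuiting, can be sketched without any model calls. `StageCache` and `route` are hypothetical names; the word-count heuristic in `route` is only a placeholder for a real classifier (which could itself be a small model), and the cache assumes payloads are JSON-serializable.

```python
import hashlib
import json


class StageCache:
    """In-memory cache keyed by stage name plus a hash of the stage input."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def run(self, stage_name, fn, payload):
        raw = json.dumps([stage_name, payload], sort_keys=True).encode()
        key = hashlib.sha256(raw).hexdigest()
        if key in self._store:
            self.hits += 1          # reuse the earlier result on retry
            return self._store[key]
        result = fn(payload)
        self._store[key] = result
        return result


def route(query: str, simple_max_words: int = 8) -> str:
    # Placeholder heuristic: short queries go to a single agent,
    # everything else escalates to the full multi-agent pipeline.
    if len(query.split()) <= simple_max_words:
        return "single_agent"
    return "multi_agent"
```

If the writing stage fails and the pipeline is retried, the research and analysis stages return from the cache instead of spending tokens again.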
Q9: How do you handle disagreement in a peer-to-peer debate?
A: If agents fail to converge after maximum rounds, you have three options:
- Return the proposer's last answer: the proposer has the task-specific expertise, so their final position is reasonable even if the critic disagrees.
- Return both positions with a flag: surface both the proposal and the critique to the user with a "review recommended" flag. Useful in high-stakes domains where a human should make the final call.
- Escalate to a human reviewer: add the unresolved debate to a review queue. A human SME reads the transcript and makes the final decision. Log the debate so the reviewer has full context.
The right choice depends on the domain risk level. For medical information, escalate to human review. For content writing, the proposer's last answer is usually acceptable.
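One way to encode that policy is a small resolver keyed on risk level. The three-tier scale ("high", "medium", "low"), the `Resolution` type, and the list used as a review queue are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    answer: Optional[str]     # None while a human decision is pending
    critique: Optional[str]
    action: str


def resolve(proposal: str, critique: str, risk: str, review_queue: list) -> Resolution:
    if risk == "high":
        # e.g. medical information: a human makes the final call.
        review_queue.append({"proposal": proposal, "critique": critique})
        return Resolution(None, critique, "escalated")
    if risk == "medium":
        # Surface both positions and flag the result for review.
        return Resolution(proposal, critique, "review_recommended")
    # e.g. content writing: the proposer's last answer is acceptable.
    return Resolution(proposal, None, "accepted_proposer")
```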
Q10: Design a multi-agent system for automated literature review.
A: Requirements: given a medical research question, search PubMed, extract relevant findings, assess quality of evidence, and produce a structured summary.
Pattern choice: Supervisor-Worker for the search phase (parallel across sources), then Pipeline for the synthesis phase.
Architecture:
User Query
│
▼
[Query Expansion Agent] ← generates 3 search variants
│
▼
[Supervisor] ← dispatches parallel searches
├── [PubMed Search Agent]
├── [Clinical Trials Agent]
└── [Preprint Agent]
│
▼ (collect all results)
[Evidence Quality Rater] ← pipeline stage
│
▼
[Synthesis Writer] ← pipeline stage
│
▼
[Fact Checker] ← peer-to-peer: checks claims against retrieved papers
│
▼
Final Literature Summary
Safety controls:
- Each agent has a tool allowlist — search agents can only search, not post
- Max 50 papers retrieved total (prevent context overflow)
- Fact checker must cite specific paper IDs for each claim it verifies
- Human review queue for high-impact outputs
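The first three controls can be enforced in code rather than in prompts. The sketch below is illustrative only: the agent and tool names in `TOOL_ALLOWLIST`, the paper dict shape (`{"id": ...}`), and the claim dict shape (`{"text": ..., "paper_ids": [...]}`) are all hypothetical.

```python
MAX_PAPERS = 50

# Hypothetical agent and tool names: each agent may call only its listed tools.
TOOL_ALLOWLIST = {
    "pubmed_search_agent": {"search_pubmed"},
    "fact_checker": {"read_paper"},
}


def call_tool(agent: str, tool: str, tools: dict, *args):
    # Reject any tool that is not on the calling agent's allowlist.
    if tool not in TOOL_ALLOWLIST.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    return tools[tool](*args)


def cap_papers(batches: list, limit: int = MAX_PAPERS) -> list:
    # Merge worker results, drop duplicate IDs, and stop at the cap.
    seen, merged = set(), []
    for batch in batches:
        for paper in batch:
            if paper["id"] not in seen and len(merged) < limit:
                seen.add(paper["id"])
                merged.append(paper)
    return merged


def unverified_claims(claims: list) -> list:
    # A claim counts as verified only if the fact checker cited paper IDs.
    return [c["text"] for c in claims if not c.get("paper_ids")]
```

Any text returned by `unverified_claims` is a natural candidate for the human review queue.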