
AutoGen Essentials · Lesson 11 of 11

Interview: AutoGen vs LangGraph — When Would You Choose?

How to Use This Lesson

This lesson is structured as an interview. Each question is what a technical interviewer might ask a systems engineer being evaluated for a role involving multi-agent AI systems. The answers are detailed enough to demonstrate real understanding — not surface-level familiarity.

Read through each question before the answer. Try to formulate your own response. Compare it to the answer provided.


Q1: What is AutoGen and what problem does it solve?

Interviewer prompt: "Walk me through what AutoGen is. Assume I understand LLMs but have never used an agent framework."

Strong answer:

AutoGen is a framework from Microsoft Research that makes it practical to build systems where multiple AI agents collaborate on a task through conversation.

The problem it solves is coordination. When you have a task that is too complex for a single LLM call — say, "write production-grade Python, test it, fix any bugs, and document it" — you need agents that can iterate with each other. Without a framework, you write a lot of orchestration code: who speaks when, how results are passed between agents, when the workflow ends.

AutoGen's key design decision is to model agent interaction as a conversation thread. Every agent reads the same shared message history and contributes to it. This is intuitive because it matches how LLMs are trained — on human conversation — so agents naturally know how to respond to each other without special wiring.

The two core agent types are AssistantAgent (LLM-backed, generates responses) and UserProxyAgent (executor, manages human input). In a two-agent setup, the assistant thinks and the proxy executes — creating a loop where the assistant can write code, see the execution results, and self-correct.
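The shared-thread idea can be illustrated without AutoGen at all. The following is a minimal sketch in plain Python: toy_assistant, toy_proxy, and run_conversation are hypothetical stand-ins (not AutoGen APIs), and the "execution result" is hard-coded rather than produced by running code.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Agent = Callable[[List[Message]], str]


def toy_assistant(history: List[Message]) -> str:
    """Stand-in for an LLM-backed agent: proposes code, then wraps up once it sees a result."""
    saw_success = any(
        m["name"] == "proxy" and "exitcode: 0" in m["content"] for m in history
    )
    if saw_success:
        return "The code ran successfully. TERMINATE"
    return "Try this: print(2 + 3)"


def toy_proxy(history: List[Message]) -> str:
    """Stand-in for the executor: pretends it ran whatever the assistant proposed."""
    return "exitcode: 0 (execution succeeded)\nCode output:\n5"


def run_conversation(task: str, max_turns: int = 6) -> List[Message]:
    """Alternate between the two agents; both read and append to one shared history."""
    history: List[Message] = [{"name": "proxy", "content": task}]
    speakers = [("assistant", toy_assistant), ("proxy", toy_proxy)]
    for turn in range(max_turns):
        name, agent = speakers[turn % 2]
        reply = agent(history)
        history.append({"name": name, "content": reply})
        if "TERMINATE" in reply:
            break
    return history
```

The point to draw out in an interview is that neither function is wired to the other: each one only reads the shared history and appends to it.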


Q2: When would you choose AutoGen over LangGraph?

Interviewer prompt: "Your team is building a multi-agent system. One engineer says use LangGraph, another says AutoGen. How do you decide?"

Strong answer:

The decision comes down to the nature of the workflow: conversational-iterative versus deterministic-sequential.

Choose AutoGen when:

  • The workflow is inherently conversational and non-linear. Example: a coder and a reviewer going back and forth on a piece of code until it passes review. Neither agent knows in advance how many rounds this will take.
  • Code generation and execution is central. AutoGen's first-class code execution support (UserProxyAgent detecting and running code blocks automatically) is much easier to set up than wiring this manually in LangGraph.
  • You want rapid prototyping. Getting a two-agent conversation running in AutoGen takes roughly 20 lines of code. LangGraph requires defining a typed state class, nodes, edges, and a compiled graph.
  • Human-in-the-loop at flexible points. AutoGen's human_input_mode makes this easy.

Choose LangGraph when:

  • The workflow must be auditable and deterministic. Compliance, healthcare, and finance systems often require that you can prove exactly what happened and in what order. LangGraph's explicit graph edges make this easy.
  • You need parallel node execution. LangGraph supports fan-out/fan-in patterns. AutoGen conversations are sequential.
  • Fine-grained state management is required. LangGraph's typed state dict is easier to persist, version, and snapshot than AutoGen's conversation history.
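  • You need sophisticated retry and error handling per node, with tracing to match. Because each step is an explicit node, failures can be handled where they occur, and LangGraph integrates with LangSmith for detailed observability.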

The honest middle ground: Many production systems use both. AutoGen for the creative, iterative inner loops (code generation, research) and LangGraph for the outer orchestration (routing between tools, managing overall workflow state).


Q3: How does code execution work in AutoGen, and what are the architectural implications?

Interviewer prompt: "Explain the code execution pipeline in AutoGen."

Strong answer:

When AssistantAgent generates a response that includes a fenced code block (```python ... ```), UserProxyAgent performs the following:

  1. Detects the code block in the message content using regex pattern matching
  2. Extracts the code and writes it to a temporary .py file in the configured work_dir
  3. Spawns a subprocess (via subprocess.run) and executes the file
  4. Captures stdout and stderr
  5. Formats the result as exitcode: N (execution succeeded/failed)\nCode output:\n...
  6. Sends this as the next message back to the assistant

The assistant then sees the execution result in the conversation history and can respond to it — fixing bugs if the exit code was non-zero, or wrapping up if the code worked.
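The six steps above can be sketched with the standard library alone. This is an illustration of the pipeline as described, not AutoGen's actual implementation: run_first_code_block, CODE_BLOCK, and the file name generated_code.py are hypothetical, and only the first Python block in a message is handled.

```python
import re
import subprocess
import sys
from pathlib import Path

FENCE = "`" * 3  # three backticks, built programmatically to keep this listing readable
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_first_code_block(message: str, work_dir: str, timeout: int = 30) -> str:
    """Detect, extract, execute, and report on the first Python block in a message."""
    match = CODE_BLOCK.search(message)           # step 1: detect
    if match is None:
        return "no code block found"

    script = Path(work_dir) / "generated_code.py"
    script.write_text(match.group(1), encoding="utf-8")  # step 2: write to work_dir

    proc = subprocess.run(                       # step 3: spawn a subprocess
        [sys.executable, script.name],
        cwd=work_dir,
        capture_output=True,                     # step 4: capture stdout and stderr
        text=True,
        timeout=timeout,
    )

    status = "execution succeeded" if proc.returncode == 0 else "execution failed"
    # step 5: format the result; step 6 (sending it back) is left to the caller
    return f"exitcode: {proc.returncode} ({status})\nCode output:\n{proc.stdout}{proc.stderr}"
```

The returned string is what the assistant would see as the next message in the thread.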

Architectural implications:

With local execution (use_docker=False), the code runs in a subprocess that inherits the host process's environment. This means the generated code has access to all environment variables (including API keys), the full filesystem (with the host process's permissions), and the network. In production, this is a significant attack surface.

The standard mitigation is Docker execution (use_docker="python:3.11-slim"), which confines the generated code to an isolated container. The tradeoff is startup latency (several seconds per execution) and the need to pre-install any packages the agent might need in the Docker image.

A second implication is non-determinism. The generated code is not deterministic: different runs can produce different implementations. If you need to reproduce or audit a result, you should log the generated code and its outputs to a persistent store (database, file, or observability platform) on every run.
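One lightweight way to keep that record is an append-only JSON Lines file. The sketch below is an assumption about how you might structure it: log_execution and the record fields are hypothetical, and a real system might write to a database or tracing backend instead.

```python
import hashlib
import json
import time
from pathlib import Path


def log_execution(log_path: str, run_id: str, code: str, exit_code: int, output: str) -> dict:
    """Append one record per code execution so any run can be inspected later."""
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        # The hash makes it cheap to see whether two runs generated the same code.
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "code": code,
        "exit_code": exit_code,
        "output": output,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Comparing code_sha256 across runs of the same task gives a quick measure of how much the generated implementation actually varies.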


Q4: What are the risks of autonomous code execution? How would you mitigate them in a production system?

Interviewer prompt: "Your manager asks: 'Is it safe to deploy this AutoGen agent in production with use_docker=False?' What do you say?"

Strong answer:

I would say: not without additional controls, and ideally not at all — use Docker.

The risks of use_docker=False in production:

Data exposure. The generated code can read any file the process has access to, including .env files, credentials mounted as volumes, database connection strings, and SSH keys.

Destructive operations. The agent can delete files, overwrite configuration, or corrupt data — either by accident (the LLM generated buggy code) or via prompt injection.

Exfiltration. The code can make arbitrary HTTP requests — send your secrets to an external server, call unauthorised APIs, or exfiltrate data.

Resource exhaustion. Without a timeout, the agent could spawn infinite loops, allocate all available memory, or fork-bomb the host.

Mitigation strategies in priority order:

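  1. Use Docker (use_docker="python:3.11-slim"). Isolates the filesystem and process space from the host, so generated code sees only what you mount into the container and not the host's secrets. Network access is still available unless you restrict it separately (see item 4). Container is destroyed after each execution.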

  2. Set execution timeouts ("timeout": 30 in code_execution_config). Kills runaway processes.

  3. Use a dedicated, empty workspace directory that contains no sensitive data and has no write access to anything critical.

  4. Network isolation. If using Docker, configure --network none to block all network access from the container.

  5. Disable code execution entirely if the task does not require it. Use registered tools instead — you control exactly what functions the agent can call.

  6. Human approval before execution (human_input_mode="ALWAYS"). A human reviews every message, including code, before it runs. Impractical for automation but appropriate for high-stakes operations.

  7. Input sanitisation. If the initial message comes from an untrusted source (e.g., a web form), sanitise it to prevent prompt injection attacks that could instruct the agent to exfiltrate data.

The right answer for most production systems is: Docker + timeout + isolated workspace + registered tools for any operations that touch sensitive resources.
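To make the timeout and isolated-workspace items concrete, here is a sketch of what those two controls look like for a local subprocess. It is deliberately not a sandbox: the child can still read the filesystem and use the network, which is exactly why Docker remains item 1. run_restricted is a hypothetical helper, and the minimal environment assumes a POSIX host.

```python
import os
import subprocess
import sys
import tempfile


def run_restricted(code: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run code in a fresh empty directory, with a scrubbed environment and a timeout.

    This reduces accidental exposure of environment variables and kills runaway
    processes. It does NOT isolate the filesystem or the network.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* variables
            cwd=work_dir,                        # dedicated, empty workspace
            env={"PATH": os.defpath},            # host secrets in os.environ are not inherited
            capture_output=True,
            text=True,
            timeout=timeout,                     # raises subprocess.TimeoutExpired and kills the child
        )
```

A caller would catch subprocess.TimeoutExpired and report it back to the agent as a failed execution rather than letting it crash the pipeline.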


Q5: How do you handle agent loops — situations where the conversation does not terminate?

Interviewer prompt: "In production, your AutoGen pipeline gets stuck in a loop. Agents keep talking to each other but never terminate. How do you diagnose and fix it?"

Strong answer:

First, why loops happen:

The most common causes are (1) is_termination_msg never returns True because the LLM forgot to say TERMINATE, (2) agents keep asking each other clarifying questions without making progress, and (3) the task is genuinely unsolvable by the agents as configured.

Diagnosis:

Look at user_proxy.chat_messages[assistant] after the conversation. Read the last 5 messages. Ask:

  • Is the assistant making progress or repeating itself?
  • Is the assistant asking a question the user proxy cannot answer?
  • Did the LLM produce a TERMINATE-like word but with a typo or in a different case?
Python
# Quick diagnostic: print last 5 messages
history = user_proxy.chat_messages[assistant]
for msg in history[-5:]:
    name = msg.get("name", msg["role"])
    content = (msg.get("content") or "")[:200].replace("\n", " ")  # content can be None
    print(f"[{name}]: {content}")

Fixes:

  1. Defence in depth on termination. Always use all three: is_termination_msg, max_consecutive_auto_reply, and max_turns. At least one of these will catch the loop.

  2. Stronger TERMINATE instruction. Put it in the system message very explicitly: "Your FINAL message MUST end with the single word TERMINATE on its own line. Do not omit it."

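  3. Case-insensitive termination check:

    Python
    is_termination_msg=lambda msg: "terminate" in (msg.get("content") or "").lower()

  4. Timeout wrapper. Run initiate_chat in a worker with a deadline of N seconds. Note that thread.join(timeout=N) only stops the caller from waiting: Python cannot forcibly kill a thread, so run the pipeline in a separate process if you need to actually terminate it.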

  5. Progress detection. Write a termination check that also fires if the last 3 messages start with identical text (the agent is spinning). AutoGen calls is_termination_msg with a single message argument, so bind the history with a closure or lambda before passing this function in.

Python
def no_progress_termination(msg: dict, history: list) -> bool:
    content = msg.get("content") or ""  # content can be None
    if "TERMINATE" in content:
        return True
    # Check if last 3 messages are repetitive (simplified: compares the first 100 characters)
    if len(history) >= 3:
        last_3 = [(m.get("content") or "")[:100] for m in history[-3:]]
        if len(set(last_3)) == 1:  # all identical
            return True
    return False

For production systems, I always log every conversation start, every termination event, and the reason for termination. This gives you a paper trail to diagnose loops after the fact.
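A sketch of what that termination logging might look like. The reason labels, termination_reason, and log_termination are hypothetical, and the cap here counts messages in the history, which is an assumption rather than how AutoGen counts turns.

```python
import logging
from typing import List

logger = logging.getLogger("pipeline.termination")


def _text(msg: dict) -> str:
    """Message content as a string; content can be missing or None."""
    return msg.get("content") or ""


def termination_reason(history: List[dict], max_messages: int) -> str:
    """Classify why a finished conversation stopped, based only on its history."""
    if history and "TERMINATE" in _text(history[-1]):
        return "terminate_keyword"
    if len(history) >= 3 and len({_text(m)[:100] for m in history[-3:]}) == 1:
        return "no_progress"      # last three messages start identically
    if len(history) >= max_messages:
        return "message_cap"      # stopped by a hard limit, not by the agents
    return "unknown"


def log_termination(conversation_id: str, history: List[dict], max_messages: int) -> str:
    """Emit one structured log line per finished conversation."""
    reason = termination_reason(history, max_messages)
    logger.info("conversation=%s messages=%d reason=%s", conversation_id, len(history), reason)
    return reason
```

A rising share of message_cap or no_progress outcomes is an early signal that a prompt or model change has made the agents loop.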


Q6: How do you test AutoGen pipelines?

Interviewer prompt: "How would you write automated tests for an AutoGen multi-agent system?"

Strong answer:

Testing AutoGen pipelines has three layers: unit testing individual components, integration testing the conversation flow, and end-to-end validation of outputs.

Layer 1: Unit tests for non-LLM components

Tools, registered functions, and termination conditions can be tested without any LLM:

Python
import pytest

def test_termination_condition_basic():
    from my_pipeline import safe_termination
    assert safe_termination({"content": "Task complete. TERMINATE"}) is True
    assert safe_termination({"content": "Still working..."}) is False
    assert safe_termination({"content": None}) is False
    assert safe_termination({}) is False

def test_stock_price_tool():
    from my_tools import get_stock_price
    result = get_stock_price("AAPL")
    assert "price" in result
    assert result["currency"] == "USD"

    error_result = get_stock_price("INVALID_TICKER")
    assert "error" in error_result
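The first test above imports safe_termination from a my_pipeline module that the lesson does not show. One possible implementation that satisfies those four assertions is sketched below; treating the check as case-insensitive and requiring TERMINATE at the end of the message are assumptions.

```python
def safe_termination(msg: dict) -> bool:
    """Return True only when the message text ends with TERMINATE.

    Handles a missing "content" key and content=None, both of which occur
    in practice (for example on tool-call messages).
    """
    content = msg.get("content")
    if not isinstance(content, str):
        return False
    return content.rstrip().upper().endswith("TERMINATE")
```

Requiring the keyword at the end, rather than anywhere in the text, avoids ending the chat when an agent merely mentions the word mid-sentence.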

Layer 2: Integration tests with mock LLM

Use AutoGen's cache_seed to replay LLM responses deterministically. Set cache_seed=42 during development: AutoGen caches each LLM response on disk and replays it on subsequent runs that send the same request.

Python
llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "..."}],
    "cache_seed": 42,           # same seed = same cached response every run
}

For full LLM mocking (no API calls), replace llm_config with a stub using unittest.mock:

Python
from unittest.mock import patch, MagicMock

def test_pipeline_without_api_call():
    mock_response = MagicMock()
    mock_response.choices = [MagicMock()]
    mock_response.choices[0].message.content = "Here is the code:\n```python\nprint('hello')\n```\nTERMINATE"

    with patch("openai.resources.chat.completions.Completions.create", return_value=mock_response):
        # Run your pipeline; no real API call happens
        ...

Layer 3: End-to-end output validation

For each pipeline, define the expected properties of a successful output:

Python
def test_code_generation_pipeline():
    # Run the pipeline with cache_seed
    user_proxy.initiate_chat(assistant, message="Write a function to sum a list.", max_turns=6)

    history = user_proxy.chat_messages[assistant]

    # Check: conversation terminated cleanly
    last_msg = history[-1].get("content") or ""
    assert "TERMINATE" in last_msg, "Conversation should end with TERMINATE"

    # Check: at least one code block was generated (content can be None, so guard it)
    all_content = " ".join(m.get("content") or "" for m in history)
    assert "```python" in all_content, "Should contain a Python code block"

    # Check: code execution succeeded
    exec_messages = [m for m in history if "exitcode:" in (m.get("content") or "")]
    assert len(exec_messages) > 0, "Code should have been executed"
    assert all("exitcode: 0" in m["content"] for m in exec_messages), "All executions should succeed"

Q7: What are AutoGen's production limitations?

Interviewer prompt: "You are pitching AutoGen to your team for a production use case. What limitations do you need to disclose?"

Strong answer:

1. Non-determinism. Because every response is LLM-generated, two runs of the same workflow can produce different results, take different numbers of turns, and cost different amounts. This is difficult to budget for and makes reproducibility hard.

2. Context window scaling. Every agent call sends the full conversation history to the LLM. A 50-turn conversation with long messages can exceed even a 128k-token context window. You need a strategy for long-running conversations: periodic summarisation, history pruning, or resetting with a summary.

3. Observability is limited out of the box. AutoGen v0.2 prints to stdout. For production, you need to wrap initiate_chat and instrument every message with your logging platform (structured logging, OpenTelemetry, etc.).

4. Cost unpredictability. Because conversations can run for variable numbers of turns, token costs per workflow are non-deterministic. A workflow that usually costs $0.01 can occasionally cost $0.50 if the agent loops. Always set max_consecutive_auto_reply and max_turns to cap costs.

5. Testing is hard. LLM-backed workflows are inherently probabilistic. Regression tests that worked yesterday may fail tomorrow due to model updates, even with the same seed.

6. Error recovery is limited. If an agent produces a wrong result and the conversation ends, restarting from the middle is not natively supported. You would need to implement checkpointing yourself.

7. Concurrency. AutoGen v0.2 does not have native support for running multiple conversations in parallel (within a single Python process). You need asyncio or multiprocessing for concurrency.

8. Model coupling. AutoGen v0.2 is tightly coupled to OpenAI-compatible APIs. Switching to a different provider or model version can change agent behaviour in subtle ways.

For most teams, 1, 4, and 5 are the show-stoppers. The mitigations: cache_seed for development determinism, strict max_turns + max_consecutive_auto_reply for cost control, and a suite of end-to-end tests with cached responses.
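For limitation 2, history pruning is the simplest of the three strategies to show. The sketch below keeps the original task plus the most recent messages that fit a budget. prune_history is a hypothetical helper, and it counts characters as a rough proxy for tokens; a real implementation would use the model's tokenizer.

```python
from typing import Dict, List


def prune_history(history: List[Dict], max_chars: int, keep_first: int = 1) -> List[Dict]:
    """Keep the first keep_first messages plus the newest messages that fit in max_chars."""
    head = history[:keep_first]          # usually the original task description
    tail = history[keep_first:]
    budget = max_chars - sum(len(m.get("content") or "") for m in head)

    kept: List[Dict] = []
    for msg in reversed(tail):           # walk backwards from the newest message
        size = len(msg.get("content") or "")
        if size > budget:
            break                        # stop at the first message that does not fit
        kept.append(msg)
        budget -= size

    return head + kept[::-1]             # restore chronological order
```

Pruning drops information silently, so for long conversations it is usually paired with a summary message that replaces the dropped middle section.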


Q8: Design a multi-agent system for automated document review.

Interviewer prompt: "Design an AutoGen system that automatically reviews a legal contract. The system should extract key clauses, flag risks, summarise findings, and route for human sign-off on flagged items."

Strong answer:

Here is the architectural design:

User submits contract text
        │
        ▼
┌───────────────┐
│  extractor    │  AssistantAgent
│               │  Extracts: parties, dates, key clauses, obligations
└──────┬────────┘
       │ structured JSON of extracted clauses
       ▼
┌───────────────┐
│  risk_analyst │  AssistantAgent
│               │  Flags: unusual clauses, missing standard protections,
│               │  liability caps, indemnification terms
└──────┬────────┘
       │ risk report with severity (HIGH / MEDIUM / LOW)
       ▼
┌───────────────┐
│  summariser   │  AssistantAgent
│               │  Writes executive summary (non-legal language)
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  router       │  Custom speaker selection function
│               │  If HIGH risk flags → route to human_reviewer (TERMINATE mode)
│               │  If no HIGH flags → route to auto_approver
└──────┬────────┘
       │
   ┌───┴──────────────────┐
   │                      │
   ▼                      ▼
human_reviewer         auto_approver
UserProxyAgent         AssistantAgent
TERMINATE mode         Automatically approves
Human sees summary     Sends final decision
and risks, approves    TERMINATE
or rejects

Implementation sketch:

Python
import autogen
import os
import json

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}],
    "temperature": 0,
}

extractor = autogen.AssistantAgent(
    name="extractor",
    llm_config=llm_config,
    system_message="""You are a contract analysis specialist.
    Extract from the provided contract:
    - Party names
    - Contract duration
    - Key obligations for each party
    - Payment terms
    - Termination clauses
    Return as JSON. Then say: Handoff to risk_analyst""",
)

risk_analyst = autogen.AssistantAgent(
    name="risk_analyst",
    llm_config=llm_config,
    system_message="""You are a legal risk analyst.
    Review the extracted contract data and flag risks.
    For each risk, assign severity: HIGH, MEDIUM, or LOW.
    HIGH = potential significant financial or legal exposure.
    Return: a JSON list of {clause, risk, severity}.
    Then say: Handoff to summariser""",
)

summariser = autogen.AssistantAgent(
    name="summariser",
    llm_config=llm_config,
    system_message="""You are a communications specialist.
    Summarise the contract review findings in plain English.
    Audience: business executive, not a lawyer.
    Include: what the contract is, top 3 concerns, recommendation.
    Then say: Handoff to router""",
)

auto_approver = autogen.AssistantAgent(
    name="auto_approver",
    llm_config=llm_config,
    system_message="""You approve contracts with no HIGH-risk flags.
    State: 'Auto-approved. No high-risk items found.'
    Then say TERMINATE.""",
)

# Human reviewer: TERMINATE mode so the human sees the output before ending
human_reviewer = autogen.UserProxyAgent(
    name="human_reviewer",
    human_input_mode="TERMINATE",    # pause for human to review before terminating
    max_consecutive_auto_reply=0,    # do not auto-reply; always ask the human
    is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),  # content can be None
    code_execution_config=False,
)


def contract_router(last_speaker, groupchat):
    """Route to human if HIGH risks found, otherwise auto-approve."""
    agents_by_name = {a.name: a for a in groupchat.agents}
    messages = groupchat.messages

    # First turn: the opening message (the contract) always goes to the extractor
    if len(messages) <= 1:
        return agents_by_name["extractor"]

    # Handoff protocol: agents say "Handoff to X" to route
    last_content = (messages[-1].get("content") or "").lower()
    for name in agents_by_name:
        if f"handoff to {name.lower()}" in last_content:
            return agents_by_name[name]

    # Special case: after summariser, check if there are HIGH risks.
    # Note: a plain substring check is crude; it also matches "HIGH" in the contract itself.
    if last_speaker.name == "summariser":
        all_text = " ".join(m.get("content") or "" for m in messages)
        if "HIGH" in all_text:
            return agents_by_name["human_reviewer"]
        return agents_by_name["auto_approver"]

    # Fallback: hand control to the human rather than restarting the pipeline
    return agents_by_name["human_reviewer"]


group_chat = autogen.GroupChat(
    agents=[human_reviewer, extractor, risk_analyst, summariser, auto_approver],
    messages=[],
    max_round=12,
    speaker_selection_method=contract_router,
)

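manager = autogen.GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
    # End the group chat as soon as any agent says TERMINATE (e.g. auto_approver)
    is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),
)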

# Sample contract text
contract_text = """
SERVICE AGREEMENT between Acme Corp (Client) and TechVendor Ltd (Provider).
Term: 24 months from signing.
Payment: $50,000/month, net 60.
Liability cap: Provider's total liability shall not exceed one month's fees ($50,000).
Indemnification: Client shall indemnify Provider against all third-party claims.
Termination: Either party may terminate with 90 days notice.
IP: All work product is owned by Provider unless separately agreed in writing.
"""

human_reviewer.initiate_chat(
    manager,
    message=f"Please review this contract:\n\n{contract_text}",
)
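The router above decides on HIGH risk with a substring search over the whole conversation. A stricter alternative is to parse the risk analyst's JSON and inspect the severity field. The sketch below assumes the report contains a single JSON list of {clause, risk, severity} objects, as the risk_analyst system message requests; high_risk_items is a hypothetical helper, not part of AutoGen.

```python
import json
from typing import List


def high_risk_items(report_text: str) -> List[dict]:
    """Return the entries whose severity is HIGH from a risk report.

    Raises ValueError if no JSON list can be found or parsed, so the caller
    can fail closed and route to a human.
    """
    start = report_text.find("[")
    end = report_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in risk report")

    items = json.loads(report_text[start:end + 1])  # JSONDecodeError is a ValueError
    return [
        item for item in items
        if isinstance(item, dict) and str(item.get("severity", "")).upper() == "HIGH"
    ]
```

Inside a router, a non-empty result or a ValueError would both send the contract to human_reviewer, so a malformed report never slips through to auto-approval.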

Key design decisions to discuss in the interview:

  • The handoff protocol (agents say "Handoff to X") gives deterministic routing without requiring the LLM to guess
  • HIGH-risk routing to human_reviewer with TERMINATE mode ensures a human reviews before the workflow ends
  • The auto_approver path keeps low-risk contracts fully automated
  • All structured data (extracted clauses, risk flags) flows through the conversation history, making it auditable
  • In production, you would store group_chat.messages to a database for audit trail
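For the audit-trail point, a minimal sketch using SQLite from the standard library. The audit_log table layout and save_audit_trail are assumptions for illustration; the messages argument is expected to be a list of dicts shaped like the conversation history, with "name" or "role" and "content" keys.

```python
import json
import sqlite3
from typing import Iterable


def save_audit_trail(conn: sqlite3.Connection, conversation_id: str, messages: Iterable[dict]) -> int:
    """Persist every message of a conversation, in order, and return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_log (
               conversation_id TEXT NOT NULL,
               seq INTEGER NOT NULL,
               speaker TEXT,
               content TEXT,
               raw_json TEXT NOT NULL,
               PRIMARY KEY (conversation_id, seq)
           )"""
    )
    rows = [
        (conversation_id, seq, m.get("name") or m.get("role"), m.get("content"), json.dumps(m))
        for seq, m in enumerate(messages)
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)
```

Storing the raw message alongside the extracted columns means the trail stays complete even if the message format gains fields later.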

Summary: Key Points to Remember

| Topic | What to say |
|---|---|
| What is AutoGen | Conversation-centric multi-agent framework; every interaction is a message |
| vs LangGraph | AutoGen: iterative/conversational. LangGraph: deterministic/graph |
| Code execution | LLM writes code, UserProxyAgent executes it in subprocess or Docker |
| Security | Never run use_docker=False in production with untrusted input |
| Loops | Defence in depth: is_termination_msg + max_consecutive_auto_reply + max_turns |
| Testing | cache_seed for determinism; mock LLM for unit tests; validate conversation properties |
| Production limits | Cost unpredictability, context window scaling, limited observability |
| System design | Use handoff protocol for routing; log everything; human-in-the-loop for high-risk paths |