Learnixo
Back to blog
AI Systemsintermediate

Code Execution: Agents That Write and Run Code

AutoGen's code generation and execution pipeline: LocalCommandLineCodeExecutor vs DockerCommandLineCodeExecutor, security implications, and a real data analysis example.

Asma Hafeez KhanMay 15, 20268 min read
AutoGenAI AgentsMulti-AgentPython
Share:𝕏

AutoGen's Most Powerful Feature

The ability to write and execute code autonomously is what sets AutoGen apart from most agent frameworks. An AutoGen agent does not just suggest code — it writes it, runs it, sees the output, and revises based on what it observed. This self-correcting loop turns a language model into a genuine computational problem-solver.

This power comes with real risks. By the end of this lesson you will know both sides: how to use code execution effectively and how to do so without opening serious security holes.


How Code Execution Works

When AssistantAgent generates a response containing a fenced code block, UserProxyAgent detects it, extracts the code, writes it to a temporary file, executes it in a subprocess, captures stdout and stderr, and sends the result back as the next message.

AssistantAgent generates:
─────────────────────────────────
  "Here is the analysis:
   ```python
   import pandas as pd
   df = pd.read_csv('data.csv')
   print(df.describe())
   ```"
─────────────────────────────────
                │
                ▼
UserProxyAgent detects code block
  → writes to: coding_workspace/tmp_code_abc123.py
  → executes:  python coding_workspace/tmp_code_abc123.py
  → captures:  stdout + stderr
                │
                ▼
UserProxyAgent sends back:
─────────────────────────────────
  "exitcode: 0 (execution succeeded)
   Code output:
          col1  col2  col3
   count   100   100   100
   mean    4.5   3.2   7.1
   ..."
─────────────────────────────────
                │
                ▼
AssistantAgent reads result, continues analysis

The Two Executor Types

AutoGen v0.2 provides two code executor implementations:

1. LocalCommandLineCodeExecutor

Runs code directly in the host system's shell. Fast and simple, but unrestricted.

Python
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
    code_execution_config={
        "work_dir": "coding_workspace",   # where code files are saved
        "use_docker": False,               # use local execution
        "timeout": 60,                     # kill process after 60 seconds
        "last_n_messages": 3,              # scan last N messages for code blocks
    },
)

Pros: No setup required, fastest execution
Cons: Agent has full access to the filesystem, network, and installed packages

2. DockerCommandLineCodeExecutor

Runs code inside a Docker container. Provides filesystem and network isolation.

Python
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
    code_execution_config={
        "work_dir": "coding_workspace",
        "use_docker": "python:3.11-slim",  # Docker image to use
        "timeout": 60,
    },
)

Pros: Isolated filesystem, no access to host secrets, no persistent state between runs
Cons: Requires Docker installed, slower startup, no access to host-side packages

Executor Comparison Table

| Feature | LocalCommandLineCodeExecutor | DockerCommandLineCodeExecutor | |---|---|---| | Filesystem access | Full host filesystem | Isolated container only | | Network access | Full host network | Configurable (can be disabled) | | Installed packages | Whatever is on host | Only what is in the image | | Startup time | Immediate | Several seconds | | Suitable for production | Only with strict review | Yes, with proper image | | Works on Windows | Yes | Yes (with Docker Desktop) | | Can access local files | Yes | Only if volume-mounted |


Security Implications of Code Execution

This is critical to understand before deploying AutoGen in any real environment.

When use_docker=False, the generated code runs with the same permissions as your Python process. This means the agent can:

  • Read any file your process can read (including .env files, credentials, SSH keys)
  • Write to any location your process has write access to
  • Make network requests to any host
  • Install packages with pip install
  • Delete files
  • Spawn subprocesses

Concrete Risk Example

Python
# An LLM could generate code like this (either by mistake or via prompt injection):
import os
import shutil

# Read environment variables (including secrets)
print(os.environ.get("OPENAI_API_KEY"))
print(os.environ.get("DATABASE_URL"))

# Or delete the workspace
shutil.rmtree(".")

If this code is in a code block and use_docker=False, AutoGen will execute it.

Mitigation Strategies

Strategy 1: Use Docker in production (strongly recommended)

Python
code_execution_config={
    "use_docker": "python:3.11-slim",
    "timeout": 30,
}

Strategy 2: Set strict timeouts to prevent runaway processes

Python
code_execution_config={
    "use_docker": False,
    "timeout": 10,   # kill after 10 seconds  prevents infinite loops
}

Strategy 3: Use a dedicated working directory with no sensitive files

Python
import os
os.makedirs("sandboxed_workspace", exist_ok=True)

code_execution_config={
    "work_dir": "sandboxed_workspace",   # empty directory with no secrets
    "use_docker": False,
    "timeout": 30,
}

Strategy 4: Disable code execution and use registered tools instead

Python
# For sensitive environments, turn off code execution entirely
code_execution_config=False
# Then register only approved tools via @register_for_execution

Strategy 5: Human approval before execution (human_input_mode="ALWAYS")

Python
user_proxy = autogen.UserProxyAgent(
    human_input_mode="ALWAYS",   # human reviews every message including code
    ...
)

Real Example: Data Analysis Agent

This is a complete, real-world example. The agent analyses a CSV file autonomously — it reads the data, computes statistics, and generates a text summary report.

Python
import autogen
import os
import pandas as pd

# Create sample data for the agent to analyse
sample_data = """date,product,region,revenue,units
2026-01-05,Widget Pro,North,12500,125
2026-01-12,Gadget Lite,South,4200,84
2026-01-19,Widget Pro,East,8900,89
2026-02-03,Widget Pro,North,14200,142
2026-02-11,Gadget Lite,West,3800,76
2026-02-18,Widget Pro,South,11300,113
2026-03-02,Gadget Lite,North,5100,102
2026-03-09,Widget Pro,East,9600,96
2026-03-16,Widget Pro,West,16800,168
2026-03-23,Gadget Lite,South,4700,94
"""

# Write sample data to the workspace
os.makedirs("data_workspace", exist_ok=True)
with open("data_workspace/sales.csv", "w") as f:
    f.write(sample_data)

# Configure AutoGen
llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}],
    "temperature": 0,
}

analyst = autogen.AssistantAgent(
    name="data_analyst",
    llm_config=llm_config,
    system_message="""You are a data analyst. When given a CSV file to analyse:

    1. Read the file using pandas
    2. Check the shape, columns, and data types
    3. Compute revenue totals by product
    4. Compute revenue totals by region
    5. Find the top-performing month
    6. Print a clean summary report

    The file is at: data_workspace/sales.csv
    Use print() statements to show your results clearly.
    After the analysis is complete and you've confirmed it ran successfully, say TERMINATE.
    """,
)

executor = autogen.UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=8,
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
    code_execution_config={
        "work_dir": "data_workspace",
        "use_docker": False,
        "timeout": 30,
        "last_n_messages": 3,
    },
)

executor.initiate_chat(
    analyst,
    message="Please analyse the sales data and produce a summary report.",
)

Expected Generated Code

The analyst will generate something like:

Python
import pandas as pd

# Load the data
df = pd.read_csv("sales.csv")

print("=== SALES DATA ANALYSIS REPORT ===\n")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print()

# Revenue by product
print("--- Revenue by Product ---")
product_revenue = df.groupby("product")["revenue"].agg(["sum", "count", "mean"])
product_revenue.columns = ["Total Revenue", "Transactions", "Avg Revenue"]
print(product_revenue.to_string())
print()

# Revenue by region
print("--- Revenue by Region ---")
region_revenue = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
for region, rev in region_revenue.items():
    print(f"  {region}: ${rev:,.0f}")
print()

# Monthly totals
print("--- Monthly Revenue ---")
df["month"] = pd.to_datetime(df["date"]).dt.to_period("M")
monthly = df.groupby("month")["revenue"].sum().sort_values(ascending=False)
for month, rev in monthly.items():
    print(f"  {month}: ${rev:,.0f}")
print()

top_month = monthly.index[0]
print(f"Top month: {top_month} (${monthly.iloc[0]:,.0f})")
print()
print("=== END OF REPORT ===")

Expected Execution Output

exitcode: 0 (execution succeeded)
Code output:
=== SALES DATA ANALYSIS REPORT ===

Shape: 10 rows x 5 columns
Columns: ['date', 'product', 'region', 'revenue', 'units']
Date range: 2026-01-05 to 2026-03-23

--- Revenue by Product ---
              Total Revenue  Transactions  Avg Revenue
product
Gadget Lite        17800.0             4       4450.0
Widget Pro         73300.0             6      12216.7

--- Revenue by Region ---
  North: 26700
  East: 18500
  West: 20600
  South: 20200

--- Monthly Revenue ---
  2026-03: 36200
  2026-02: 29300
  2026-01: 25600

Top month: 2026-03 ($36,200)

=== END OF REPORT ===

Controlling Which Code Blocks Are Executed

By default, AutoGen executes the last code block found in the most recent last_n_messages messages. You can control this:

Python
code_execution_config={
    "work_dir": "workspace",
    "use_docker": False,
    "timeout": 30,
    "last_n_messages": 1,      # only look in the very last message
                                # prevents re-executing old code blocks
}

Setting last_n_messages: 1 is often safer — it prevents AutoGen from re-executing code from earlier in the conversation if the agent happens to quote it.


Disabling Code Execution When You Don't Need It

If your workflow is purely about tool calls or chat without code, set code_execution_config=False:

Python
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
    code_execution_config=False,   # no code execution
)

This prevents any accidental code execution if the assistant generates code blocks during a non-coding task.


Checking Code Execution Results Programmatically

Python
import re

def extract_execution_results(history: list) -> list[dict]:
    """Parse all code execution results from a conversation history."""
    results = []
    for msg in history:
        content = msg.get("content", "")
        if not isinstance(content, str):
            continue
        if "exitcode:" not in content:
            continue

        exit_code_match = re.search(r"exitcode: (\d+)", content)
        output_match = re.search(r"Code output:\n(.*?)$", content, re.DOTALL)

        results.append({
            "exit_code": int(exit_code_match.group(1)) if exit_code_match else -1,
            "output": output_match.group(1).strip() if output_match else "",
            "succeeded": exit_code_match and exit_code_match.group(1) == "0",
        })

    return results


# After conversation
history = executor.chat_messages[analyst]
exec_results = extract_execution_results(history)

print(f"Total code executions: {len(exec_results)}")
print(f"Successful: {sum(1 for r in exec_results if r['succeeded'])}")
print(f"Failed: {sum(1 for r in exec_results if not r['succeeded'])}")

Summary

  • AutoGen extracts code blocks from assistant messages and executes them in a subprocess
  • LocalCommandLineCodeExecutor is fast but unrestricted — use only in development
  • DockerCommandLineCodeExecutor provides isolation — use in production
  • The biggest security risk: agents can read secrets, write files, and make network requests
  • Always set a timeout, use Docker in production, and keep the workspace free of sensitive data
  • Set code_execution_config=False to disable code execution entirely when not needed

Next: we tackle human input mode in depth — how to design workflows that mix automation with human approval.

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.