System Design: AI Code Review Assistant

Design an AI code review tool that automatically reviews pull requests, from GitHub webhook to LLM reviewer to posted comments. Covers diff parsing, chunking large PRs, and quality control.

Asma Hafeez Khan · May 16, 2026 · 6 min read
System Design · AI Agents · GitHub · Code Review · LLMOps

The Interview Question

"Design an AI code review assistant that automatically reviews GitHub pull requests and posts constructive comments. It should catch bugs, style issues, and security vulnerabilities without being noisy."


Step 1: Clarify Requirements

  • Trigger: On PR opened or updated (GitHub webhook)
  • Languages: Python, TypeScript, C# (configurable per repo)
  • Comment style: Inline comments on specific lines, plus a PR summary
  • Quality bar: Only post comments when confidence is high; no spam
  • Latency: Review should complete within 3 minutes of PR creation
  • Scale: 50 repositories, up to 200 PRs/day
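
The trigger requirement can be sketched as a small predicate over the webhook payload. The `action` and `pull_request.draft` fields come from GitHub's `pull_request` event; the function name and the exact set of actions are assumptions for illustration, and skipping drafts anticipates the noise rules in Step 7.

```python
# Hypothetical helper: decide whether a webhook delivery should start a review.
# "synchronize" is the action GitHub sends when new commits are pushed to a PR.
REVIEW_ACTIONS = {"opened", "reopened", "synchronize", "ready_for_review"}


def should_trigger_review(event_name: str, payload: dict) -> bool:
    """Return True for PR events that should be reviewed."""
    if event_name != "pull_request":
        return False
    if payload.get("action") not in REVIEW_ACTIONS:
        return False
    # Draft PRs are skipped until they are marked ready for review
    pull_request = payload.get("pull_request", {})
    return not pull_request.get("draft", False)
```

The event name would typically come from the `X-GitHub-Event` request header, with the payload being the parsed JSON body.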

Step 2: Back-of-Envelope

200 PRs/day, average 300 lines changed per PR:

  • 200 × 300 lines = 60,000 lines reviewed/day
  • Average comment: 1 comment per 30 lines = ~2,000 comments/day (before filtering)
  • After confidence filtering (50% pass): ~1,000 posted comments/day

Token estimate per PR:

  • Diff context: ~2,000 tokens
  • System prompt + instructions: ~500 tokens
  • Response (comments): ~600 tokens
  • Total: ~3,100 tokens per PR × 200 PRs = 620,000 tokens/day

At GPT-4o pricing, that volume costs on the order of a few dollars per day, which is negligible. The challenge is latency and quality, not cost.
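
The estimate above can be reproduced in a few lines, which makes it easy to rerun with different assumptions. All inputs are the figures from this section; the variable names are illustrative.

```python
# Back-of-envelope inputs taken from the estimates in this section
PRS_PER_DAY = 200
LINES_PER_PR = 300
LINES_PER_COMMENT = 30           # one candidate comment per 30 changed lines
CONFIDENCE_PASS_RATE = 0.5       # share of comments that survive filtering
TOKENS_PER_PR = 2_000 + 500 + 600  # diff + prompt + response

lines_per_day = PRS_PER_DAY * LINES_PER_PR
raw_comments_per_day = lines_per_day // LINES_PER_COMMENT
posted_comments_per_day = int(raw_comments_per_day * CONFIDENCE_PASS_RATE)
tokens_per_day = TOKENS_PER_PR * PRS_PER_DAY

print(lines_per_day, raw_comments_per_day, posted_comments_per_day, tokens_per_day)
# 60000 2000 1000 620000
```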


System Architecture

GitHub PR Event
         │
         ▼ webhook
┌───────────────────┐
│  Webhook          │
│  Receiver         │  ← Validates GitHub signature, enqueues job
└─────────┬─────────┘
          │ async
          ▼
┌───────────────────┐
│  Review Queue     │  ← Azure Service Bus or Redis queue
│  (per-repo        │
│   prioritization) │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Review Worker    │
│  - Fetch diff     │
│  - Chunk diff     │
│  - Run LLM        │
│  - Filter         │
│  - Post comments  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  GitHub API       │  ← POST review comments
└───────────────────┘

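The signature check performed by the webhook receiver can be sketched with the standard library alone. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`; the function name and the way the secret is passed in are assumptions for illustration.

```python
import hashlib
import hmac
from typing import Optional


def verify_github_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True if the X-Hub-Signature-256 header matches the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check has to run against the raw bytes of the request, before any JSON parsing, because re-serialising the payload can change the bytes and invalidate the signature.
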
Step 3: Diff Parsing and Chunking

Large PRs can exceed the context window. The chunking strategy is critical:

Python
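# reviewer/diff_parser.py
from dataclasses import dataclass
import re

# Hunk header: @@ -a,b +c,d @@ where c is the first line of the hunk in the new file
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

@dataclass
class DiffChunk:
    file_path: str
    language: str
    start_line: int  # first new-file line covered by the chunk
    end_line: int    # last new-file line covered by the chunk
    content: str
    added_lines: list[int]  # new-file line numbers of the additions

def parse_diff(raw_diff: str) -> list[DiffChunk]:
    """Parse a unified diff into one reviewable chunk per file."""
    chunks = []
    current_file = None
    current_lines = []
    added_lines = []
    start_line = 0   # new-file line where the first hunk starts
    new_line = 0     # new-file line number of the next diff line
    in_hunk = False

    def flush():
        # Emit the file collected so far; files without hunks (e.g. binary) are skipped
        if current_file and current_lines:
            chunks.append(DiffChunk(
                file_path=current_file,
                language=detect_language(current_file),
                start_line=start_line,
                end_line=max(start_line, new_line - 1),
                content="\n".join(current_lines),
                added_lines=added_lines,
            ))

    for line in raw_diff.split("\n"):
        if line.startswith("diff --git"):
            flush()
            current_file = extract_file_path(line)
            current_lines, added_lines = [], []
            start_line, new_line, in_hunk = 0, 0, False

        elif line.startswith("@@"):
            match = HUNK_HEADER.match(line)
            if match:
                new_line = int(match.group(1))
                if not in_hunk:
                    start_line = new_line
                in_hunk = True
            current_lines.append(line)

        elif in_hunk:
            # File headers (index, ---, +++) come before the first hunk and are skipped
            if line.startswith("+"):
                added_lines.append(new_line)
                new_line += 1
            elif line.startswith(" "):
                new_line += 1  # context advances the new file; removals do not
            current_lines.append(line)

    flush()  # the last file has no following "diff --git" line
    return chunks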


def split_large_diff(chunks: list[DiffChunk], max_tokens: int = 2000) -> list[DiffChunk]:
    """Split chunks that are too large for a single LLM call."""
    result = []
    for chunk in chunks:
        if estimate_tokens(chunk.content) <= max_tokens:
            result.append(chunk)
        else:
            # Split by logical boundaries (function/class definitions)
            sub_chunks = split_at_function_boundaries(chunk, max_tokens)
            result.extend(sub_chunks)
    return result
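
The parser and splitter call a few helpers that the article does not define. A minimal sketch of three of them follows; the extension map, the header regex, and the four-characters-per-token heuristic are assumptions for illustration, not the author's implementation. `split_at_function_boundaries` is left out because it depends on per-language parsing.

```python
import os
import re

# Hypothetical extension map covering the languages named in the requirements
LANGUAGE_BY_EXTENSION = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".cs": "csharp",
}


def detect_language(file_path: str) -> str:
    """Map a file extension to a language name, or 'unknown'."""
    extension = os.path.splitext(file_path)[1].lower()
    return LANGUAGE_BY_EXTENSION.get(extension, "unknown")


def extract_file_path(diff_header: str) -> str:
    """Take the new-side path from a 'diff --git a/<old> b/<new>' line.

    Paths that themselves contain ' b/' are ambiguous in this header and
    would need the '+++ b/<path>' line instead.
    """
    match = re.match(r"^diff --git a/(.+) b/(.+)$", diff_header)
    return match.group(2) if match else ""


def estimate_tokens(text: str) -> int:
    """Rough rule of thumb: about four characters per token."""
    return max(1, len(text) // 4)
```

A real tokenizer gives exact counts, but a cheap estimate is usually enough to decide whether a chunk needs splitting, as long as `max_tokens` leaves some headroom.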

Step 4: The Review LLM Call

Python
# reviewer/llm_reviewer.py
import json

from openai import AsyncAzureOpenAI
from pydantic import BaseModel

from reviewer.diff_parser import DiffChunk

class ReviewComment(BaseModel):
    line: int  # line number in the new version of the file
    severity: str  # "error" | "warning" | "suggestion"
    category: str  # "bug" | "security" | "style" | "performance"
    comment: str
    confidence: float  # 0.0-1.0
    file_path: str = ""  # set by review_chunk; needed when posting to GitHub

class ReviewResult(BaseModel):
    comments: list[ReviewComment]
    summary: str

REVIEW_SYSTEM_PROMPT = """You are an expert code reviewer. Review the provided diff and identify:

1. **Bugs**: Logic errors, off-by-one, null dereference, incorrect conditions
2. **Security**: SQL injection, XSS, hardcoded secrets, insecure deserialization
3. **Performance**: N+1 queries, unnecessary loops, missing indexes
4. **Style**: Naming, complexity, missing error handling

RULES:
- Only comment on ADDED lines (lines starting with +)
- Only post a comment if confidence is above 0.7
- Be specific: reference the exact line and explain why it's a problem
- Suggest the fix, don't just identify the problem
- Do NOT comment on formatting/whitespace
- Maximum 5 comments per file

Respond as JSON matching the ReviewResult schema."""

async def review_chunk(
    chunk: DiffChunk,
    client: AsyncAzureOpenAI,
) -> ReviewResult:
    messages = [
        {
            "role": "system",
            # Append the schema so the model knows which fields to return
            "content": (
                f"{REVIEW_SYSTEM_PROMPT}\n\n"
                f"ReviewResult JSON schema:\n"
                f"{json.dumps(ReviewResult.model_json_schema())}"
            ),
        },
        {
            "role": "user",
            "content": (
                f"File: {chunk.file_path} ({chunk.language})\n"
                f"Starting at line {chunk.start_line}\n\n"
                f"```diff\n{chunk.content}\n```\n\n"
                f"Review this diff. Added lines are: {chunk.added_lines}"
            ),
        },
    ]

    response = await client.chat.completions.create(
        model="gpt-4o",  # on Azure OpenAI this is the deployment name
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.1,
    )

    # Raises pydantic.ValidationError if the model returns malformed output
    result = ReviewResult.model_validate_json(
        response.choices[0].message.content
    )
    # The model reviews one file at a time, so attach the path here
    for comment in result.comments:
        comment.file_path = chunk.file_path
    return result
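
A large PR produces several chunks, and reviewing them one after another eats into the 3-minute latency budget. One option is to run the calls concurrently with a cap on how many are in flight. The sketch below is generic over the review function; `review_all` and `max_concurrency` are hypothetical names, and the default limit of 4 is an arbitrary assumption that would need tuning against the model's rate limits.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

ChunkT = TypeVar("ChunkT")
ResultT = TypeVar("ResultT")


async def review_all(
    chunks: list[ChunkT],
    review_one: Callable[[ChunkT], Awaitable[ResultT]],
    max_concurrency: int = 4,
) -> list[ResultT]:
    """Run review_one over every chunk with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: ChunkT) -> ResultT:
        async with semaphore:
            return await review_one(chunk)

    # gather preserves input order, so results line up with chunks
    return await asyncio.gather(*(guarded(chunk) for chunk in chunks))
```

With the function from this step, a call would look like `await review_all(chunks, lambda c: review_chunk(c, client))`.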

Step 5: Confidence Filtering

Only post comments above the confidence threshold:

Python
# reviewer/filter.py
from reviewer.llm_reviewer import ReviewComment

CONFIDENCE_THRESHOLD = 0.75
MAX_COMMENTS_PER_PR = 20  # avoid spam

def filter_comments(
    all_comments: list[ReviewComment],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> list[ReviewComment]:
    # Drop anything below the confidence threshold
    filtered = [c for c in all_comments if c.confidence >= threshold]

    # Prioritise: errors > warnings > suggestions, then higher confidence first
    priority = {"error": 0, "warning": 1, "suggestion": 2}
    filtered.sort(key=lambda c: (priority.get(c.severity, 3), -c.confidence))

    # Cap to avoid noise
    return filtered[:MAX_COMMENTS_PER_PR]

Step 6: Posting Comments to GitHub

Python
# reviewer/github_client.py
import httpx

from reviewer.llm_reviewer import ReviewComment

class GitHubReviewClient:
    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            base_url="https://api.github.com",
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github.v3+json",
            },
        )

    async def post_review(
        self,
        repo: str,
        pr_number: int,
        commit_sha: str,
        comments: list[ReviewComment],
        summary: str,
    ) -> None:
        # Build review with inline comments
        review_comments = [
            {
                "path": c.file_path,  # each comment must carry its file path
                "line": c.line,
                "side": "RIGHT",  # comment on the new version of the file
                "body": f"**[{c.severity.upper()}]** {c.comment}",
            }
            for c in comments
        ]

        response = await self.client.post(
            f"/repos/{repo}/pulls/{pr_number}/reviews",
            json={
                "commit_id": commit_sha,
                "body": f"## AI Code Review Summary\n\n{summary}",
                "event": "COMMENT",  # Don't approve or request changes
                "comments": review_comments,
            },
        )
        # Surface 4xx/5xx responses instead of dropping the review silently
        response.raise_for_status()
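
An LLM can return a line number that is not part of the diff, and GitHub generally rejects a review containing such a comment with a 422 response. A cheap guard is to drop comments that do not land on an added line before posting. The sketch below assumes each comment carries its file path, as `post_review` already expects; the `Comment` dataclass is a stand-in for `ReviewComment`, and `keep_postable` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Comment:  # stand-in for ReviewComment, reduced to the fields used here
    file_path: str
    line: int
    comment: str


def keep_postable(
    comments: list[Comment],
    added_lines_by_file: dict[str, set[int]],
) -> list[Comment]:
    """Keep only comments that point at a line the PR actually added."""
    return [
        c for c in comments
        if c.line in added_lines_by_file.get(c.file_path, set())
    ]
```

The `added_lines_by_file` mapping can be built from the parsed chunks, for example `{chunk.file_path: set(chunk.added_lines) for chunk in chunks}` when there is one chunk per file.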

Step 7: Avoiding Noise

The biggest failure mode for code review bots is being too noisy. Engineers mute them.

Rules to reduce noise:

  1. Confidence threshold 0.75+: Only post when reasonably certain
  2. Max 20 comments per PR: Hard cap, prioritise errors
  3. Deduplication: Don't comment on the same pattern twice in one PR
  4. Suppress on draft PRs: Only review when PR is marked ready
  5. Ignore generated files: Skip *.generated.cs, package-lock.json, migrations
  6. Learn from dismissals: If engineers dismiss comments, reduce confidence for that category
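
Python
import fnmatch
from pathlib import PurePosixPath

IGNORED_PATHS = [
    "*.generated.*",
    "package-lock.json",
    "yarn.lock",
    "migrations/",
    "*.min.js",
]

def should_review_file(file_path: str) -> bool:
    path = PurePosixPath(file_path)
    for pattern in IGNORED_PATHS:
        if pattern.endswith("/"):
            # Directory pattern: skip anything under a directory with that name
            if pattern.rstrip("/") in path.parts[:-1]:
                return False
        elif fnmatch.fnmatch(path.name, pattern):
            # File pattern: match the file name so nested paths are covered too
            return False
    return True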
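
Rule 3 (deduplication) can be sketched by grouping comments on a normalised "pattern key" and keeping the highest-confidence comment per group. What counts as the same pattern is a judgment call; the key used here (category plus comment text with backticked identifiers and numbers stripped) is an assumption for illustration, and the `Comment` dataclass is a stand-in for `ReviewComment`.

```python
import re
from dataclasses import dataclass


@dataclass
class Comment:  # stand-in for ReviewComment, reduced to the fields used here
    file_path: str
    line: int
    category: str
    comment: str
    confidence: float


def pattern_key(c: Comment) -> tuple[str, str]:
    """Normalise a comment so that repeats of the same finding share a key."""
    # Drop backticked identifiers and numbers, which vary between occurrences
    text = re.sub(r"`[^`]*`|\d+", "", c.comment.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return (c.category, text)


def deduplicate(comments: list[Comment]) -> list[Comment]:
    """Keep one comment per pattern, preferring the highest confidence."""
    best: dict[tuple[str, str], Comment] = {}
    for c in comments:
        key = pattern_key(c)
        if key not in best or c.confidence > best[key].confidence:
            best[key] = c
    return list(best.values())
```

Exact-text keys miss paraphrased duplicates; comparing embeddings of the comment text is a heavier alternative if that turns out to matter in practice.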

Step 8: Quality Metrics

Track these to know if the bot is useful:

| Metric | Target | How to Measure |
|---|---|---|
| Comment acceptance rate | above 30% | GitHub thumbs-up reactions |
| Dismissal rate | under 20% | Dismissed review comments |
| False positive rate | under 15% | Manual audit of 50 comments/week |
| Time to review | under 3 minutes | Webhook received → comments posted |
| Bug catch rate | measure over time | Bugs found in review vs merged |
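
As a rough sketch of how the first two metrics could be computed from collected outcomes: the `CommentOutcome` record and how it gets populated (reactions, dismissals) are assumptions for illustration, while the targets are the ones from the table.

```python
from dataclasses import dataclass


@dataclass
class CommentOutcome:  # hypothetical record, one per posted comment
    accepted: bool   # e.g. the comment received a thumbs-up reaction
    dismissed: bool  # e.g. the review comment was dismissed


def quality_metrics(outcomes: list[CommentOutcome]) -> dict[str, float]:
    """Compute acceptance and dismissal rates over posted comments."""
    total = len(outcomes)
    if total == 0:
        return {"acceptance_rate": 0.0, "dismissal_rate": 0.0}
    return {
        "acceptance_rate": sum(o.accepted for o in outcomes) / total,
        "dismissal_rate": sum(o.dismissed for o in outcomes) / total,
    }


def meets_targets(metrics: dict[str, float]) -> bool:
    """Targets from the table: acceptance above 30%, dismissal under 20%."""
    return metrics["acceptance_rate"] > 0.30 and metrics["dismissal_rate"] < 0.20
```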


MVP vs Production

MVP (1 week):

  • Webhook receiver (FastAPI)
  • Parse diff, call GPT-4o with full diff (no chunking)
  • Post all comments above 0.7 confidence
  • One repo, one language

Production:

  • Chunking for large PRs
  • Per-language prompts
  • Comment deduplication
  • Async queue (handle traffic spikes)
  • Dashboard: acceptance rate, dismissal rate, latency
  • Configuration per repo (severity thresholds, ignored paths)
