System Design: AI Code Review Assistant

The Interview Question

"Design an AI code review assistant that automatically reviews GitHub pull requests and posts constructive comments. It should catch bugs, style issues, and security vulnerabilities without being noisy."

Step 1: Clarify Requirements

Trigger: On PR opened or updated (GitHub webhook)
Languages: Python, TypeScript, C# (configurable per repo)
Comment style: Inline comments on specific lines, plus a PR summary
Quality bar: Only post comments when confidence is high — no spam
Latency: Review should complete within 3 minutes of PR creation
Scale: 50 repositories, up to 200 PRs/day

Step 2: Back-of-Envelope

200 PRs/day, average 300 lines changed per PR:

200 × 300 lines = 60,000 lines reviewed/day
Raw comment rate: 1 comment per 30 lines = 2,000 comments/day (before filtering)
After confidence filtering (50% pass): ~1,000 posted comments/day

Token estimate per PR:

Diff context: ~2,000 tokens
System prompt + instructions: ~500 tokens
Response (comments): ~600 tokens
Total: ~3,100 tokens per PR × 200 PRs = 620,000 tokens/day

At roughly 620,000 tokens/day, the cost at GPT-4o pricing is negligible. The challenge is latency and quality, not cost.
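The arithmetic above can be reproduced in a few lines, which makes it easy to rerun when the assumptions change. This is a minimal sketch: the constants are copied from the estimate, while the function names and the `usd_per_million_tokens` parameter are illustrative. No real price is assumed; the caller supplies one.

```python
# Back-of-envelope numbers copied from the estimate above.
PRS_PER_DAY = 200
LINES_PER_PR = 300
LINES_PER_COMMENT = 30
FILTER_PASS_RATE = 0.5
TOKENS_PER_PR = {"diff": 2_000, "system_prompt": 500, "response": 600}


def daily_estimate() -> dict[str, int]:
    """Return the daily volume figures derived from the constants above."""
    lines = PRS_PER_DAY * LINES_PER_PR
    raw_comments = lines // LINES_PER_COMMENT
    tokens_per_pr = sum(TOKENS_PER_PR.values())
    return {
        "lines_per_day": lines,
        "raw_comments_per_day": raw_comments,
        "posted_comments_per_day": int(raw_comments * FILTER_PASS_RATE),
        "tokens_per_pr": tokens_per_pr,
        "tokens_per_day": tokens_per_pr * PRS_PER_DAY,
    }


def daily_cost_usd(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    """The price is an input, not a fact: look up the current rate for your model."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens
```

Large PRs that are split into several chunks repeat the system prompt per call, so the real token count will sit somewhat above this estimate.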

System Architecture

GitHub PR Event
     │
     ▼ webhook
┌─────────────────┐
│  Webhook        │
│  Receiver       │  ← Validates GitHub signature, enqueues job
└────────┬────────┘
         │ async
         ▼
┌─────────────────┐
│  Review Queue   │  ← Azure Service Bus or Redis queue
│  (per-repo      │
│  prioritization)│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Review Worker  │
│  - Fetch diff   │
│  - Chunk diff   │
│  - Run LLM      │
│  - Filter       │
│  - Post comments│
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  GitHub API     │  ← POST review comments
└─────────────────┘
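The webhook receiver's two jobs, validating the signature and deciding whether to enqueue, can be sketched with the standard library alone. GitHub signs the raw request body with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header, and the event type in `X-GitHub-Event`. The function names and the exact set of accepted actions below are assumptions for illustration.

```python
import hashlib
import hmac
from typing import Optional

# pull_request actions that should trigger a review (an assumed policy)
REVIEW_ACTIONS = {"opened", "synchronize", "reopened", "ready_for_review"}


def is_valid_signature(secret: bytes, body: bytes, signature_header: Optional[str]) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def should_enqueue(event: str, payload: dict) -> bool:
    """Enqueue only non-draft pull requests that were opened or updated."""
    if event != "pull_request":
        return False
    if payload.get("action") not in REVIEW_ACTIONS:
        return False
    return not payload.get("pull_request", {}).get("draft", False)
```

The receiver should verify the signature against the raw bytes before parsing JSON, then return quickly and leave the review itself to the queue worker.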

Step 3: Diff Parsing and Chunking

Large PRs can exceed the context window. The chunking strategy is critical:

Python

# reviewer/diff_parser.py
from dataclasses import dataclass
import re

@dataclass
class DiffChunk:
    file_path: str
    language: str
    start_line: int
    end_line: int
    content: str
    added_lines: list[int]  # which lines are additions

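def parse_diff(raw_diff: str) -> list[DiffChunk]:
    """Parse a unified diff into one reviewable chunk per file."""
    chunks: list[DiffChunk] = []
    current_file = None
    current_lines: list[str] = []
    added_lines: list[int] = []
    first_line = 0  # new-file line number of the first hunk
    new_line = 0    # new-file line number of the next diff line

    def flush() -> None:
        # Emit the file collected so far, if it has any hunk content.
        if current_file and current_lines:
            chunks.append(DiffChunk(
                file_path=current_file,
                language=detect_language(current_file),
                start_line=first_line,
                end_line=max(first_line, new_line - 1),
                content="\n".join(current_lines),
                added_lines=added_lines,
            ))

    for line in raw_diff.splitlines():
        if line.startswith("diff --git"):
            flush()
            current_file = extract_file_path(line)
            current_lines, added_lines = [], []
            first_line = new_line = 0
        elif line.startswith("@@"):
            # Hunk header: @@ -a,b +c,d @@ where c is the first new-file line
            match = re.search(r"\+(\d+)", line)
            if match:
                new_line = int(match.group(1))
                first_line = first_line or new_line
            current_lines.append(line)
        elif not current_lines:
            continue  # file header lines (index, ---, +++) before the first hunk
        elif line.startswith("+"):
            added_lines.append(new_line)  # record the real new-file line number
            current_lines.append(line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            # Removed line or "\ No newline at end of file": no new-file line
            current_lines.append(line)
        else:
            current_lines.append(line)  # context line
            new_line += 1

    flush()  # the last file has no following "diff --git" line
    return chunks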


def split_large_diff(chunks: list[DiffChunk], max_tokens: int = 2000) -> list[DiffChunk]:
    """Split chunks that are too large for a single LLM call."""
    result = []
    for chunk in chunks:
        if estimate_tokens(chunk.content) <= max_tokens:
            result.append(chunk)
        else:
            # Split by logical boundaries (function/class definitions)
            sub_chunks = split_at_function_boundaries(chunk, max_tokens)
            result.extend(sub_chunks)
    return result
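The parser and splitter call `detect_language`, `extract_file_path`, `estimate_tokens`, and `split_at_function_boundaries` without defining them. A minimal sketch of the first three follows, under stated assumptions: the extension map covers only the three languages from the requirements, the path regex does not handle quoted paths containing spaces, and the four-characters-per-token ratio is a rough rule of thumb rather than a real tokenizer. `split_by_hunks` is a simpler stand-in for function-boundary splitting, not an implementation of it: it packs whole hunks greedily and leaves a single oversized hunk intact.

```python
import re
from pathlib import PurePosixPath

# Assumed mapping; extend per repository configuration.
LANGUAGE_BY_EXTENSION = {".py": "python", ".ts": "typescript", ".tsx": "typescript", ".cs": "csharp"}


def detect_language(file_path: str) -> str:
    return LANGUAGE_BY_EXTENSION.get(PurePosixPath(file_path).suffix.lower(), "unknown")


def extract_file_path(diff_header: str) -> str:
    """'diff --git a/src/app.py b/src/app.py' -> 'src/app.py'."""
    match = re.search(r" b/(.+)$", diff_header)
    return match.group(1) if match else ""


def estimate_tokens(text: str) -> int:
    # Heuristic only: roughly 4 characters per token.
    return max(1, len(text) // 4)


def split_by_hunks(content: str, max_tokens: int = 2000) -> list[str]:
    """Greedily pack whole hunks into pieces of at most max_tokens (estimated)."""
    hunks: list[list[str]] = []
    for line in content.splitlines():
        if line.startswith("@@") or not hunks:
            hunks.append([])
        hunks[-1].append(line)

    pieces: list[str] = []
    current: list[str] = []
    for hunk in hunks:
        candidate = current + hunk
        if current and estimate_tokens("\n".join(candidate)) > max_tokens:
            pieces.append("\n".join(current))
            current = hunk
        else:
            current = candidate
    if current:
        pieces.append("\n".join(current))
    return pieces
```

Splitting on hunk boundaries keeps each piece a valid diff fragment, but it can separate a function signature from its body, which is why the article prefers logical boundaries.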

Step 4: The Review LLM Call

Python

# reviewer/llm_reviewer.py
from openai import AsyncAzureOpenAI
from pydantic import BaseModel

from reviewer.diff_parser import DiffChunk

class ReviewComment(BaseModel):
    line: int
    severity: str  # "error" | "warning" | "suggestion"
    category: str  # "bug" | "security" | "style" | "performance"
    comment: str
    confidence: float  # 0.0-1.0
    file_path: str = ""  # not returned by the LLM; set from the chunk in review_chunk

class ReviewResult(BaseModel):
    comments: list[ReviewComment]
    summary: str

REVIEW_SYSTEM_PROMPT = """You are an expert code reviewer. Review the provided diff and identify:

1. **Bugs**: Logic errors, off-by-one, null dereference, incorrect conditions
2. **Security**: SQL injection, XSS, hardcoded secrets, insecure deserialization
3. **Performance**: N+1 queries, unnecessary loops, missing indexes
4. **Style**: Naming, complexity, missing error handling

RULES:
- Only comment on ADDED lines (lines starting with +)
- Only post a comment if confidence is above 0.7
- Be specific: reference the exact line and explain why it's a problem
- Suggest the fix, don't just identify the problem
- Do NOT comment on formatting/whitespace
- Maximum 5 comments per file

Respond as JSON matching the ReviewResult schema:
{"comments": [{"line": <int>, "severity": "error|warning|suggestion", "category": "bug|security|style|performance", "comment": "<text>", "confidence": <0.0-1.0>}], "summary": "<text>"}"""

async def review_chunk(
    chunk: DiffChunk,
    client: AsyncAzureOpenAI,
) -> ReviewResult:
    messages = [
        {"role": "system", "content": REVIEW_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                f"File: {chunk.file_path} ({chunk.language})\n"
                f"Starting at line {chunk.start_line}\n\n"
                f"```diff\n{chunk.content}\n```\n\n"
                f"Review this diff. Added lines are: {chunk.added_lines}"
            ),
        },
    ]

    response = await client.chat.completions.create(
        model="gpt-4o",  # on Azure OpenAI this is the deployment name
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.1,
    )

    result = ReviewResult.model_validate_json(
        response.choices[0].message.content
    )
    # The model sees one file per call, so attach the path here;
    # the GitHub client needs it to place inline comments.
    for comment in result.comments:
        comment.file_path = chunk.file_path
    return result
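With a 3-minute latency target, a PR that splits into many chunks is best reviewed concurrently rather than one call at a time. The sketch below bounds concurrency with a semaphore and drops chunks that fail or time out, so one bad call does not sink the whole review. `review_all`, `max_concurrency`, and `timeout_s` are hypothetical names and defaults; `review_fn` stands in for a one-argument wrapper around `review_chunk`.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def review_all(
    chunks: list[T],
    review_fn: Callable[[T], Awaitable[R]],
    max_concurrency: int = 4,
    timeout_s: float = 60.0,
) -> list[R]:
    """Review chunks concurrently; skip chunks whose review fails or times out."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: T):
        async with semaphore:
            try:
                return await asyncio.wait_for(review_fn(chunk), timeout=timeout_s)
            except Exception:
                # A real worker would log the failure here.
                return None

    # gather preserves input order, so results line up with chunks.
    results = await asyncio.gather(*(guarded(c) for c in chunks))
    return [r for r in results if r is not None]
```

A worker could call it as `await review_all(chunks, lambda c: review_chunk(c, client))` and then merge the `comments` lists before filtering.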

Step 5: Confidence Filtering

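Only post comments at or above the confidence threshold. The check runs in code because the prompt's 0.7 instruction is a request the model may not follow: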

Python

# reviewer/filter.py
from reviewer.llm_reviewer import ReviewComment

CONFIDENCE_THRESHOLD = 0.75
MAX_COMMENTS_PER_PR = 20  # avoid spam

def filter_comments(
    all_comments: list[ReviewComment],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> list[ReviewComment]:
    # Filter by confidence
    filtered = [c for c in all_comments if c.confidence >= threshold]

    # Prioritise: errors > warnings > suggestions
    priority = {"error": 0, "warning": 1, "suggestion": 2}
    filtered.sort(key=lambda c: priority.get(c.severity, 3))

    # Cap to avoid noise
    return filtered[:MAX_COMMENTS_PER_PR]

Step 6: Posting Comments to GitHub

Python

# reviewer/github_client.py
import httpx

from reviewer.llm_reviewer import ReviewComment

class GitHubReviewClient:
    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            base_url="https://api.github.com",
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github.v3+json",
            },
        )

    async def post_review(
        self,
        repo: str,
        pr_number: int,
        commit_sha: str,
        comments: list[ReviewComment],
        summary: str,
    ):
        # Build review with inline comments
        review_comments = [
            {
                "path": c.file_path,
                "line": c.line,
                "body": f"**[{c.severity.upper()}]** {c.comment}",
            }
            for c in comments
        ]

        response = await self.client.post(
            f"/repos/{repo}/pulls/{pr_number}/reviews",
            json={
                "commit_id": commit_sha,
                "body": f"## AI Code Review Summary\n\n{summary}",
                "event": "COMMENT",  # Don't approve or request changes
                "comments": review_comments,
            },
        )
        # Surface API failures (bad token, invalid line) instead of dropping them
        response.raise_for_status()
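One practical wrinkle: as far as I know, GitHub rejects a review whose inline comments point at lines that are not part of the diff, and a single bad line fails the whole request. A hedged sketch of a payload builder that keeps only comments on added lines and folds the rest into the summary body is shown below. `build_review_payload` and `added_lines_by_file` are hypothetical names, and comments are plain dicts here to keep the example self-contained.

```python
def build_review_payload(
    commit_sha: str,
    summary: str,
    comments: list[dict],
    added_lines_by_file: dict[str, set[int]],
) -> dict:
    """Build the JSON body for POST /repos/{repo}/pulls/{n}/reviews."""
    inline: list[dict] = []
    overflow: list[str] = []
    for c in comments:
        if c["line"] in added_lines_by_file.get(c["file_path"], set()):
            inline.append({
                "path": c["file_path"],
                "line": c["line"],
                "side": "RIGHT",  # comment on the new version of the file
                "body": f"**[{c['severity'].upper()}]** {c['comment']}",
            })
        else:
            # Keep the finding, but move it into the summary text.
            overflow.append(f"- `{c['file_path']}:{c['line']}` {c['comment']}")

    body = f"## AI Code Review Summary\n\n{summary}"
    if overflow:
        body += "\n\nNot attached inline (line is outside the diff):\n" + "\n".join(overflow)
    return {"commit_id": commit_sha, "body": body, "event": "COMMENT", "comments": inline}
```

The `added_lines_by_file` map can be built from `DiffChunk.added_lines` while parsing, so no extra API call is needed.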

Step 7: Avoiding Noise

The biggest failure mode for code review bots is being too noisy. Engineers mute them.

Rules to reduce noise:

Confidence threshold 0.75+: Only post when reasonably certain
Max 20 comments per PR: Hard cap, prioritise errors
Deduplication: Don't comment on the same pattern twice in one PR
Suppress on draft PRs: Only review when PR is marked ready
Ignore generated files: Skip *.generated.cs, package-lock.json, migrations
Learn from dismissals: If engineers dismiss comments, reduce confidence for that category

Python

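import fnmatch
from pathlib import PurePosixPath

# Matched against the file name only, so lock files in subdirectories are skipped too
IGNORED_PATHS = [
    "*.generated.*",
    "package-lock.json",
    "yarn.lock",
    "*.min.js",
]
# Matched against every directory component of the path
IGNORED_DIRS = {"migrations"}

def should_review_file(file_path: str) -> bool:
    path = PurePosixPath(file_path)
    if IGNORED_DIRS.intersection(path.parts[:-1]):
        return False
    return not any(
        fnmatch.fnmatch(path.name, pattern)
        for pattern in IGNORED_PATHS
    )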
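The deduplication rule ("don't comment on the same pattern twice in one PR") needs some definition of "same pattern". One possible sketch, and only a heuristic, is to key each comment on its category plus a normalised form of its text, with quoted identifiers and numbers masked out, and keep the highest-confidence comment per key. The `Comment` dataclass is a self-contained stand-in for `ReviewComment`, and `deduplicate` is a hypothetical helper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    file_path: str
    line: int
    category: str
    comment: str
    confidence: float


def _pattern_key(c: Comment) -> tuple[str, str]:
    # Mask identifiers in backticks or quotes so the same issue on
    # different variables collapses to one key.
    text = re.sub(r"`[^`]*`|'[^']*'|\"[^\"]*\"", "<id>", c.comment.lower())
    text = re.sub(r"\d+", "<n>", text)
    return (c.category, re.sub(r"\s+", " ", text).strip())


def deduplicate(comments: list[Comment]) -> list[Comment]:
    """Keep one comment per pattern: the one with the highest confidence."""
    best: dict[tuple[str, str], Comment] = {}
    for c in comments:
        key = _pattern_key(c)
        if key not in best or c.confidence > best[key].confidence:
            best[key] = c
    return list(best.values())
```

Text normalisation will miss paraphrases of the same finding; comparing embeddings is a heavier alternative if that becomes a problem in practice.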

Step 8: Quality Metrics

Track these to know if the bot is useful:

| Metric | Target | How to Measure |
|---|---|---|
| Comment acceptance rate | above 30% | GitHub thumbs-up reactions |
| Dismissal rate | under 20% | Dismissed review comments |
| False positive rate | under 15% | Manual audit of 50 comments/week |
| Time to review | under 3 minutes | Webhook received → comments posted |
| Bug catch rate | measure over time | Bugs found in review vs merged |
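The rate-based targets in the table reduce to simple ratios over counters the bot already has. A minimal sketch follows, assuming the counts are collected elsewhere; `ReviewStats` and `check_targets` are hypothetical names, and the thresholds are the ones from the table.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    posted: int           # comments posted in the period
    accepted: int         # comments with a thumbs-up reaction
    dismissed: int        # comments dismissed by engineers
    audited: int          # comments checked in the manual audit
    false_positives: int  # audited comments judged wrong


def rate(numerator: int, denominator: int) -> float:
    # Avoid division by zero for quiet periods.
    return numerator / denominator if denominator else 0.0


def check_targets(stats: ReviewStats) -> dict[str, bool]:
    """Return True per metric when the target from the table is met."""
    return {
        "acceptance_rate": rate(stats.accepted, stats.posted) > 0.30,
        "dismissal_rate": rate(stats.dismissed, stats.posted) < 0.20,
        "false_positive_rate": rate(stats.false_positives, stats.audited) < 0.15,
    }
```

Per-category versions of the same ratios are what a "learn from dismissals" loop would consume.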

MVP vs Production

MVP (1 week):

Webhook receiver (FastAPI)
Parse diff, call GPT-4o with full diff (no chunking)
Post all comments above 0.7 confidence
One repo, one language

Production:

Chunking for large PRs
Per-language prompts
Comment deduplication
Async queue (handle traffic spikes)
Dashboard: acceptance rate, dismissal rate, latency
Configuration per repo (severity thresholds, ignored paths)
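Per-repo configuration can start as a frozen dataclass of defaults plus a dictionary of overrides. This is one possible shape, not a prescribed one: `RepoConfig`, `config_for`, the field names, and the repository name in the usage line are all illustrative, and the defaults mirror values used earlier in the article.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoConfig:
    languages: tuple[str, ...] = ("python", "typescript", "csharp")
    confidence_threshold: float = 0.75
    max_comments_per_pr: int = 20
    ignored_paths: tuple[str, ...] = ("*.generated.*", "package-lock.json", "yarn.lock", "*.min.js")
    review_drafts: bool = False


DEFAULT_CONFIG = RepoConfig()


def config_for(repo: str, overrides: dict[str, dict]) -> RepoConfig:
    """Return the defaults with any per-repo overrides applied."""
    # dataclasses.replace raises TypeError on unknown field names,
    # which catches typos in the override file early.
    return dataclasses.replace(DEFAULT_CONFIG, **overrides.get(repo, {}))
```

For example, `config_for("example-org/legacy-api", {"example-org/legacy-api": {"confidence_threshold": 0.9}})` raises the bar for one noisy repository while leaving every other setting at its default.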
