System Design: AI Code Review Assistant
Design an AI code review tool that automatically reviews pull requests, from GitHub webhook to LLM reviewer to posted comments. Covers diff parsing, chunking large PRs, and quality control.
The Interview Question
"Design an AI code review assistant that automatically reviews GitHub pull requests and posts constructive comments. It should catch bugs, style issues, and security vulnerabilities without being noisy."
Step 1: Clarify Requirements
- Trigger: On PR opened or updated (GitHub webhook)
- Languages: Python, TypeScript, C# (configurable per repo)
- Comment style: Inline comments on specific lines, plus a PR summary
- Quality bar: Only post comments when confidence is high; no spam
- Latency: Review should complete within 3 minutes of PR creation
- Scale: 50 repositories, up to 200 PRs/day
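The requirements above could be captured as a small per-repository configuration object. The following is a minimal sketch; `RepoConfig` and its field names are illustrative assumptions, and the defaults simply restate the numbers from the list.

```python
from dataclasses import dataclass


# Hypothetical per-repo settings derived from the requirements list.
@dataclass(frozen=True)
class RepoConfig:
    repo: str                                   # "owner/name"
    languages: tuple = ("python", "typescript", "csharp")
    confidence_threshold: float = 0.75          # quality bar for posting a comment
    latency_budget_seconds: int = 180           # 3-minute review target
    post_summary: bool = True                   # PR summary alongside inline comments

    def supports(self, language: str) -> bool:
        """Return True if this repo wants reviews for the given language."""
        return language.lower() in self.languages
```

A worker could look up the `RepoConfig` for the repository named in the webhook payload before deciding which files to review and which threshold to apply.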
Step 2: Back-of-Envelope
200 PRs/day, average 300 lines changed per PR:
- 200 × 300 lines = 60,000 lines reviewed/day
- Comment rate: about 1 candidate comment per 30 lines = ~2,000 comments/day (before filtering)
- After confidence filtering (50% pass): ~1,000 posted comments/day
Token estimate per PR:
- Diff context: ~2,000 tokens
- System prompt + instructions: ~500 tokens
- Response (comments): ~600 tokens
- Total: ~3,100 tokens per PR × 200 PRs = 620,000 tokens/day
At GPT-4o pricing, 620,000 tokens/day costs on the order of a few dollars, which is negligible. The challenge is latency and quality, not cost.
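The estimate can be reproduced in a few lines, which makes it easy to rerun with different inputs. The volumes come from the figures above; the price per million tokens is an assumed placeholder, not a quoted rate, and should be replaced with current pricing.

```python
# Inputs taken from the estimates above.
PRS_PER_DAY = 200
LINES_PER_PR = 300
LINES_PER_COMMENT = 30
FILTER_PASS_RATE = 0.5
TOKENS_PER_PR = 2_000 + 500 + 600  # diff context + prompt + response

lines_per_day = PRS_PER_DAY * LINES_PER_PR
raw_comments = lines_per_day // LINES_PER_COMMENT
posted_comments = int(raw_comments * FILTER_PASS_RATE)
tokens_per_day = TOKENS_PER_PR * PRS_PER_DAY

# Assumed blended price, purely illustrative.
ASSUMED_USD_PER_MILLION_TOKENS = 5.0
cost_per_day = tokens_per_day / 1_000_000 * ASSUMED_USD_PER_MILLION_TOKENS

print(f"lines/day: {lines_per_day}, candidate comments: {raw_comments}, "
      f"posted: {posted_comments}, tokens/day: {tokens_per_day}")
print(f"cost/day at the assumed price: ${cost_per_day:.2f}")
```

Even if the assumed price were off by an order of magnitude, the daily cost would stay small next to engineering time, which is why the rest of the design focuses on latency and comment quality.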
System Architecture
GitHub PR Event
         │
         ▼ webhook
┌─────────────────┐
│ Webhook         │
│ Receiver        │ ← Validates GitHub signature, enqueues job
└────────┬────────┘
         │ async
         ▼
┌─────────────────┐
│ Review Queue    │ ← Azure Service Bus or Redis queue
│ (per-repo       │
│ prioritization) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Review Worker   │
│ - Fetch diff    │
│ - Chunk diff    │
│ - Run LLM       │
│ - Filter        │
│ - Post comments │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ GitHub API      │ ← POST review comments
└─────────────────┘

Step 3: Diff Parsing and Chunking
Large PRs can exceed the model's context window, so the diff is parsed into per-file chunks and any oversized chunk is split further:
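# reviewer/diff_parser.py
from dataclasses import dataclass
from pathlib import PurePosixPath
import re

# Captures the new-file start line from a hunk header: @@ -a,b +c,d @@
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

# Minimal extension map for the supported languages; extend as needed.
LANGUAGES = {".py": "python", ".ts": "typescript", ".tsx": "typescript", ".cs": "csharp"}


@dataclass
class DiffChunk:
    file_path: str
    language: str
    start_line: int          # first new-file line covered by the chunk
    end_line: int            # last new-file line covered by the chunk
    content: str
    added_lines: list[int]   # new-file line numbers of the additions


def extract_file_path(header: str) -> str:
    """Return the new path from a 'diff --git a/old b/new' header."""
    return header.split(" b/", 1)[-1]


def detect_language(file_path: str) -> str:
    """Guess the language from the file extension."""
    return LANGUAGES.get(PurePosixPath(file_path).suffix, "unknown")


def parse_diff(raw_diff: str) -> list[DiffChunk]:
    """Parse a unified diff into one reviewable chunk per file."""
    chunks: list[DiffChunk] = []
    current_file = None
    current_lines: list[str] = []
    added_lines: list[int] = []
    start_line = 0   # new-file line where the first hunk begins
    new_line = 0     # new-file line number of the next diff line

    def flush() -> None:
        if current_file and current_lines:
            chunks.append(DiffChunk(
                file_path=current_file,
                language=detect_language(current_file),
                start_line=start_line,
                end_line=max(new_line - 1, start_line),
                content="\n".join(current_lines),
                added_lines=added_lines,
            ))

    for line in raw_diff.splitlines():
        if line.startswith("diff --git"):
            flush()
            current_file = extract_file_path(line)
            current_lines, added_lines = [], []
            start_line = new_line = 0
        elif line.startswith("@@"):
            match = HUNK_HEADER.match(line)
            if match:
                new_line = int(match.group(1))
                start_line = start_line or new_line
            current_lines.append(line)
        elif new_line == 0:
            continue  # file headers (index, ---, +++) before the first hunk
        elif line.startswith("+"):
            added_lines.append(new_line)
            current_lines.append(line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            # Removals and "\ No newline" markers do not advance the new file
            current_lines.append(line)
        else:
            current_lines.append(line)  # context line
            new_line += 1
    flush()  # the last file has no following "diff --git" line
    return chunks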
def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token); use a real tokenizer for accuracy."""
    return len(text) // 4


def split_large_diff(chunks: list[DiffChunk], max_tokens: int = 2000) -> list[DiffChunk]:
    """Split chunks that are too large for a single LLM call."""
    result = []
    for chunk in chunks:
        if estimate_tokens(chunk.content) <= max_tokens:
            result.append(chunk)
        else:
            # Split by logical boundaries (function/class definitions).
            # split_at_function_boundaries is a helper that is not shown here.
            sub_chunks = split_at_function_boundaries(chunk, max_tokens)
            result.extend(sub_chunks)
    return result

Step 4: The Review LLM Call
# reviewer/llm_reviewer.py
import json

from openai import AsyncAzureOpenAI
from pydantic import BaseModel

from reviewer.diff_parser import DiffChunk


class ReviewComment(BaseModel):
    line: int
    severity: str        # "error" | "warning" | "suggestion"
    category: str        # "bug" | "security" | "style" | "performance"
    comment: str
    confidence: float    # 0.0-1.0
    file_path: str = ""  # filled in by review_chunk, not by the model


class ReviewResult(BaseModel):
    comments: list[ReviewComment]
    summary: str


REVIEW_SYSTEM_PROMPT = """You are an expert code reviewer. Review the provided diff and identify:
1. **Bugs**: Logic errors, off-by-one, null dereference, incorrect conditions
2. **Security**: SQL injection, XSS, hardcoded secrets, insecure deserialization
3. **Performance**: N+1 queries, unnecessary loops, missing indexes
4. **Style**: Naming, complexity, missing error handling
RULES:
- Only comment on ADDED lines (lines starting with +)
- Only post a comment if confidence is above 0.7
- Be specific: reference the exact line and explain why it's a problem
- Suggest the fix, don't just identify the problem
- Do NOT comment on formatting/whitespace
- Maximum 5 comments per file
Respond as JSON matching the ReviewResult schema."""

# The prompt refers to the ReviewResult schema, so send the schema along with it.
SCHEMA_JSON = json.dumps(ReviewResult.model_json_schema())


async def review_chunk(
    chunk: DiffChunk,
    client: AsyncAzureOpenAI,
) -> ReviewResult:
    messages = [
        {
            "role": "system",
            "content": f"{REVIEW_SYSTEM_PROMPT}\n\nReviewResult JSON schema:\n{SCHEMA_JSON}",
        },
        {
            "role": "user",
            "content": (
                f"File: {chunk.file_path} ({chunk.language})\n"
                f"Starting at line {chunk.start_line}\n\n"
                f"```diff\n{chunk.content}\n```\n\n"
                f"Review this diff. Added lines are: {chunk.added_lines}"
            ),
        },
    ]
    response = await client.chat.completions.create(
        model="gpt-4o",  # with Azure OpenAI this is the deployment name
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    result = ReviewResult.model_validate_json(
        response.choices[0].message.content
    )
    # Drop comments that point at lines the PR did not add, then record the
    # file path so the comments can be posted inline later.
    result.comments = [c for c in result.comments if c.line in chunk.added_lines]
    for comment in result.comments:
        comment.file_path = chunk.file_path
    return result

Step 5: Confidence Filtering
Only post comments above the confidence threshold:
# reviewer/filter.py
from reviewer.llm_reviewer import ReviewComment

CONFIDENCE_THRESHOLD = 0.75
MAX_COMMENTS_PER_PR = 20  # avoid spam


def filter_comments(
    all_comments: list[ReviewComment],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> list[ReviewComment]:
    # Filter by confidence
    filtered = [c for c in all_comments if c.confidence >= threshold]
    # Prioritise: errors > warnings > suggestions
    # (the sort is stable, so input order is kept within a severity)
    priority = {"error": 0, "warning": 1, "suggestion": 2}
    filtered.sort(key=lambda c: priority.get(c.severity, 3))
    # Cap to avoid noise
    return filtered[:MAX_COMMENTS_PER_PR]

Step 6: Posting Comments to GitHub
# reviewer/github_client.py
import httpx

from reviewer.llm_reviewer import ReviewComment


class GitHubReviewClient:
    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            base_url="https://api.github.com",
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github.v3+json",
            },
        )

    async def post_review(
        self,
        repo: str,
        pr_number: int,
        commit_sha: str,
        comments: list[ReviewComment],
        summary: str,
    ) -> None:
        # Build review with inline comments.
        # Assumes each comment carries the path of the file it refers to.
        review_comments = [
            {
                "path": c.file_path,
                "line": c.line,
                "body": f"**[{c.severity.upper()}]** {c.comment}",
            }
            for c in comments
        ]
        response = await self.client.post(
            f"/repos/{repo}/pulls/{pr_number}/reviews",
            json={
                "commit_id": commit_sha,
                "body": f"## AI Code Review Summary\n\n{summary}",
                "event": "COMMENT",  # Don't approve or request changes
                "comments": review_comments,
            },
        )
        response.raise_for_status()  # surface API errors instead of failing silently

Step 7: Avoiding Noise
The biggest failure mode for code review bots is noise: once a bot posts too many low-value comments, engineers mute it.
Rules to reduce noise:
- Confidence threshold 0.75+: Only post when reasonably certain
- Max 20 comments per PR: Hard cap, prioritise errors
- Deduplication: Don't comment on the same pattern twice in one PR
- Suppress on draft PRs: Only review when PR is marked ready
- Ignore generated files: Skip *.generated.cs, package-lock.json, migrations
- Learn from dismissals: If engineers dismiss comments, reduce confidence for that category
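The deduplication rule could be implemented by reducing each comment to a rough "pattern" key and keeping only the best comment per key. This is a sketch under stated assumptions: the `Comment` dataclass is a simplified stand-in for `ReviewComment`, and the normalization (lowercasing, masking backticked identifiers and numbers) is a heuristic chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Comment:  # simplified stand-in for ReviewComment
    file_path: str
    line: int
    category: str
    comment: str
    confidence: float


def pattern_key(c: Comment) -> tuple:
    """Reduce a comment to a rough pattern so near-identical remarks collide."""
    text = c.comment.lower()
    text = re.sub(r"`[^`]*`", "<id>", text)   # mask backticked identifiers
    text = re.sub(r"\d+", "<n>", text)        # mask line numbers and literals
    text = re.sub(r"\s+", " ", text).strip()
    return (c.category, text)


def deduplicate(comments: list) -> list:
    """Keep the highest-confidence comment per pattern, preserving input order."""
    best = {}
    for c in comments:
        key = pattern_key(c)
        if key not in best or c.confidence > best[key].confidence:
            best[key] = c
    kept = {id(c) for c in best.values()}
    return [c for c in comments if id(c) in kept]
```

Running this before the per-PR cap means the 20 available slots are not spent on the same remark repeated across files.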
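import fnmatch
from pathlib import PurePosixPath

# Patterns for files the bot should skip (generated code, lockfiles, migrations).
IGNORED_PATHS = [
    "*.generated.*",
    "package-lock.json",
    "yarn.lock",
    "migrations/",
    "*.min.js",
]


def should_review_file(file_path: str) -> bool:
    path = PurePosixPath(file_path)
    for pattern in IGNORED_PATHS:
        if pattern.endswith("/"):
            # Directory pattern: skip files under any directory with that name
            if pattern.rstrip("/") in path.parts[:-1]:
                return False
        elif fnmatch.fnmatch(path.name, pattern):
            # File pattern: match on the file name, wherever the file lives
            return False
    return True

Step 8: Quality Metrics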
Track these to know if the bot is useful:
| Metric | Target | How to Measure |
|---|---|---|
| Comment acceptance rate | above 30% | GitHub thumbs-up reactions |
| Dismissal rate | under 20% | Dismissed review comments |
| False positive rate | under 15% | Manual audit of 50 comments/week |
| Time to review | under 3 minutes | Webhook received → comments posted |
| Bug catch rate | measure over time | Bugs found in review vs merged |
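Several of these rates could be computed from a simple log of posted comments. A minimal sketch, assuming each posted comment is recorded with a hypothetical `outcome` field of "accepted", "dismissed", or "ignored"; how those outcomes are collected (reactions, dismissed reviews) is left open, and `PostedComment` is not part of any existing API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PostedComment:          # hypothetical record, one per posted comment
    category: str
    outcome: str              # "accepted" | "dismissed" | "ignored"
    latency_seconds: float    # webhook received -> comment posted


def summarize(records: list) -> dict:
    """Compute acceptance rate, dismissal rate, and worst-case review latency."""
    if not records:
        return {"acceptance_rate": 0.0, "dismissal_rate": 0.0, "max_latency_seconds": 0.0}
    outcomes = Counter(r.outcome for r in records)
    total = len(records)
    return {
        "acceptance_rate": outcomes["accepted"] / total,
        "dismissal_rate": outcomes["dismissed"] / total,
        "max_latency_seconds": max(r.latency_seconds for r in records),
    }


def meets_targets(summary: dict) -> bool:
    """Check a summary against the targets listed in the table."""
    return (
        summary["acceptance_rate"] > 0.30
        and summary["dismissal_rate"] < 0.20
        and summary["max_latency_seconds"] < 180
    )
```

Grouping the same records by `category` would also show which kinds of comments get dismissed most, which is the signal the "learn from dismissals" rule needs.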
MVP vs Production
MVP (1 week):
- Webhook receiver (FastAPI)
- Parse diff, call GPT-4o with full diff (no chunking)
- Post all comments above 0.7 confidence
- One repo, one language
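The webhook receiver's two core checks, signature validation and event filtering, can be sketched with the standard library alone. GitHub signs each delivery with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The function names and the selection of `pull_request` actions to review are assumptions, and wiring these into a FastAPI route is omitted.

```python
import hashlib
import hmac
from typing import Optional

# pull_request actions that should trigger a review (an assumed selection).
REVIEWABLE_ACTIONS = {"opened", "synchronize", "reopened", "ready_for_review"}


def verify_signature(secret: bytes, body: bytes, signature_header: Optional[str]) -> bool:
    """Check the X-Hub-Signature-256 header value against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def should_enqueue(event_name: str, payload: dict) -> bool:
    """Return True for non-draft pull request events worth reviewing."""
    if event_name != "pull_request":
        return False
    if payload.get("action") not in REVIEWABLE_ACTIONS:
        return False
    return not payload.get("pull_request", {}).get("draft", False)
```

The signature has to be computed over the raw request bytes, so the route should read the body before any JSON parsing. Rejecting draft PRs here keeps them out of the queue entirely until the `ready_for_review` event arrives.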
Production:
- Chunking for large PRs
- Per-language prompts
- Comment deduplication
- Async queue (handle traffic spikes)
- Dashboard: acceptance rate, dismissal rate, latency
- Configuration per repo (severity thresholds, ignored paths)
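For the chunking item, one possible way to split an oversized diff is to cut at lines that look like function or class definitions, with a hard cut when a block would still overflow the budget. The sketch below works on plain diff text; `DEFINITION_RE`, the 4-characters-per-token estimate, and the function name are assumptions rather than a fixed design, and a real system might use a language-aware parser instead.

```python
import re

CHARS_PER_TOKEN = 4  # crude estimate; a real tokenizer would be more accurate

# Diff lines (added, removed, or context) that appear to start a
# Python/TypeScript/C# definition. Deliberately rough.
DEFINITION_RE = re.compile(
    r"^[ +-]\s*(?:async\s+def|def|class|function|export|public|private|protected)\b"
)


def split_diff_text(content: str, max_tokens: int) -> list:
    """Split diff text into pieces that fit the budget, preferring definition
    boundaries. A single line longer than the budget becomes its own piece."""
    budget = max_tokens * CHARS_PER_TOKEN
    pieces, current, size = [], [], 0
    for line in content.split("\n"):
        line_size = len(line) + 1  # +1 for the newline
        overflow = size + line_size > budget
        at_boundary = DEFINITION_RE.match(line) is not None
        # Cut when the line would overflow the budget, or at a definition
        # once the current piece is at least half full.
        if current and (overflow or (at_boundary and size >= budget // 2)):
            pieces.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        pieces.append("\n".join(current))
    return pieces
```

Joining the pieces with newlines reproduces the original text, so no diff line is lost. Each piece would still need its own starting line number and added-line list before being sent to the reviewer.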