GraphRAG

The Limitation of Vector RAG

Standard RAG retrieves the chunks most similar to the query. This is a local, proximity-based search:

Query: "What drugs interact with Warfarin through CYP2C9 inhibition,
        and which of those also affect carbamazepine metabolism?"

Standard RAG:
  Retrieves documents about "Warfarin CYP2C9 interactions"
  May not retrieve documents connecting CYP2C9 inhibitors to carbamazepine
  The connection requires MULTI-HOP reasoning:
    Drug X → inhibits CYP2C9 → raises Warfarin levels
    Drug X → also inhibits CYP3A4 → raises carbamazepine levels

Vector similarity scores each chunk against the query independently, so it can't follow a chain of facts from one chunk to the next.
The answer requires connecting entities across multiple documents.
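
To make the multi-hop requirement concrete, here is a minimal sketch. It assumes the facts have already been pulled out of separate documents as (subject, relation, object) triples. The triples, the document IDs, and the helper names are all illustrative and not part of any library. The point is that the answer only appears once facts from different documents are joined.

```python
# Illustrative triples, each tagged with the (hypothetical) document it came from.
TRIPLES = [
    ("Drug X", "inhibits", "CYP2C9", "doc_a"),
    ("Drug X", "inhibits", "CYP3A4", "doc_b"),
    ("Warfarin", "metabolised_by", "CYP2C9", "doc_c"),
    ("carbamazepine", "metabolised_by", "CYP3A4", "doc_d"),
    ("Drug Y", "inhibits", "CYP2C9", "doc_e"),
]

def enzymes_for(drug: str) -> set[str]:
    """Hop 1: enzymes that metabolise the given drug."""
    return {o for s, r, o, _ in TRIPLES if s == drug and r == "metabolised_by"}

def inhibitors_of(enzymes: set[str]) -> set[str]:
    """Hop 2: drugs that inhibit at least one of the given enzymes."""
    return {s for s, r, o, _ in TRIPLES if r == "inhibits" and o in enzymes}

def shared_inhibitors(drug_a: str, drug_b: str) -> set[str]:
    """Drugs that inhibit an enzyme of drug_a AND an enzyme of drug_b."""
    return inhibitors_of(enzymes_for(drug_a)) & inhibitors_of(enzymes_for(drug_b))

result = shared_inhibitors("Warfarin", "carbamazepine")
print(result)
```

No single document above mentions both Warfarin and carbamazepine, which is why retrieving the top-k most similar chunks is unlikely to surface the full chain.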

GraphRAG Architecture

GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus offline, then queries it at runtime:

Phase 1: Graph Construction (offline)

  1. Entity extraction:
     LLM reads each document chunk and extracts entities
     Entities: Warfarin, CYP2C9, fluconazole, atrial fibrillation...

  2. Relationship extraction:
     LLM extracts relationships between entities
     (fluconazole) --[inhibits]--> (CYP2C9)
     (Warfarin) --[metabolised_by]--> (CYP2C9)
     (Warfarin) --[treats]--> (atrial fibrillation)

  3. Community detection:
     Cluster related entities into "communities" with a graph clustering algorithm (Microsoft's implementation uses Leiden)
     Community 1: anticoagulation (Warfarin, INR, bleeding risk, reversal)
     Community 2: CYP450 interactions (CYP2C9, CYP3A4, inhibitors, inducers)

  4. Community summaries:
     LLM writes a summary of each community
     "This community covers Warfarin metabolism via CYP2C9, including
      how inhibitors such as fluconazole increase INR..."
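
Steps 3 and 4 can be sketched without a graph library. The block below groups entities by connected components, which is a deliberately crude stand-in for the Leiden algorithm used by Microsoft's implementation. The edge list and the `summary_prompt` helper are hypothetical.

```python
from collections import defaultdict

def connected_components(edges: list[tuple[str, str]]) -> list[set[str]]:
    """Group entities so that every pair in a group is linked by some path."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen: set[str] = set()
    communities: list[set[str]] = []
    for start in sorted(neighbours):  # sorted only to make the order deterministic
        if start in seen:
            continue
        component: set[str] = set()
        stack = [start]
        while stack:  # depth-first walk over everything reachable from start
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        communities.append(component)
    return communities

def summary_prompt(community: set[str], edges: list[tuple[str, str]]) -> str:
    """Build the text an LLM would be asked to summarise for one community."""
    facts = [f"{a} -- {b}" for a, b in edges if a in community and b in community]
    return "Summarise what these linked entities have in common:\n" + "\n".join(facts)

edges = [
    ("Warfarin", "INR"), ("Warfarin", "bleeding risk"),
    ("CYP2C9", "fluconazole"), ("CYP3A4", "fluconazole"),
]
communities = connected_components(edges)
for community in communities:
    print(sorted(community))
```

Adding a single edge such as `("Warfarin", "CYP2C9")` would merge everything into one component. That is the reason real pipelines use modularity-based clustering, which can still split a connected graph into dense sub-groups.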

Phase 2: Retrieval (runtime)
  For "local" queries: traverse graph from relevant entities
  For "global" queries: retrieve and combine community summaries
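
A rough sketch of the "local" path: find the entities named in the query, then collect the relationships around them as context for the LLM. Matching entities by substring is an assumption made for brevity (an embedding lookup or entity linker would be more robust), and `local_context` and the triples are hypothetical.

```python
Triple = tuple[str, str, str]  # (source, relation, target)

def local_context(query: str, triples: list[Triple], hops: int = 1) -> list[str]:
    """Collect facts within `hops` steps of any entity named in the query."""
    entities = {e for s, _, t in triples for e in (s, t)}
    # Naive entity linking: case-insensitive substring match against the query.
    frontier = {e for e in entities if e.lower() in query.lower()}
    seen_entities = set(frontier)
    facts: list[str] = []

    for _ in range(hops):
        next_frontier: set[str] = set()
        for s, r, t in triples:
            if s in frontier or t in frontier:
                fact = f"({s}) --[{r}]--> ({t})"
                if fact not in facts:
                    facts.append(fact)
                # Entities not seen before are expanded on the next hop.
                next_frontier |= {s, t} - seen_entities
        seen_entities |= next_frontier
        frontier = next_frontier
    return facts

triples = [
    ("Warfarin", "metabolised_by", "CYP2C9"),
    ("fluconazole", "inhibits", "CYP2C9"),
    ("Warfarin", "treats", "atrial fibrillation"),
    ("carbamazepine", "metabolised_by", "CYP3A4"),
]
for fact in local_context("What affects Warfarin levels?", triples, hops=2):
    print(fact)
```

The returned facts would be placed in the prompt alongside, or instead of, the retrieved text chunks.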

Simplified Implementation

```python
import json
from dataclasses import dataclass, field

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

@dataclass
class Entity:
    name: str
    entity_type: str  # Drug, Enzyme, Disease, etc.

@dataclass
class Relationship:
    source: str
    target: str
    relation: str
    document_source: str  # document or chunk the relationship came from

@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

def extract_graph_from_chunk(text: str) -> dict:
    """Extract entities and relationships from a text chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=[{"role": "user", "content":
            f"""Extract all named entities and relationships from this text.
Return only JSON with this structure, with no surrounding text:
{{
  "entities": [{{"name": string, "type": string}}],
  "relationships": [{{"source": string, "target": string, "relation": string}}]
}}

Text:
{text}"""}]
    )
    raw = response.content[0].text
    # The model may still wrap the JSON in prose or a code fence,
    # so parse only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object in model output: {raw[:200]!r}")
    return json.loads(raw[start:end + 1])

def multi_hop_query(
    graph: KnowledgeGraph,
    start_entity: str,
    max_hops: int = 2
) -> list[str]:
    """Return all entities reachable within max_hops from start_entity.

    Relationships are followed in both directions (the graph is treated
    as undirected).
    """
    visited = {start_entity}
    frontier = {start_entity}

    # Breadth-first expansion: each pass over the relationships is one hop.
    for _ in range(max_hops):
        next_frontier = set()
        for rel in graph.relationships:
            if rel.source in frontier and rel.target not in visited:
                next_frontier.add(rel.target)
                visited.add(rel.target)
            if rel.target in frontier and rel.source not in visited:
                next_frontier.add(rel.source)
                visited.add(rel.source)
        frontier = next_frontier

    # Sorted so the result order is deterministic.
    return sorted(visited - {start_entity})
```
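
The listing above does not show how the extraction output reaches the graph. One way to close that gap is sketched below. `add_extraction` is a hypothetical helper, the dataclasses are repeated only so the block runs on its own, and keying entities by lowercased name is an assumption (real pipelines usually need stronger entity resolution for synonyms and brand names).

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    entity_type: str

@dataclass
class Relationship:
    source: str
    target: str
    relation: str
    document_source: str

@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

def add_extraction(graph: KnowledgeGraph, extracted: dict, doc_id: str) -> None:
    """Merge one chunk's extraction result (same JSON shape as above) into the graph."""
    def key(name: str) -> str:
        return name.strip().lower()  # assumption: names match case-insensitively

    for ent in extracted.get("entities", []):
        # First mention wins; later chunks do not overwrite the entity type.
        graph.entities.setdefault(key(ent["name"]), Entity(ent["name"], ent["type"]))

    for rel in extracted.get("relationships", []):
        src, tgt = key(rel["source"]), key(rel["target"])
        # A relationship may name an entity the LLM did not list; register it.
        for k, raw in ((src, rel["source"]), (tgt, rel["target"])):
            graph.entities.setdefault(k, Entity(raw, "Unknown"))
        graph.relationships.append(Relationship(src, tgt, rel["relation"], doc_id))

graph = KnowledgeGraph()
add_extraction(graph, {
    "entities": [{"name": "Warfarin", "type": "Drug"}],
    "relationships": [
        {"source": "warfarin", "target": "CYP2C9", "relation": "metabolised_by"},
    ],
}, doc_id="chunk-001")
print(sorted(graph.entities))
```

With this keying, `multi_hop_query` would need to be called with the normalised name, for example `multi_hop_query(graph, "warfarin")`.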

Community Summary RAG

For global queries (themes, overviews):

```python
from anthropic import Anthropic

def global_query(
    query: str,
    community_summaries: list[str],
    llm_client: Anthropic,
    max_summaries: int = 10,
) -> str:
    """Answer a global query using community summaries.

    Only the first max_summaries summaries are sent, to bound prompt size.
    """
    summaries_text = "\n\n---\n\n".join(
        f"Community {i+1}:\n{s}"
        for i, s in enumerate(community_summaries[:max_summaries])
    )

    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=[{"role": "user", "content":
            f"""Answer the following question using the community knowledge summaries provided.
Combine insights across communities as needed.

Community summaries:
{summaries_text}

Question: {query}"""}]
    )
    return response.content[0].text
```
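
`global_query` caps the number of summaries it sends (ten in the listing above), so larger graphs lose information. Microsoft's GraphRAG handles this with a map-reduce pass: answer the question against batches of summaries, then combine the partial answers. The sketch below covers only the batching step and uses a character budget as a rough proxy for tokens. `batch_summaries` and the budget value are assumptions.

```python
def batch_summaries(summaries: list[str], max_chars: int = 8000) -> list[list[str]]:
    """Greedily pack summaries into batches of at most max_chars characters.

    A single summary longer than max_chars gets a batch of its own
    rather than being truncated.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    size = 0
    for summary in summaries:
        # Start a new batch if adding this summary would exceed the budget.
        if current and size + len(summary) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(summary)
        size += len(summary)
    if current:
        batches.append(current)
    return batches

batches = batch_summaries(["a" * 50, "b" * 60, "c" * 30, "d" * 200], max_chars=100)
print([len(b) for b in batches])
```

Each batch would then go through a call like `global_query`, with one final LLM call merging the partial answers.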

GraphRAG vs Standard RAG

| Property | Standard RAG | GraphRAG |
|----------|-------------|---------|
| Retrieval type | Semantic similarity | Graph traversal + similarity |
| Multi-hop reasoning | Poor | Strong |
| Global queries (themes, overviews) | Poor | Strong (community summaries) |
| Local factual queries | Good | Good |
| Setup complexity | Low | High |
| Storage | Vectors only | Vectors + graph + summaries |
| Cost | Low | High (LLM-based extraction) |
| Updates | Easy (re-embed new docs) | Hard (rebuild graph) |

When to Use GraphRAG

Use GraphRAG:
  Questions requiring multi-hop reasoning: "Which drugs inhibit both CYP2C9 and CYP3A4 AND are prescribed for infections?"
  Global synthesis questions: "What are the major themes in our clinical guideline corpus?"
  Entity-centric knowledge bases (drug interactions, disease networks)

Don't use GraphRAG:
  Simple document Q&A
  Frequently updated corpora (graph rebuild is expensive)
  Latency-sensitive applications
  Small corpora (overhead not worth it)

Interview Answer

"GraphRAG builds a knowledge graph of entities and relationships from the document corpus, enabling multi-hop reasoning that vector retrieval alone can't support. It has two phases: offline graph construction (LLM-based entity and relationship extraction, community detection, community summary generation) and runtime retrieval (local queries traverse the graph; global queries aggregate community summaries). Microsoft's GraphRAG paper reports more comprehensive and more diverse answers than a vector RAG baseline on global questions such as 'what are the major themes?'. Standard RAG struggles with these, and with multi-hop questions such as 'which drugs share multiple interaction pathways?', because the answer depends on connecting entities across many documents. The trade-off: high setup cost, expensive updates, and no payoff for simple factual retrieval."