Skill 4 — RAG: Chunk, Embed & Index the Drug Knowledge Base
Build the retrieval-augmented generation pipeline: load 1,200 FDA drug records, chunk them intelligently, embed with Azure OpenAI, and index into Azure AI Search.
Why RAG — Not Just GPT-4o
GPT-4o has broad medical knowledge from training data, but:
- Its knowledge has a cutoff date — new drug warnings are missed
- It can hallucinate drug interactions with false confidence
- It cannot cite a specific FDA label section
RAG fixes all three: we retrieve the actual drug label text and give it to GPT-4o as grounded context. The model answers from evidence, not memory.
```text
Without RAG: User question → GPT-4o → answer (from training memory, possibly wrong)

With RAG:    User question → Embed query → Search drug database
                           → Retrieve top-3 relevant chunks
                           → GPT-4o(question + chunks) → cited answer
```

The Drug Dataset
The repo ships with data/drugs.jsonl — 1,200 FDA drug label records in JSON Lines format:
{"drug_name": "Metformin", "ndc": "0093-1048", "section": "INDICATIONS AND USAGE", "text": "Metformin hydrochloride tablets are indicated as an adjunct to diet and exercise to improve glycemic control in adults and pediatric patients 10 years of age and older with type 2 diabetes mellitus."}
{"drug_name": "Warfarin", "ndc": "0056-0172", "section": "WARNINGS", "text": "Bleeding Risk: Warfarin sodium can cause major or fatal bleeding. Bleeding is more likely during the first month..."}Each record has:
- `drug_name` — canonical drug name
- `ndc` — National Drug Code (unique identifier)
- `section` — label section (WARNINGS, DOSAGE, INTERACTIONS, etc.)
- `text` — the actual label content (100–800 tokens per record)
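Before chunking anything, it's worth a quick look at what's actually in the file. A throwaway inspection snippet (nothing assumed beyond the data/drugs.jsonl path above):

```python
import json
from collections import Counter

with open("data/drugs.jsonl") as f:
    records = [json.loads(line) for line in f]

sections = Counter(r["section"] for r in records)
print(f"{len(records)} records")   # expect 1,200
print(sections.most_common(5))     # which label sections dominate the dataset
```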
Step 1: Chunking
FDA label text can be long. We chunk it into ~512-token passages with overlap, so a sentence that straddles one chunk boundary still appears intact in the neighboring chunk:
```python
# pharmabot/rag/chunker.py
import json
from typing import Iterator

from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 512    # tokens (~400 words)
CHUNK_OVERLAP = 52  # ~10% overlap

# Measure length in tokens (cl100k_base, the encoding used by the
# text-embedding-3 models) instead of the splitter's default character
# count, so chunks actually come out near 512 tokens.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " "],
)


def chunk_drug_records(jsonl_path: str) -> Iterator[dict]:
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            chunks = splitter.split_text(record["text"])
            for i, chunk in enumerate(chunks):
                # Azure AI Search document keys allow only letters, digits,
                # dashes, underscores, and equals signs -- section names
                # like "INDICATIONS AND USAGE" contain spaces, so sanitize.
                section_key = record["section"].replace(" ", "-")
                yield {
                    "id": f"{record['ndc']}-{section_key}-{i}",
                    "drug_name": record["drug_name"],
                    "ndc": record["ndc"],
                    "section": record["section"],
                    "chunk_index": i,
                    "text": chunk,
                }
```

Why RecursiveCharacterTextSplitter? It tries to split at paragraph breaks first, then sentences, then words — preserving semantic coherence better than splitting at fixed character counts.
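A quick sanity check before wiring anything else up: run the chunker end to end and eyeball the output (a throwaway snippet; nothing assumed beyond the module and dataset above):

```python
from pharmabot.rag.chunker import chunk_drug_records

chunks = list(chunk_drug_records("data/drugs.jsonl"))
print(f"{len(chunks)} chunks from 1,200 records")  # somewhat more than 1,200:
                                                   # longer records split into 2+ chunks
print(chunks[0]["id"])         # e.g. 0093-1048-INDICATIONS-AND-USAGE-0
print(chunks[0]["text"][:200])
```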
Step 2: Embedding
```python
# pharmabot/rag/embedder.py
from openai import AsyncAzureOpenAI

from pharmabot.config import settings


class DrugEmbedder:
    def __init__(self):
        self.client = AsyncAzureOpenAI(
            api_key=settings.azure_openai_api_key,
            azure_endpoint=settings.azure_openai_endpoint,
            api_version="2024-02-01",
        )
        self.deployment = settings.azure_openai_embedding_deployment  # text-embedding-3-small

    async def embed(self, texts: list[str]) -> list[list[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.deployment,
        )
        return [item.embedding for item in response.data]

    async def embed_single(self, text: str) -> list[float]:
        results = await self.embed([text])
        return results[0]
```

text-embedding-3-small produces 1,536-dimensional vectors. Cost: $0.02 per 1M tokens — embedding 1,200 records costs less than $0.10.
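A one-off smoke test, assuming the Azure OpenAI settings are already in your .env from the earlier skills:

```python
import asyncio

from pharmabot.rag.embedder import DrugEmbedder


async def main():
    embedder = DrugEmbedder()
    vector = await embedder.embed_single("warfarin bleeding risk")
    print(len(vector))  # 1536 -- must match the index's vector field dimensions


asyncio.run(main())
```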
Step 3: Indexing into Azure AI Search
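The seed script below assumes the search index already exists. If your environment doesn't provision it elsewhere, here is a minimal sketch of an index whose schema matches the chunk documents (requires azure-search-documents 11.4+; the "hnsw" and "vector-profile" names are arbitrary choices, not repo conventions):

```python
# One-off index setup: a sketch, not necessarily how the repo does it.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SearchableField,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

from pharmabot.config import settings

index = SearchIndex(
    name=settings.azure_search_index,
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="drug_name", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="ndc", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="section", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="chunk_index", type=SearchFieldDataType.Int32),
        SearchableField(name="text", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # text-embedding-3-small
            vector_search_profile_name="vector-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
        profiles=[VectorSearchProfile(name="vector-profile", algorithm_configuration_name="hnsw")],
    ),
)

SearchIndexClient(
    endpoint=settings.azure_search_endpoint,
    credential=AzureKeyCredential(settings.azure_search_api_key),
).create_or_update_index(index)
```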
```python
# scripts/seed_knowledge_base.py
import asyncio

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient

from pharmabot.config import settings
from pharmabot.rag.chunker import chunk_drug_records
from pharmabot.rag.embedder import DrugEmbedder


async def seed():
    embedder = DrugEmbedder()
    search_client = SearchClient(
        endpoint=settings.azure_search_endpoint,
        index_name=settings.azure_search_index,
        credential=AzureKeyCredential(settings.azure_search_api_key),
    )
    batch = []
    async with search_client:
        for chunk in chunk_drug_records("data/drugs.jsonl"):
            # Generate embedding
            vector = await embedder.embed_single(chunk["text"])
            chunk["content_vector"] = vector
            batch.append(chunk)
            # Upload in batches of 100
            if len(batch) == 100:
                await search_client.upload_documents(batch)
                print(f"Indexed {len(batch)} chunks")
                batch.clear()
        # Flush the final partial batch
        if batch:
            await search_client.upload_documents(batch)
    print("Drug knowledge base seeded successfully.")


if __name__ == "__main__":
    asyncio.run(seed())
```

Run once:
```bash
python scripts/seed_knowledge_base.py
# Takes 3-5 minutes for 1,200 records
```
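To confirm the upload landed, count the documents in the index (get_document_count is part of the same async SDK; counts can lag a few seconds behind a fresh upload):

```python
import asyncio

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient

from pharmabot.config import settings


async def main():
    client = SearchClient(
        endpoint=settings.azure_search_endpoint,
        index_name=settings.azure_search_index,
        credential=AzureKeyCredential(settings.azure_search_api_key),
    )
    async with client:
        print(await client.get_document_count())  # roughly the chunk count, >= 1,200


asyncio.run(main())
```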
Step 4: The RAG Pipeline

```python
# pharmabot/rag/pipeline.py
from pharmabot.rag.embedder import DrugEmbedder
from pharmabot.rag.retriever import HybridRetriever


class RAGPipeline:
    def __init__(self):
        self.embedder = DrugEmbedder()
        self.retriever = HybridRetriever()

    async def retrieve(self, query: str, top_k: int = 3) -> list[dict]:
        query_vector = await self.embedder.embed_single(query)
        chunks = await self.retriever.search(query, query_vector, top_k=top_k)
        return chunks

    def format_context(self, chunks: list[dict]) -> str:
        parts = []
        for i, chunk in enumerate(chunks, 1):
            parts.append(
                f"[Source {i}: {chunk['drug_name']} — {chunk['section']}]\n{chunk['text']}"
            )
        return "\n\n---\n\n".join(parts)
```
Checkpoint

Test retrieval directly:
```bash
curl -X POST http://localhost:8000/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "metformin type 2 diabetes dosage", "top_k": 3}'
```

You should see 3 chunks from the drug database with relevance scores and the label section they came from. If scores are all 0 or chunks are empty, check your Azure Search index name in .env.