
Skill 4 — RAG: Chunk, Embed & Index the Drug Knowledge Base

Build the retrieval-augmented generation pipeline: load 1,200 FDA drug records, chunk them intelligently, embed with Azure OpenAI, and index into Azure AI Search.

Asma Hafeez Khan · May 15, 2026 · 4 min read
RAG · LangChain · Azure OpenAI · Embeddings · Vector Search · Drug Database

Why RAG — Not Just GPT-4o

GPT-4o has broad medical knowledge from training data, but:

  • Its knowledge has a cutoff date — new drug warnings are missed
  • It can hallucinate drug interactions with false confidence
  • It cannot cite a specific FDA label section

RAG fixes all three: we retrieve the actual drug label text and give it to GPT-4o as grounded context. The model answers from evidence, not memory.

Without RAG:  User question → GPT-4o → answer (from training memory, possibly wrong)

With RAG:     User question → Embed query → Search drug database
                                         → Retrieve top-3 relevant chunks
                                         → GPT-4o(question + chunks) → cited answer

The Drug Dataset

The repo ships with data/drugs.jsonl — 1,200 FDA drug label records in JSON Lines format:

JSON
{"drug_name": "Metformin", "ndc": "0093-1048", "section": "INDICATIONS AND USAGE", "text": "Metformin hydrochloride tablets are indicated as an adjunct to diet and exercise to improve glycemic control in adults and pediatric patients 10 years of age and older with type 2 diabetes mellitus."}
{"drug_name": "Warfarin", "ndc": "0056-0172", "section": "WARNINGS", "text": "Bleeding Risk: Warfarin sodium can cause major or fatal bleeding. Bleeding is more likely during the first month..."}

Each record has:

  • drug_name — canonical drug name
  • ndc — National Drug Code (unique identifier)
  • section — label section (WARNINGS, DOSAGE, INTERACTIONS, etc.)
  • text — the actual label content (100–800 tokens per record)
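Before chunking, it is worth a quick sanity pass over the file to catch malformed lines early. A minimal sketch (this `check_drugs_jsonl` helper is our addition, not part of the repo):

```python
# illustrative helper, not in the repo
import json

REQUIRED_KEYS = {"drug_name", "ndc", "section", "text"}

def validate_record(record: dict) -> bool:
    """True if the record has all required fields with non-empty values."""
    return REQUIRED_KEYS <= record.keys() and all(record[k] for k in REQUIRED_KEYS)

def check_jsonl(path: str) -> tuple[int, int]:
    """Return (valid, invalid) record counts for a JSON Lines file."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                ok = validate_record(json.loads(line))
            except json.JSONDecodeError:
                ok = False
            valid, invalid = valid + ok, invalid + (not ok)
    return valid, invalid
```

Running this against `data/drugs.jsonl` should report 1,200 valid records and 0 invalid ones before you spend any embedding budget.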

Step 1: Chunking

FDA label text can be long. We chunk into ~512-token passages with overlap so no important sentence gets cut at a boundary:

Python
# pharmabot/rag/chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import Iterator
import json

CHUNK_SIZE    = 512   # tokens (~400 words)
CHUNK_OVERLAP = 52    # ~10% overlap

# chunk_size is measured in *characters* by default; the tiktoken
# constructor enforces the limits above in tokens instead, matching
# how the embedding model counts them.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " "],
)

def chunk_drug_records(jsonl_path: str) -> Iterator[dict]:
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            chunks = splitter.split_text(record["text"])
            for i, chunk in enumerate(chunks):
                yield {
                    "id":           f"{record['ndc']}-{record['section']}-{i}",
                    "drug_name":    record["drug_name"],
                    "ndc":          record["ndc"],
                    "section":      record["section"],
                    "chunk_index":  i,
                    "text":         chunk,
                }

Why RecursiveCharacterTextSplitter? It tries to split at paragraph breaks first, then sentences, then words — preserving semantic coherence better than splitting at fixed character counts.
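The overlap itself is easy to picture with a stripped-down, fixed-offset splitter (an illustration of the idea only, not LangChain's implementation):

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Naive fixed-size splitter: each chunk re-reads the last `overlap`
    characters of the previous one, so a sentence cut at one chunk's end
    still appears whole at the start of the next."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Because every boundary region lands in two chunks, no sentence is lost to a split; `RecursiveCharacterTextSplitter` gives the same guarantee while preferring paragraph and sentence boundaries over fixed offsets.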


Step 2: Embedding

Python
# pharmabot/rag/embedder.py
from openai import AsyncAzureOpenAI
from pharmabot.config import settings

class DrugEmbedder:
    def __init__(self):
        self.client = AsyncAzureOpenAI(
            api_key=settings.azure_openai_api_key,
            azure_endpoint=settings.azure_openai_endpoint,
            api_version="2024-02-01",
        )
        self.deployment = settings.azure_openai_embedding_deployment  # text-embedding-3-small

    async def embed(self, texts: list[str]) -> list[list[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.deployment,
        )
        return [item.embedding for item in response.data]

    async def embed_single(self, text: str) -> list[float]:
        results = await self.embed([text])
        return results[0]

text-embedding-3-small produces 1,536-dimensional vectors. Cost: $0.02 per 1M tokens — embedding 1,200 records costs less than $0.10.
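Note that `embed` already accepts a list, but the seeding script below calls `embed_single` once per chunk, i.e. one HTTP round-trip each. A small batching helper (our addition, not in the repo) cuts 1,200+ requests down to a few dozen:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable, n: int) -> Iterator[list]:
    """Yield successive lists of up to n items from any iterable."""
    it = iter(items)
    while group := list(islice(it, n)):
        yield group

# Sketch of how the seeder could use it — embed whole batches at once:
#
#   for group in batched(chunk_drug_records("data/drugs.jsonl"), 16):
#       vectors = await embedder.embed([c["text"] for c in group])
#       for c, v in zip(group, vectors):
#           c["content_vector"] = v
#       await search_client.upload_documents(group)
```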


Step 3: Indexing into Azure AI Search

Python
# scripts/seed_knowledge_base.py
import asyncio
from pharmabot.rag.chunker import chunk_drug_records
from pharmabot.rag.embedder import DrugEmbedder
from azure.search.documents.aio import SearchClient
from azure.core.credentials import AzureKeyCredential
from pharmabot.config import settings

async def seed():
    embedder = DrugEmbedder()
    search_client = SearchClient(
        endpoint=settings.azure_search_endpoint,
        index_name=settings.azure_search_index,
        credential=AzureKeyCredential(settings.azure_search_api_key),
    )

    batch = []
    async with search_client:
        for chunk in chunk_drug_records("data/drugs.jsonl"):
            # Generate embedding
            vector = await embedder.embed_single(chunk["text"])
            chunk["content_vector"] = vector
            batch.append(chunk)

            # Upload in batches of 100
            if len(batch) == 100:
                await search_client.upload_documents(batch)
                print(f"Indexed {len(batch)} chunks")
                batch.clear()

        if batch:
            await search_client.upload_documents(batch)

    print("Drug knowledge base seeded successfully.")

if __name__ == "__main__":
    asyncio.run(seed())

Run once:

Bash
python scripts/seed_knowledge_base.py
# Takes 3-5 minutes for 1,200 records

Step 4: The RAG Pipeline

Python
# pharmabot/rag/pipeline.py
from pharmabot.rag.embedder import DrugEmbedder
from pharmabot.rag.retriever import HybridRetriever

class RAGPipeline:
    def __init__(self):
        self.embedder  = DrugEmbedder()
        self.retriever = HybridRetriever()

    async def retrieve(self, query: str, top_k: int = 3) -> list[dict]:
        query_vector = await self.embedder.embed_single(query)
        chunks = await self.retriever.search(query, query_vector, top_k=top_k)
        return chunks

    def format_context(self, chunks: list[dict]) -> str:
        parts = []
        for i, chunk in enumerate(chunks, 1):
            parts.append(
                f"[Source {i}: {chunk['drug_name']} — {chunk['section']}]\n{chunk['text']}"
            )
        return "\n\n---\n\n".join(parts)
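The last hop in the diagram — GPT-4o(question + chunks) → cited answer — is a chat completion over the formatted context. A sketch of the prompt assembly (the `build_messages` helper, the system prompt wording, and the `azure_openai_chat_deployment` setting name are our assumptions, not shown in the repo):

```python
SYSTEM_PROMPT = (
    "You are a pharmacy assistant. Answer ONLY from the provided FDA label "
    "excerpts. Cite sources as [Source N]. If the excerpts do not contain "
    "the answer, say so."
)

def build_messages(question: str, context: str) -> list[dict]:
    """Assemble the chat payload: grounded context first, then the question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"FDA label excerpts:\n\n{context}\n\nQuestion: {question}",
        },
    ]

# Wiring it into the pipeline (chat client configured like DrugEmbedder's;
# the chat deployment setting name is assumed):
#
#   chunks  = await pipeline.retrieve("Can I take warfarin with aspirin?")
#   context = pipeline.format_context(chunks)
#   resp    = await client.chat.completions.create(
#       model=settings.azure_openai_chat_deployment,
#       messages=build_messages(question, context),
#   )
```

The `[Source N]` tags in the answer line up with the `[Source 1: Warfarin — WARNINGS]` headers that `format_context` emits, which is what makes the citations verifiable.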

Checkpoint

Test retrieval directly:

Bash
curl -X POST http://localhost:8000/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "metformin type 2 diabetes dosage", "top_k": 3}'

You should see 3 chunks from the drug database with relevance scores and the label section they came from. If scores are all 0 or chunks are empty, check your Azure Search index name in .env.
