RAG Chatbot in .NET · Lesson 2 of 6
Text Embeddings — Vectors, Similarity, and Models
What Embeddings Are
An embedding is a list of numbers (a vector) that represents the meaning of text.
Similar meanings produce similar vectors — measured by cosine similarity.
Example:
"The patient's INR is 2.4"
→ [0.023, -0.441, 0.187, 0.902, ..., -0.031] (1536 numbers for ada-002)
"INR value of 2.4 was recorded"
→ [0.027, -0.438, 0.191, 0.899, ..., -0.029] (very similar vector)
"The weather is sunny today"
→ [0.712, 0.304, -0.553, 0.021, ..., 0.441] (very different vector)
Cosine similarity score (illustrative values; actual scores depend on the model):
INR sentence 1 vs INR sentence 2: 0.98 (nearly identical meaning)
INR sentence 1 vs weather: 0.12 (unrelated)
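The scores above come from comparing the direction of two vectors. As a minimal sketch of that calculation, and of ranking stored vectors against a query vector, the following may help. `VectorMath`, `CosineSimilarity`, and `TopK` are hypothetical names introduced for illustration, and a real vector database would normally do this ranking for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). The result lies in [-1, 1].
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same number of dimensions.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // A zero vector has no direction; treat its similarity as 0.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the k stored items most similar to the query, best match first.
    public static IReadOnlyList<(string Id, double Score)> TopK(
        float[] query, IEnumerable<(string Id, float[] Vector)> stored, int k)
    {
        return stored
            .Select(s => (s.Id, Score: CosineSimilarity(query, s.Vector)))
            .OrderByDescending(r => r.Score)
            .Take(k)
            .ToList();
    }
}
```

Two vectors pointing the same way score 1 and orthogonal vectors score 0, regardless of their lengths. This is a brute-force scan, which is only reasonable for small in-memory sets.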
This is the foundation of RAG (Retrieval-Augmented Generation):
1. Embed all your documents → store vectors in a database
2. Embed the user's query → find documents with similar vectors
3. Pass the matching documents to the AI as context
4. AI answers using the retrieved documents, not training data

Generating Embeddings with Semantic Kernel
// NuGet: Microsoft.SemanticKernel.Connectors.AzureOpenAI
// These embedding types are marked experimental in Semantic Kernel; depending on your
// version you may need to suppress the SKEXP0001 / SKEXP0010 diagnostics.
using Microsoft.SemanticKernel.Connectors.AzureOpenAI;
using Microsoft.SemanticKernel.Embeddings;

// Setup (assumes config is your IConfiguration and ct is the caller's CancellationToken):
var embeddingService = new AzureOpenAITextEmbeddingGenerationService(
    deploymentName: "text-embedding-ada-002", // or text-embedding-3-small
    endpoint: config["AzureOpenAI:Endpoint"]!,
    apiKey: config["AzureOpenAI:ApiKey"]!);

// Generate a single embedding:
var embedding = await embeddingService.GenerateEmbeddingAsync(
    "The patient's Warfarin dose is 5mg daily. INR target range: 2.0–3.0.",
    kernel: null,
    ct);
float[] vector = embedding.ToArray(); // 1536 floats for ada-002

// Generate embeddings for multiple texts (batching is more efficient):
var texts = new List<string>
{
    "Patient MRN-001: Warfarin 5mg daily, INR target 2.0–3.0",
    "Patient MRN-002: Apixaban 5mg twice daily, AF indication",
    "Patient MRN-003: Rivaroxaban 20mg daily with evening meal, DVT prophylaxis"
};
var embeddings = await embeddingService.GenerateEmbeddingsAsync(texts, kernel: null, ct);
// Returns IList<ReadOnlyMemory<float>>, one per input text, in input order

// Batching reduces API calls:
// 100 documents → 1 batch call (max ~2048 items per ada-002 request)
// vs 100 individual calls

Embedding Clinical Documents
// Service to embed and store clinical document chunks
public sealed class ClinicalDocumentEmbeddingService
{
    private readonly ITextEmbeddingGenerationService _embeddings;
    private readonly IVectorDocumentStore _vectorStore;

    // Both dependencies are supplied through constructor injection
    public ClinicalDocumentEmbeddingService(
        ITextEmbeddingGenerationService embeddings,
        IVectorDocumentStore vectorStore)
    {
        _embeddings = embeddings;
        _vectorStore = vectorStore;
    }

    public async Task IndexDocumentAsync(
        ClinicalDocument document,
        CancellationToken ct)
    {
        // Chunk the document (see rag-chunking for strategies)
        var chunks = ChunkDocument(document);

        // Embed all chunks in one batch
        var texts = chunks.Select(c => c.Text).ToList();
        var embeddings = await _embeddings.GenerateEmbeddingsAsync(texts, null, ct);

        // Store each chunk with its embedding (results come back in input order)
        var documents = chunks.Zip(embeddings, (chunk, embedding) =>
            new EmbeddedDocument(
                Id: Guid.NewGuid(),
                SourceId: document.Id,
                SourceType: document.Type,
                PatientMrn: document.PatientMrn,
                ChunkIndex: chunk.Index,
                Text: chunk.Text,
                Embedding: embedding.ToArray(),
                IndexedAt: DateTime.UtcNow));
        await _vectorStore.UpsertBatchAsync(documents, ct);
    }

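    private static IReadOnlyList<DocumentChunk> ChunkDocument(ClinicalDocument doc)
    {
        // Sliding window chunking: 500-word windows with a 50-word overlap.
        // Whitespace-separated words only approximate tokens; use a real tokenizer
        // if chunks must respect the model's token limit precisely.
        const int ChunkSize = 500;
        const int OverlapTokens = 50;
        var words = doc.Content.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<DocumentChunk>();
        var index = 0;
        var chunkIndex = 0;
        while (index < words.Length)
        {
            var end = Math.Min(index + ChunkSize, words.Length);
            var text = string.Join(" ", words[index..end]);
            chunks.Add(new DocumentChunk(chunkIndex++, text));

            // Stop once the end is reached; otherwise the next window would
            // repeat only words already contained in this chunk.
            if (end == words.Length) break;
            index += ChunkSize - OverlapTokens;
        }
        return chunks;
    }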
}

public sealed record ClinicalDocument(
    Guid Id,
    string PatientMrn,
    string Type, // "discharge_letter", "prescription_note", "clinical_guideline"
    string Content);

public sealed record EmbeddedDocument(
    Guid Id,
    Guid SourceId,
    string SourceType,
    string PatientMrn,
    int ChunkIndex,
    string Text,
    float[] Embedding,
    DateTime IndexedAt);

public sealed record DocumentChunk(int Index, string Text);

Choosing an Embedding Model
Model comparison for clinical .NET applications:
text-embedding-ada-002 (Azure OpenAI):
Dimensions: 1536
Max input: 8191 tokens
Cost: $0.0001 / 1K tokens
Quality: Good for English clinical text
Best for: General-purpose RAG, most production use cases
text-embedding-3-small (Azure OpenAI):
Dimensions: 1536 (default) or configurable
Max input: 8191 tokens
Cost: Lower than ada-002
Quality: Better than ada-002, especially for multilingual
Best for: New projects — prefer over ada-002
text-embedding-3-large:
Dimensions: 3072
Quality: Best available from OpenAI
Cost: Higher
Best for: High-accuracy RAG where quality matters more than cost
nomic-embed-text (Ollama — local):
Dimensions: 768
Cost: Free (local compute)
Quality: Good for development; lower than cloud models for clinical text
Best for: Local development only — do not use for production clinical RAG
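The per-token prices above turn into an indexing budget with simple arithmetic. A minimal sketch follows, assuming a hypothetical helper named `EmbeddingCost.Estimate`. The token counts you pass in are your own estimates, not figures from this lesson.

```csharp
using System;

public static class EmbeddingCost
{
    // Estimated cost in USD for embedding totalTokens at the given price per 1,000 tokens.
    public static decimal Estimate(long totalTokens, decimal pricePer1KTokens)
    {
        if (totalTokens < 0) throw new ArgumentOutOfRangeException(nameof(totalTokens));
        return totalTokens / 1000m * pricePer1KTokens;
    }
}
```

For example, 2,000 documents at an assumed average of 1,500 tokens each is 3,000,000 tokens. At the ada-002 price listed above, `EmbeddingCost.Estimate(3_000_000, 0.0001m)` comes to 0.30 USD for one full indexing pass.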
Rule of thumb:
Start with text-embedding-3-small.
If retrieval quality is insufficient, upgrade to text-embedding-3-large.
Keep the same model for both indexing and query embedding — mixing models
produces garbage similarity scores.

Embedding Model Registration in .NET
// Register exactly one embedding service with DI.
// NuGet: Azure.Identity (for DefaultAzureCredential)
if (builder.Environment.IsDevelopment())
{
    // Local development with Ollama.
    // An index built with nomic-embed-text (768 dimensions) cannot be queried with
    // the Azure model, so keep a separate index per environment.
    builder.Services.AddSingleton<ITextEmbeddingGenerationService>(_ =>
        new OllamaTextEmbeddingGenerationService(
            modelId: "nomic-embed-text",
            endpoint: new Uri("http://localhost:11434")));
}
else
{
    builder.Services.AddSingleton<ITextEmbeddingGenerationService>(_ =>
        new AzureOpenAITextEmbeddingGenerationService(
            deploymentName: builder.Configuration["AzureOpenAI:EmbeddingDeployment"]!,
            endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
            credential: new DefaultAzureCredential()));
}

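// Avoid re-embedding unchanged content by caching on a content hash.
// Namespaces needed: System.Security.Cryptography, System.Text, System.Text.Json,
// Microsoft.Extensions.Caching.Distributed, Microsoft.SemanticKernel,
// Microsoft.SemanticKernel.Embeddings
public sealed class CachingEmbeddingService(
    ITextEmbeddingGenerationService inner,
    IDistributedCache cache) : ITextEmbeddingGenerationService
{
    // Required by the interface; forward the wrapped service's metadata
    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public async Task<IList<ReadOnlyMemory<float>>> GenerateEmbeddingsAsync(
        IList<string> data, Kernel? kernel = null, CancellationToken ct = default)
    {
        var results = new List<ReadOnlyMemory<float>>(data.Count);
        foreach (var text in data)
        {
            // The key covers the text only. Clear the cache (or add the model name
            // to the key) when you switch embedding models.
            var key = $"embedding:{ComputeHash(text)}";
            var cached = await cache.GetStringAsync(key, ct);
            if (cached is not null)
            {
                results.Add(JsonSerializer.Deserialize<float[]>(cached)!);
                continue;
            }

            // Cache miss: embed this text on its own (misses are not batched here)
            var embedding = (await inner.GenerateEmbeddingsAsync([text], kernel, ct))[0];
            await cache.SetStringAsync(key,
                JsonSerializer.Serialize(embedding.ToArray()),
                new DistributedCacheEntryOptions { SlidingExpiration = TimeSpan.FromDays(7) },
                ct);
            results.Add(embedding);
        }
        return results;
    }

    private static string ComputeHash(string text)
    {
        // First 16 hex characters (64 bits) of the SHA-256 digest of the text
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(text));
        return Convert.ToHexString(bytes)[..16];
    }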
}

Production issue I've seen: A team built a RAG system for clinical guidelines using text-embedding-ada-002 and indexed 2,000 documents. Six months later, they switched to text-embedding-3-small for new documents because the quality was better. The index now contained vectors from two different models, and similarity search would compare ada-002 vectors from old documents against text-embedding-3-small vectors from new ones. The scores were meaningless: some relevant old documents were never retrieved, and some irrelevant new ones scored highly. When you change embedding models, you must re-embed your entire existing index with the new model. There is no migration path, because the vectors are incompatible.
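One way to catch this kind of mixed index early is to store the model name next to every vector and refuse to compare vectors that came from different models. The following is a minimal sketch of that idea. `TaggedEmbedding` and `EmbeddingModelGuard` are hypothetical names, not Semantic Kernel types. A dimension check alone would not have caught the case above, because both models return 1536 dimensions by default.

```csharp
using System;

// Hypothetical record: a vector tagged with the name of the model that produced it.
public sealed record TaggedEmbedding(string ModelId, float[] Vector);

public static class EmbeddingModelGuard
{
    // Throws if the stored vector was produced by a different model than the query vector.
    public static void EnsureSameModel(TaggedEmbedding query, TaggedEmbedding stored)
    {
        if (!string.Equals(query.ModelId, stored.ModelId, StringComparison.Ordinal))
        {
            throw new InvalidOperationException(
                $"Model mismatch: query uses '{query.ModelId}', stored vector uses " +
                $"'{stored.ModelId}'. Re-embed the index before searching.");
        }

        if (query.Vector.Length != stored.Vector.Length)
        {
            throw new InvalidOperationException(
                "Dimension mismatch between query and stored vector.");
        }
    }
}
```

In practice the same check can be expressed as a filter on a model-name field in the vector store, so that a query only ever sees vectors from the model that embedded it.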
Key Takeaway
Embeddings convert text into vectors where semantic similarity corresponds to mathematical proximity. Use ITextEmbeddingGenerationService in Semantic Kernel to generate embeddings, batch inputs for efficiency, and store vectors alongside the source text in a vector store. Use text-embedding-3-small for most new .NET RAG applications. Always use the same model for both indexing and querying; mixing models produces incorrect similarity scores. Cache embeddings by content hash to avoid re-embedding unchanged documents.