Ollama — Running Local LLMs in .NET Development
Use Ollama to run local large language models in .NET development: setup, integration with Semantic Kernel, model selection for clinical tasks, and when local models are appropriate vs. cloud APIs.
Why Ollama for .NET Development
Ollama runs LLMs locally on your machine — no API key, no network required.
Use Ollama when:
→ Developing AI features without incurring API costs
→ Data privacy: patient data must not leave the local machine during development
→ Testing AI features without requiring cloud credentials
→ Evaluating model behaviour with local data before committing to a cloud API
Ollama limitations:
→ Requires a decent GPU or CPU (llama3.2:3b runs on CPU; larger models need GPU)
→ Response quality is lower than GPT-4o for complex reasoning tasks
→ Not suitable for production clinical systems that require high accuracy
→ Cold start: first generation slow if model isn't cached
Models commonly used with Ollama:
llama3.2:3b → Small, fast, good for structured outputs, code generation
llama3.1:8b → Better reasoning, slower on CPU
mistral:7b → Good for summarisation
phi3:mini → Very small, surprisingly capable for simple tasksOllama Setup
# Install Ollama (Windows/macOS/Linux)
# Download from: https://ollama.com
# Pull a model:
ollama pull llama3.2
# Run the Ollama server (starts automatically on install):
ollama serve
# API available at: http://localhost:11434
# List available models:
ollama list
# Test the API directly:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "What is the therapeutic INR range for Warfarin?",
"stream": false
}'Ollama in Docker Compose (Development)
# docker-compose.dev.yml
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama-data:/root/.ollama # persist downloaded models
ports:
- "11434:11434"
# GPU support (if available):
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
# Pull the model on startup:
ollama-init:
image: ollama/ollama:latest
depends_on:
- ollama
entrypoint: /bin/sh -c "sleep 5 && ollama pull llama3.2 && ollama pull phi3:mini"
environment:
- OLLAMA_HOST=http://ollama:11434
volumes:
ollama-data:Semantic Kernel with Ollama
// NuGet: Microsoft.SemanticKernel.Connectors.Ollama (preview)
var kernel = Kernel.CreateBuilder()
.AddOllamaChatCompletion(
modelId: "llama3.2",
endpoint: new Uri("http://localhost:11434"))
.Build();
// For text embeddings (for RAG/vector search):
var kernel = Kernel.CreateBuilder()
.AddOllamaTextEmbeddingGeneration(
modelId: "nomic-embed-text",
endpoint: new Uri("http://localhost:11434"))
.Build();
// Environment-aware kernel: Ollama locally, Azure OpenAI in production
private static IKernelBuilder ConfigureAI(
IConfiguration config, IWebHostEnvironment env)
{
var kernelBuilder = Kernel.CreateBuilder();
if (env.IsDevelopment())
{
kernelBuilder.AddOllamaChatCompletion(
modelId: config["AI:OllamaModel"] ?? "llama3.2",
endpoint: new Uri(config["AI:OllamaEndpoint"] ?? "http://localhost:11434"));
}
else
{
kernelBuilder.AddAzureOpenAIChatCompletion(
deploymentName: config["AzureOpenAI:DeploymentName"]!,
endpoint: config["AzureOpenAI:Endpoint"]!,
apiKey: config["AzureOpenAI:ApiKey"]!);
}
return kernelBuilder;
}Prompt Engineering for Smaller Models
// Smaller local models need more explicit, structured prompts
// They don't handle vague or complex multi-step prompts as well as GPT-4o
// LESS EFFECTIVE with small models:
var prompt = "Tell me about this patient's prescription.";
// MORE EFFECTIVE with small models — explicit structure:
var prompt = """
Task: Summarise the prescription below.
Format: Respond with exactly 2 sentences.
Include: medication name, current dose, and any warnings.
Do not include any other information.
Prescription:
Medication: Warfarin
Dose: 5mg daily
Status: Approved
INR at approval: 2.3
Last INR: 2.1 (2 days ago)
""";
// For structured JSON output — use Semantic Kernel's structured output:
var executionSettings = new PromptExecutionSettings
{
ExtensionData = new Dictionary<string, object>
{
["format"] = "json" // request JSON output from the model
}
};
// Or use a schema prompt:
var jsonPrompt = """
Extract the medication information from the text below.
Respond ONLY with valid JSON matching this schema:
{ "medication": string, "dose_mg": number, "frequency": string }
Text: "The patient is on Warfarin 5mg taken once daily."
""";Comparing Local vs Cloud for Clinical Tasks
Task: summarise a prescription for a handover note
Llama3.2 (local): adequate — factual summary, consistent formatting
GPT-4o (Azure): excellent — nuanced, better clinical language
Decision: local for development, cloud for production
Task: detect potentially dangerous drug interactions from a free-text note
Llama3.2 (local): poor — misses subtle interactions, hallucinates references
GPT-4o (Azure): much better — still requires clinical validation
Decision: never rely on LLM alone for drug interaction checking
use a validated drug interaction database (BNF API, Lexicomp)
Task: extract structured data from a discharge letter PDF
Llama3.2 (local): workable for simple extractions
GPT-4o with vision: much better for complex PDF layouts with tables
Decision: local for prototyping, cloud for production extraction
Task: code generation and summarisation in development
Llama3.2 (local): excellent — fast, free, no data leaves the machine
Decision: Ollama only — no need for cloud API for development tasksProduction issue I've seen: A team used Ollama with llama3.1:8b for a clinical document summarisation feature in production (not development). The model ran on CPU — each summarisation took 45-90 seconds. Nurses stopped using the feature after 2 days. The model also occasionally produced "hallucinated" medication names that were close to real names but wrong (e.g., "Warfarinol" instead of "Warfarin"). In production, use cloud APIs for quality and latency guarantees. Reserve Ollama for local development where a 45-second response is acceptable and data privacy during development is the requirement.
Key Takeaway
Ollama runs LLMs locally — useful for development without API keys and for keeping patient data on the local machine. Use it with Semantic Kernel via
AddOllamaChatCompletion. Structure prompts explicitly for smaller models — they perform better with clear format instructions. Switch to Azure OpenAI in production: better quality, sub-second response times, SLA guarantees. Never use local models for safety-critical clinical decisions — neither local nor cloud models replace validated clinical databases for drug interaction checking or dosage decisions.
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.